Robotics Research 3031255542, 9783031255540

The proceedings of the 2022 edition of the International Symposium of Robotics Research (ISRR) offer a series of peer-re

827 91 89MB

English Pages 578 [579] Year 2023

Table of contents :
Foreword
Preface
Organization
Contents
Robot Learning
It's Just Semantics: How to Get Robots to Understand the World the Way We Do
1 Introduction
2 Contextual Scene Understanding
2.1 Spatial Predicates: Object-to-Object Relationships
2.2 Affordances: Object-to-Agent and Scene-to-Agent Relationships
2.3 Challenges and Benefits of Embodiment
3 Summary and Outlook
References
Learning Agile, Vision-Based Drone Flight: From Simulation to Reality
1 Introduction
2 High-Speed Flight in the Wild
3 Acrobatic Flight
4 Autonomous Drone Racing
5 Future Work
5.1 Generalization
5.2 Continual Learning
5.3 Autonomous Drone Racing
References
Continual SLAM: Beyond Lifelong Simultaneous Localization and Mapping Through Continual Learning
1 Introduction
2 Related Work
3 Continual SLAM
4 Technical Approach
5 Experimental Evaluation
5.1 Evaluation of Pose Accuracy of CL-SLAM
5.2 Evaluation of Continual SLAM
6 Conclusion
References
Efficiently Learning Single-Arm Fling Motions to Smooth Garments*-12pt
1 Introduction
2 Related Work
3 Problem Statement
4 Motion Primitives and Pipeline
5 Self-Supervised Learning Pipeline for Flinging
5.1 Learning Procedure of the Fling Action
5.2 Adapting to New Garments
5.3 Execution-Time Stopping Rules
6 Experiments
6.1 Comparison of Learning Methods
6.2 Comparison of Different Trajectory Parameterization Methods
6.3 Results for Different Grasp Points
6.4 Training and Testing Across Different Garments
6.5 Stopping Time During Execution
7 Conclusion
References
Learning Long-Horizon Robot Exploration Strategies for Multi-object Search in Continuous Action Spaces
1 Introduction
2 Related Work
3 Learning Long-Horizon Exploration
3.1 Problem Statement
3.2 Technical Approach
4 Experimental Evaluations
4.1 Experimental Setup
4.2 Simulation Experiments
4.3 Real World Experiments
5 Conclusion
References
Visual Foresight with a Local Dynamics Model*-12pt
1 Introduction
2 Related Work
3 Problem Statement
4 Method
4.1 Structuring the Transition Model
4.2 Local Dynamics Model
4.3 Policy Learning
4.4 Sampling Diverse Actions
5 Experiments
5.1 Accuracy of the Local Dynamics Model
5.2 Policy Learning
5.3 Generalization
6 Limitations and Future Work
7 Conclusion
References
Robot Vision
Monocular Camera and Single-Beam Sonar-Based Underwater Collision-Free Navigation with Domain Randomization*-12pt
1 Introduction
2 Related Work
3 Approach
3.1 Multi-modal Deep Reinforcement Learning Navigation
3.2 Image Depth Prediction Network
3.3 Transferable Model
4 Experimental Results
4.1 Training Experimental Settings
4.2 Performance Comparison with Different Sensor Configurations
4.3 Ablation Study with Transferability Tests
4.4 Performance Demonstration in Real-World Environment
4.5 Action Prediction from Static Underwater Images
5 Conclusion and Future Work
References
Nonmyopic Distilled Data Association Belief Space Planning Under Budget Constraints-12pt
1 Introduction
2 Related Work
3 Background and Notations
3.1 Inference
3.2 DA-BSP
4 Methodology
4.1 Constructing the Belief Tree skeleton
4.2 Nonmyopic Distilled Data Association BSP
4.3 No Budget Constraints in Inference
4.4 Hard Budget Constraints in Inference
4.5 The Cost Function
5 Results
6 Conclusions
References
SCIM: Simultaneous Clustering, Inference, and Mapping for Open-World Semantic Scene Understanding*-12pt
1 Introduction
2 Related Work
3 Method
3.1 Preliminaries: Scene Understanding as a Clustering Problem
3.2 Identifying Novel Categories
4 Experimental Evaluation
4.1 Method Implementation
4.2 Adaptation of Baselines
4.3 Evaluation Metrics
4.4 Results
4.5 Design Choice Verification
5 Discussion and Outlook
References
6N-DoF Pose Tracking for Tensegrity Robots
1 Introduction
2 Related Work
3 Problem Formulation
4 Method and Implementation
4.1 Pose Initialization
4.2 Iterative Pose Tracking
4.3 Physical Constraints
4.4 Adaptive Weights
4.5 Noise Filtering
5 Experiments
5.1 Hardware Setup
5.2 Data Collection
5.3 Evaluation Metric
5.4 Baselines and Ablation Studies
5.5 Quantitative Results
5.6 Qualitative Results
5.7 Running Time
6 Limitations
7 Conclusion
References
Scale-Invariant Fast Functional Registration*-12pt
1 Introduction
2 Related Works
3 Notation
4 Functional Point Cloud Registration
4.1 The Delta-Distance Between Unordered Sets
4.2 Continuous Relaxation of the Delta-Distance
4.3 Functional Point Cloud Registration
5 Experiments and Application
5.1 Implementation Details
5.2 Rationale for Test Algorithms
5.3 Rationale for Experiment Design
5.4 Robustness Against Noise
5.5 Robustness Against Partial Overlap and Varying Densities
5.6 Time Efficiency
5.7 Sensitivity to Initialization
5.8 Real World Application
6 Conclusions
References
Towards Mapping of Underwater Structures by a Team of Autonomous Underwater Vehicles*-12pt
1 Introduction
2 Related Work
3 Proximal-Distal Mapping Framework
3.1 Robust State Estimation
3.2 Photorealistic Reconstruction
3.3 AUV Navigation
4 Conclusions, Lessons Learned, and Future Directions
References
Grasping and Manipulation
Contact-Implicit Planning and Control for Non-prehensile Manipulation Using State-Triggered Constraints*-12pt
1 Introduction
1.1 Contributions
1.2 Related Work
2 Background
2.1 Contact Model
2.2 Contact-Implicit Trajectory Optimization
2.3 Quasi-static Planar Pushing
3 Approach
3.1 State-Triggered Constraints
3.2 Automatic Contact Constraint Activation
3.3 A Reformulation of Coulomb Friction
3.4 Quasi-static Pushing Planning and Execution
4 Experiments
4.1 Dynamic Pushing
4.2 Quasi-static Planar Pushing
5 Conclusion
References
Mechanical Search on Shelves with Efficient Stacking and Destacking of Objects
1 Introduction
2 Related Work
2.1 Mechanical Search
2.2 Stacking and Destacking
2.3 Monte Carlo Tree Search
3 Problem Statement
3.1 Overview
3.2 Problem Setup
3.3 Objective
4 Methods
4.1 Occupancy Distribution
4.2 Stacked Environments
4.3 Action Generation
4.4 Distribution Area Reduction for Stacked Shelves (DARSS)
4.5 Monte Carlo Tree Search for Stacked Shelves (MCTSSS)
5 Experiments
5.1 Baseline and Oracle Policies
5.2 Simulated Experiments
5.3 DARSS Failure Mode
5.4 Ablation Experiments
5.5 Sensitivity Experiments
5.6 Physical Experiments
6 Conclusion
References
Multi-object Grasping in the Plane
1 Introduction
2 Related Work
3 Problem Statement
4 Multi-object Grasps
5 Multi-object Push Grasps
5.1 Two Necessary Conditions
5.2 Multi-object Diameter Function
5.3 Intersection Area
6 Picking System
6.1 Picking Algorithm
6.2 Frictionless Multi-Object Grasp Planner (GP)
7 Experiments
7.1 Simulation and Physical Experiments
7.2 Evaluation of Necessary Conditions in Simulation
7.3 Grasp Planning Experiments in Simulation
7.4 Physical Experiments
8 Discussion and Future Work
References
Parameter Estimation for Deformable Objects in Robotic Manipulation Tasks
1 Introduction
2 Background and Related Works
2.1 Simulation
2.2 Closing the Sim-to-Real Gap
3 Methods
3.1 Modeling
3.2 Nonlinear Programming for Parameter Estimation
4 Results
5 Conclusion and Ongoing Work
References
Place-and-Pick-Based Re-grasping Using Unstable Placement*-12pt
1 Introduction
2 Related Work
2.1 Place-and-Pick-Based Re-grasping
2.2 Re-grasping with Support from the Environment
3 Methodology
3.1 Intermediate Placement for Re-grasping
3.2 Stable and Semi-stable Placement Searching
3.3 Gripper Sliding Planning
3.4 Re-grasp Graph Construction
3.5 Re-grasp Action Motion Planning
4 Experiments
4.1 Re-grasping in Simulation
4.2 Re-grasping in Real World
5 Conclusion
References
Safe, Occlusion-Aware Manipulation for Online Object Reconstruction in Confined Spaces*-12pt
1 Introduction
2 Parameters of Occlusion-Aware Manipulation
3 Related Work
4 Problem Formulation
5 Algorithms and Analysis
5.1 Occlusion Dependency Graph (ODG)
5.2 A Probabilistically Complete Algorithm
5.3 A Monotone Algorithm
6 System Design
7 Experiments
7.1 Experimental Setup
7.2 Results and Analysis
8 Conclusion and Future Work
References
Systems and Design
Cooperation of Assistive Robots to Improve Productivity in the Nursing Care Field*-12pt
1 Introduction
2 CARE -Cooperation of Ai-Robot Enablers-
3 Verification Experiment for CARE in the Living Lab
4 Nursing Care Guidance Using AR Glasses
5 Conclusions
References
Adaptive Radiation Survey Using an Autonomous Robot Executing LiDAR Scans in the Large Hadron Collider*-12pt
1 Introduction
2 Methodology
3 Robotic System
4 Data Acquisition and Processing
5 Online Trajectory Planning and Control
6 Conclusion
References
Reference-Free Learning Bipedal Motor Skills via Assistive Force Curricula
1 Introduction
2 Related Work
2.1 RL-Based Legged Controller
2.2 Assistive Control
2.3 Learning with Recovery
3 Problem Formulation
3.1 Notation
3.2 Problem Statement
4 Learning with Assistive Force
4.1 Assistance Placement
4.2 Assistance Quantity
4.3 Assistance Embedment
4.4 Curriculum Stages
4.5 Learning Settings
5 Experiments
5.1 Example 1: Locomotion
5.2 Example 2: Dynamic Skills
6 Conclusion and Discussion
References
Ball-and-Socket Joint Pose Estimation Using Magnetic Field
1 Introduction
2 Hardware
3 Methodologies
3.1 Problem Statement
3.2 Recurrent Neural Network
3.3 Deep Variational Bayes Filters
4 Experiments
4.1 Experiment Setup
4.2 Experiment Results
5 Conclusions
References
BulletArm: An Open-Source Robotic Manipulation Benchmark and Learning Framework
1 Introduction
2 Related Work
3 Architecture
3.1 Design Philosophy
3.2 Environment
3.3 Expert Demonstrations
4 Environments
4.1 Open-Loop Environments
4.2 Close-Loop Environments
5 Benchmark
5.1 Open-Loop 3D Benchmark
5.2 Close-Loop 4D Benchmark
6 Conclusions
References
Optimization-Based Online Flow Fields Estimation for AUVs Navigationpg*-12pt
1 Introduction
2 Related Works
3 Problem Formulation
4 Flow Field Estimation
4.1 Eddy-Based Constraint
4.2 Incompressibility-Based Constraint
5 Experiment
5.1 Flow Field Estimation
5.2 AUV Navigation
6 Conclusions
References
Flying Hydraulically Amplified Electrostatic Gripper System for Aerial Object Manipulationpg*-12pt
1 Introduction
1.1 Objectives
1.2 Related Work
2 Methods
2.1 HASEL Challenges
2.2 Actuator Manufacturing
2.3 RAPTOR Quadcopter
3 Inverse vs. Classical Gripper Approach
4 Actuator Types and Configurations
4.1 Pouch Width Test (PWT)
4.2 Pouch Length Test (PLT)
4.3 Scorpion Concept Test
4.4 Complete Finger Configuration
5 Drone Integration
5.1 Flight Experiment
6 Conclusion and Future Work
References
Control
Hybrid Quadratic Programming - Pullback Bundle Dynamical Systems Control
1 Introduction
2 Background
3 Method
4 Preliminary Results
5 Conclusion
References
Riemannian Geometry as a Unifying Theory for Robot Motion Learning and Control
1 Motivation
2 Geodesic Synergies: Basics and Proof of Concept
2.1 Riemannian Geometry of Mechanical Systems
2.2 Robot Motion Generation with Geodesic Synergies
3 Vision and Challenges
References
Ensured Continuous Surveillance Despite Sensor Transition Using Control Barrier Functions*-12pt
1 Introduction
2 Problem Formulation
2.1 Control Objectives
3 Selection of ux and uy to Solve Control Objective 1
4 Sufficient Conditions to Solve Control Objective 2
5 Control Laws to Solve Control Objective 2
5.1 Bounds on the Time Derivatives, and a Selection for u( t )
5.2 u ( t ) and u ( t )
6 Simulations
7 Conclusion
References
Nonlinear Stochastic Trajectory Optimization for Centroidal Momentum Motion Generation of Legged Robots*-12pt
1 Introduction
2 Background
2.1 Notation
2.2 Robot Dynamics
2.3 Centroidal Momentum Trajectory Optimization
3 Stochastic Optimal Control for Centroidal Momentum Trajectory Optimization
3.1 Individual Chance Constraints Reformulation
3.2 Deterministic Reformulation of Individual Chance Constraints
3.3 Deterministic Reformulation of SOCP
3.4 SCP with L1 Trust Region Penalty Cost
4 Simulations Results
4.1 Simulations Setup
5 Discussion and Conclusions
References
Reactive Anticipatory Robot Skills with Memory*-12pt
1 Introduction
2 Related Work
3 Background
4 Methods
4.1 Extended System Level Synthesis (eSLS)
4.2 Iterative System Level Synthesis (iSLS)
4.3 Time Correlations Between States
4.4 Adaptation to New Reference States
5 Experiments and Results
5.1 Simulated Task
5.2 Pick-and-Place Task
6 Conclusion
References
ROSE: Robust State Estimation via Online Covariance Adaption
1 Introduction
2 Related Work
3 Problem Formulation
4 Proposed Method
4.1 Method Overview
4.2 Preliminary FGO
4.3 Backward Kalman Smoothing
4.4 Online NCM Estimation
4.5 Main FGO
5 Experiments
5.1 Setup
5.2 Simulation Results
5.3 Hardware Results
6 Conclusions and Future Work
References
What Can Algebraic Topology and Differential Geometry Teach Us About Intrinsic Dynamics and Global Behavior of Robots?*-12pt
1 Introduction
1.1 A Very Short Primer on Robot Dynamics
1.2 Mauptertuis' Principle of Least Action
1.3 Geodesics on Riemannian Manifolds
1.4 A Purely Geometric Perspective on Lagrangian Mechanics
1.5 Algorithms: Shooting Versus Extremizing Arc Length
2 String Relaxation
3 Topological Insights into Periodic Orbits
4 A Case Study: Periodic Orbits of the Double Pendulum
4.1 Toroidal Orbits
4.2 Disk Orbits
4.3 Brake Orbits aka Nonlinear Normal Modes
4.4 Classification and Overview on Periodic Orbits
5 Conclusion
References
Planning
Efficient Task Planning Using Abstract Skills and Dynamic Road Map Matching
1 Introduction
1.1 Background
1.2 Contribution
1.3 Related Work
2 Preliminaries
2.1 Task Planning
2.2 Abstract Skills and Abstraction Keys
3 Practical Demonstration: Grid World Planning Domain
4 Planning with Skills as a Constraint Satisfaction Problem
4.1 Skill Matching
4.2 Action Recovery
4.3 The Joint Problem: Skills as Dynamic Constraints
5 Solving the Problem: A Hierarchical Planning Algorithm
5.1 Abstract-Skill Matching for Reach-Avoid Tasks
5.2 Action Recovery Using Dynamic Task Decomposition
6 Experimental Results
7 Conclusion
References
CALIPSO: A Differentiable Solver for Trajectory Optimization with Conic and Complementarity Constraints*-12pt
1 Introduction
2 Related Work
3 Background
3.1 Trajectory Optimization
3.2 Complementarity Constraints
3.3 Linear Independence Constraint Qualification
3.4 Augmented Lagrangian Methods
3.5 Interior-Point Methods
3.6 Differentiable Optimization
4 CALIPSO
4.1 Search Direction
4.2 Line Search
4.3 Outer Updates
4.4 Solution Derivatives
4.5 Implementation
5 Examples
5.1 Contact-Implicit Trajectory Optimization
5.2 State-Triggered Constraints
5.3 Model Predictive Control Auto-Tuning
6 Conclusion
References
Clutter-Resilient Autonomous Mobile Robot Navigation with Computationally Efficient Free-Space Features*-12pt
1 Introduction
2 Problem Formulation
3 Proposed Solution
3.1 Assumptions and Knowledge
3.2 Designing Motion Tubes
3.3 Sweeping to Detect Collision
3.4 Sampling
3.5 Template Samples to Sensor Space
4 Experimental Validation
5 Conclusions and Discussion
References
Robust-RRT: Probabilistically-Complete Motion Planning for Uncertain Nonlinear Systems
1 Introduction
2 Related Work
3 Problem Formulation
4 Robust-RRT
5 Robust-RRT Is Probabilistically Complete (PC)
6 Insights and Practical Considerations
7 Experiments
8 Conclusion
References
Effort Informed Roadmaps (EIRM*): Efficient Asymptotically Optimal Multiquery Planning by Actively Reusing Validation Effortpg*-12pt
1 Introduction
2 Related Work
3 Effort Informed Roadmaps (EIRM*)
3.1 Notation
3.2 Approximation
3.3 Reverse Search
3.4 Forward Search
4 Experiments and Results
4.1 Abstract Scenarios
4.2 Robotic Scenario
4.3 Initial Solutions
4.4 Construction Scenario
5 Discussion
5.1 Initial Solution Time
5.2 Objective Value
6 Conclusion
References
Author Index

Recommend Papers

Robot-Age Knowledge Changeover (Robotics Research and Technology) 1606929054, 9781606929056

The current division between industrialised and developing countries/societies is based on relatively recent paradigms o

586 98 4MB Read more

Robotics

The European Commission's support for the production of this publication does not constitute an endorsement of the

505 6 15MB Read more

Introductory Robotics

440 40 14KB Read more

PIC Robotics

447 75 5MB Read more

Advances in Robotics Research: Theory, Implementation, Application [1 ed.] 9783642012129, 3642012124

The German Workshop on Robotics is a convention of roboticists from academia and industry working on mathematical and al

262 70 1KB Read more

STEM Education with Robotics: Lessons from Research and Practice 1032367563, 9781032367569

This book offers a synthesis of research, curriculum examples, pedagogy models, and classroom recommendations for the ef

228 95 10MB Read more

ROBOTICS ENGINEERING: Basic to Advanced Concepts of Robotics Engineering

542 91 28MB Read more

Handbook of Research on the Internet of Things Applications in Robotics and Automation 1522595740, 9781522595748

With near-universal internet access and ever-advancing electronic devices, the ability to facilitate interactions betwee

488 62 20MB Read more

Robotics Research: The 13th International Symposium ISRR [1 ed.] 3642147429, 9783642147425

The International Symposium of Robotics Research (ISRR) continues to be the premiere meeting of the International Founda

410 22 16MB Read more

Advances in Robotics Research: Theory, Implementation, Application [1 ed.] 9781402052620, 1402052626

The German Workshop on Robotics is a convention of roboticists from academia and industry working on mathematical and al

261 98 6MB Read more

Robotics Research
3031255542, 9783031255540

Author / Uploaded
Aude Billard
Tamim Asfour
Oussama Khatib

0 0 0
Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up

File loading please wait...

Citation preview

Springer Proceedings in Advanced Robotics 27 Series Editors: Bruno Siciliano · Oussama Khatib

Aude Billard Tamim Asfour Oussama Khatib Editors

Robotics Research

Springer Proceedings in Advanced Robotics

27

Series Editors Bruno Siciliano, Dipartimento di Ingegneria Elettrica e Tecnologie dell’Informazione, Università degli Studi di Napoli Federico II, Napoli, Napoli, Italy Oussama Khatib, Robotics Laboratory Department of Computer Science, Stanford University, Stanford, CA, USA

Advisory Editors Gianluca Antonelli, Department of Electrical and Information Engineering, University of Cassino and Southern Lazio, Cassino, Italy Dieter Fox, Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA Kensuke Harada, Engineering Science, Osaka University Engineering Science, Toyonaka, Japan M. Ani Hsieh, GRASP Laboratory, University of Pennsylvania, Philadelphia, PA, USA Torsten Kröger, Karlsruhe Institute of Technology, Karlsruhe, Germany Dana Kulic, University of Waterloo, Waterloo, ON, Canada Jaeheung Park, Department of Transdisciplinary Studies, Seoul National University, Suwon, Korea (Republic of)

The Springer Proceedings in Advanced Robotics (SPAR) publishes new developments and advances in the fields of robotics research, rapidly and informally but with a high quality. The intent is to cover all the technical contents, applications, and multidisciplinary aspects of robotics, embedded in the fields of Mechanical Engineering, Computer Science, Electrical Engineering, Mechatronics, Control, and Life Sciences, as well as the methodologies behind them. The publications within the “Springer Proceedings in Advanced Robotics” are primarily proceedings and post-proceedings of important conferences, symposia and congresses. They cover significant recent developments in the field, both of a foundational and applicable character. Also considered for publication are edited monographs, contributed volumes and lecture notes of exceptionally high quality and interest. An important characteristic feature of the series is the short publication time and world-wide distribution. This permits a rapid and broad dissemination of research results. Indexed by SCOPUS, SCIMAGO, WTI Frankfurt eG, zbMATH. All books published in the series are submitted for consideration in Web of Science.

Aude Billard · Tamim Asfour · Oussama Khatib Editors

Robotics Research

Editors Aude Billard EPFL STI SMT-GE Ecole Polytechnique Federale de Lausanne Lausanne, Vaud, Switzerland

Tamim Asfour Institute for Anthropomatics and Robotic, KIT Karlsruhe, Baden-Württemberg, Germany

Oussama Khatib Artificial Intelligence Laboratory, Department of Computer Science Stanford University Stanford, CA, USA

ISSN 2511-1256 ISSN 2511-1264 (electronic) Springer Proceedings in Advanced Robotics ISBN 978-3-031-25554-0 ISBN 978-3-031-25555-7 (eBook) https://doi.org/10.1007/978-3-031-25555-7 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Foreword

At the dawn of the century’s third decade, robotics is reaching an elevated level of maturity and continues to benefit from the advances and innovations in its enabling technologies. These all are contributing to an unprecedented effort to bringing robots to human environment in hospitals and homes, factories and schools; in the field for robots fighting fires, making goods and products, picking fruits and watering the farmland, saving time and lives. Robots today hold the promise for making a considerable impact in a wide range of real-world applications from industrial manufacturing to healthcare, transportation, and exploration of the deep space and sea. Tomorrow, robots will become pervasive and touch upon many aspects of modern life. The Springer Tracts in Advanced Robotics (STAR) was launched in 2002 with the goal of bringing to the research community the latest advances in the robotics field based on their significance and quality. During the latest 15 years, the STAR series has featured publication of both monographs and edited collections. Among the latter, the proceedings of thematic symposia devoted to excellence in robotics research, such as ISRR, ISER, FSR, and WAFR, have been regularly included in STAR. The expansion of our field as well as the emergence of new research areas has motivated us to enlarge the pool of proceedings in the STAR series in the past few years. This has ultimately led to launching in 2016 a sister series in parallel to STAR. The Springer Proceedings in Advanced Robotics (SPAR) is dedicated to the timely dissemination of the latest research results presented in selected symposia and workshops. This volume of the SPAR series brings the proceedings of the eighteenth edition of the International Symposium of Robotics Research (ISRR). The event took place in Geneva, Switzerland, from September 25 to 30, 2022. The volume edited by Aude Billard, Tamim Asfour, and Oussama Khatib is a collection of 36 articles addressing a broad range of topics in robotics ranging from design and systems to control, from vision to learning, from planning to grasping and manipulation. The content of these contributions provides a wide coverage of the current state of robotics research: the advances and challenges in its theoretical foundation and technology basis, and the developments in its traditional and new emerging areas of applications. The diversity, novelty, and span of the work unfolding in these areas reveal the field’s increased pace of development and expanded scope. From its beautiful venue to its excellent program, the eighteenth edition of ISRR culminates with this important reference on the current developments and new directions in the field of robotics—a true tribute to its contributors and organizers! December 2022

Bruno Siciliano Oussama Khatib SPAR Editors

Preface

This volume contains the papers that were presented at the 18th International Symposium of Robotics Research (ISRR). This biennial meeting is sponsored and organized by the International Foundation of Robotics Research (IFRR). The ISRR promotes the development and dissemination of groundbreaking research and technological innovation in robotics useful to society by providing a lively, intimate, forward-looking forum for discussion and debate about the current status and future trends of robotics with great emphasis on its potential role to benefit humankind. This particular meeting took place in Geneva during September 25–30, 2022. As one of the pioneering symposia in robotics, ISRR has established some of the most fundamental and lasting contributions in the field bringing together top experts in robotics from overall the world since 1983. The ISRR 2022 program comprises a combination of distinguished talks, oral and interactive presentations as well as an attractive social program. A total of 22 distinguished talks were given by leading roboticists presenting blue sky ideas and their views on the field including past experience, lessons learned, success and failure stories of robotics, views on the future, latest developments in robotics, and impact of robotics on society. In addition, the program of includes 36 presentations, each presented in oral and interactive sessions, giving time for discussions between the participants. All presented papers were selected in a peer-review process of submitted papers in response to an open call for papers. Scientific interaction is the keyword of ISRR, and the social program was designed to support the rich discussions among participants in the most suitable conditions, culminating in the excursion to CERN on September 26 of this inspiring gathering. In addition, a workshop on robotics was organized at the Ecole Polytechnique Fédérale de Lausanne (EPFL) on September 30 with several talks given by ISSR 2022 invited speakers as well as students from different labs at EPFL working on robotics. We are very grateful to all the contributing authors, distinguished speakers, the program committee, reviewers as well as the organization supporting teams in Lausanne and Karlsruhe for their contributions and support to ISRR 2022. Finally, the meeting would not have been possible without the support of the Joanna Erfani and Loïc Niderhauser from EPFL. We acknowledge and thank them all. Aude Billard Tamim Asfour Oussama Khatib

Organization

Program Committee Arash Ajoudani Tamim Asfour Kostas E. Bekris Dmitry Berenson Aude Billard Jeannette Bohg Mehmet R. Dogar Nadia Figueroa Animesh Garg Sami Haddadin Yasuhisa Hirata Marco Hutter Oussama Khatib Mirko Kovac Danica Kragic Hanna Kurniawati Dinesh Manocha Giorgio Metta Katja Mombaur Todd Murphey Yoshihiko Nakamura Taskin Padir Jaeheung Park Marco Pavone Nancy S. Pollard Nicholas Roy Bruno Siciliano Gaurav Sukhatme Rudolph Triebel Nikos Tsagarakis Abhinav Valada Eiichi Yoshida

Istituto Italiano di Tecnologia Karlsruhe Institute of Technology Rutgers University University of Michigan Ecole Polytechnique Fédérale de Lausanne Stanford University University of Leeds University of Pennsylvania University of Toronto TU Munich Tohoku University ETH Zurich Stanford University Imperial College London KTH – Royal Institute of Technology Australian National University University of Maryland Istituto Italiano di Tecnologia University of Waterloo Northwestern University University of Tokyo Northeastern University Seoul National University Stanford University Carnegie Mellon University MIT University of Naples University of Southern California German Aerospace Center – DLR Istituto Italiano di Tecnologia University of Freiburg National Inst. of AIST

Contents

Robot Learning It’s Just Semantics: How to Get Robots to Understand the World the Way We Do . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jen Jen Chung, Julian Förster, Paula Wulkop, Lionel Ott, Nicholas Lawrance, and Roland Siegwart Learning Agile, Vision-Based Drone Flight: From Simulation to Reality . . . . . . . Davide Scaramuzza and Elia Kaufmann Continual SLAM: Beyond Lifelong Simultaneous Localization and Mapping Through Continual Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Niclas Vödisch, Daniele Cattaneo, Wolfram Burgard, and Abhinav Valada Efficiently Learning Single-Arm Fling Motions to Smooth Garments . . . . . . . . . Lawrence Yunliang Chen, Huang Huang, Ellen Novoseller, Daniel Seita, Jeffrey Ichnowski, Michael Laskey, Richard Cheng, Thomas Kollar, and Ken Goldberg Learning Long-Horizon Robot Exploration Strategies for Multi-object Search in Continuous Action Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fabian Schmalstieg, Daniel Honerkamp, Tim Welschehold, and Abhinav Valada Visual Foresight with a Local Dynamics Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . Colin Kohler and Robert Platt

3

11

19

36

52

67

Robot Vision Monocular Camera and Single-Beam Sonar-Based Underwater Collision-Free Navigation with Domain Randomization . . . . . . . . . . . . . . . . . . . . . Pengzhi Yang, Haowen Liu, Monika Roznere, and Alberto Quattrini Li

85

Nonmyopic Distilled Data Association Belief Space Planning Under Budget Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Moshe Shienman and Vadim Indelman

xii

Contents

SCIM: Simultaneous Clustering, Inference, and Mapping for Open-World Semantic Scene Understanding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 Hermann Blum, Marcus G. Müller, Abel Gawel, Roland Siegwart, and Cesar Cadena 6N-DoF Pose Tracking for Tensegrity Robots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 Shiyang Lu, William R. Johnson III, Kun Wang, Xiaonan Huang, Joran Booth, Rebecca Kramer-Bottiglio, and Kostas Bekris Scale-Invariant Fast Functional Registration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153 Muchen Sun, Allison Pinosky, Ian Abraham, and Todd Murphey Towards Mapping of Underwater Structures by a Team of Autonomous Underwater Vehicles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170 Marios Xanthidis, Bharat Joshi, Monika Roznere, Weihan Wang, Nathaniel Burgdorfer, Alberto Quattrini Li, Philippos Mordohai, Srihari Nelakuditi, and Ioannis Rekleitis Grasping and Manipulation Contact-Implicit Planning and Control for Non-prehensile Manipulation Using State-Triggered Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 Maozhen Wang, Aykut Özgün Önol, Philip Long, and Ta¸skın Padır Mechanical Search on Shelves with Efficient Stacking and Destacking of Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 Huang Huang, Letian Fu, Michael Danielczuk, Chung Min Kim, Zachary Tam, Jeffrey Ichnowski, Anelia Angelova, Brian Ichter, and Ken Goldberg Multi-object Grasping in the Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222 Wisdom C. Agboh, Jeffrey Ichnowski, Ken Goldberg, and Mehmet R. Dogar Parameter Estimation for Deformable Objects in Robotic Manipulation Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 David Millard, James A. Preiss, Jernej Barbiˇc, and Gaurav S. Sukhatme Place-and-Pick-Based Re-grasping Using Unstable Placement . . . . . . . . . . . . . . . 252 Jiaming Hu, Akanimoh Adeleye, and Henrik I. Christensen Safe, Occlusion-Aware Manipulation for Online Object Reconstruction in Confined Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268 Yinglong Miao, Rui Wang, and Kostas Bekris

Contents

xiii

Systems and Design Cooperation of Assistive Robots to Improve Productivity in the Nursing Care Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 Yasuhisa Hirata, Jose Victorio Salazar Luces, Ankit A. Ravankar, and Seyed Amir Tafrishi Adaptive Radiation Survey Using an Autonomous Robot Executing LiDAR Scans in the Large Hadron Collider . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295 Hannes Gamper, David Forkel, Alejandro Díaz Rosales, Jorge Playán Garai, Carlos Veiga Almagro, Luca Rosario Buonocore, Eloise Matheson, and Mario Di Castro Reference-Free Learning Bipedal Motor Skills via Assistive Force Curricula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304 Fan Shi, Yuta Kojio, Tasuku Makabe, Tomoki Anzai, Kunio Kojima, Kei Okada, and Masayuki Inaba Ball-and-Socket Joint Pose Estimation Using Magnetic Field . . . . . . . . . . . . . . . . 321 Tai Hoang, Alona Kharchenko, Simon Trendel, and Rafael Hostettler BulletArm: An Open-Source Robotic Manipulation Benchmark and Learning Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335 Dian Wang, Colin Kohler, Xupeng Zhu, Mingxi Jia, and Robert Platt Optimization-Based Online Flow Fields Estimation for AUVs Navigation . . . . . 351 Hao Xu, Yupu Lu, and Jia Pan Flying Hydraulically Amplified Electrostatic Gripper System for Aerial Object Manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368 Dario Tscholl, Stephan-Daniel Gravert, Aurel X. Appius, and Robert K. Katzschmann Control Hybrid Quadratic Programming - Pullback Bundle Dynamical Systems Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387 Bernardo Fichera and Aude Billard Riemannian Geometry as a Unifying Theory for Robot Motion Learning and Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395 Noémie Jaquier and Tamim Asfour

xiv

Contents

Ensured Continuous Surveillance Despite Sensor Transition Using Control Barrier Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404 Luis Guerrero-Bonilla, Carlos Nieto-Granda, and Magnus Egerstedt Nonlinear Stochastic Trajectory Optimization for Centroidal Momentum Motion Generation of Legged Robots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420 Ahmad Gazar, Majid Khadiv, Sébastien Kleff, Andrea Del Prete, and Ludovic Righetti Reactive Anticipatory Robot Skills with Memory . . . . . . . . . . . . . . . . . . . . . . . . . . 436 Hakan Girgin, Julius Jankowski, and Sylvain Calinon ROSE: Robust State Estimation via Online Covariance Adaption . . . . . . . . . . . . . 452 Seyed Fakoorian, Kyohei Otsu, Shehryar Khattak, Matteo Palieri, and Ali-akbar Agha-mohammadi What Can Algebraic Topology and Differential Geometry Teach Us About Intrinsic Dynamics and Global Behavior of Robots? . . . . . . . . . . . . . . . . . . . . . . . . 468 Alin Albu-Schäffer and Arne Sachtler Planning Efficient Task Planning Using Abstract Skills and Dynamic Road Map Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 487 Khen Elimelech, Lydia E. Kavraki, and Moshe Y. Vardi CALIPSO: A Differentiable Solver for Trajectory Optimization with Conic and Complementarity Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 504 Taylor A. Howell, Kevin Tracy, Simon Le Cleac’h, and Zachary Manchester Clutter-Resilient Autonomous Mobile Robot Navigation with Computationally Efficient Free-Space Features . . . . . . . . . . . . . . . . . . . . . . . . 522 Rômulo T. Rodrigues, Nikolaos Tsiogkas, Nico Huebel, and Herman Bruyninckx Robust-RRT: Probabilistically-Complete Motion Planning for Uncertain Nonlinear Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 538 Albert Wu, Thomas Lew, Kiril Solovey, Edward Schmerling, and Marco Pavone

Contents

xv

Effort Informed Roadmaps (EIRM*): Efficient Asymptotically Optimal Multiquery Planning by Actively Reusing Validation Effort . . . . . . . . . . . . . . . . . . 555 Valentin N. Hartmann, Marlin P. Strub, Marc Toussaint, and Jonathan D. Gammell Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573

Robot Learning

It’s Just Semantics: How to Get Robots to Understand the World the Way We Do Jen Jen Chung1,2(B) , Julian F¨ orster2 , Paula Wulkop2 , Lionel Ott2 , 2,3 Nicholas Lawrance , and Roland Siegwart2 1

The University of Queensland, Brisbane, Australia [email protected] 2 ETH Z¨ urich, Z¨ urich, Switzerland {fjulian,pwulkop,lioott,rsiegwart}@ethz.ch 3 CSIRO, Brisbane, Australia [email protected]

Abstract. Increasing robotic perception and action capabilities promise to bring us closer to agents that are eﬀective for automating complex operations in human-centered environments. However, to achieve the degree of ﬂexibility and ease of use needed to apply such agents to new and diverse tasks, representations are required for generalizable reasoning about conditions and eﬀects of interactions, and as common ground for communicating with non-expert human users. To advance the discussion on how to meet these challenges, we characterize open problems and point out promising research directions.

1

Introduction

There is growing excitement surrounding the possibility of service robots soon becoming a staple in hospitals, malls, construction sites or even in the home. This is perhaps not entirely unwarranted given the astonishing technological progress that we’ve observed over just the last few years across all areas of robotics. There are now myriad options for autonomous point-to-point navigation in changing and dynamic human spaces [1–3], similarly so for safe physical interaction between robots and humans or other objects in the environment [4,5]. On top of this, there are also any number of methods that generate the internal scene representations needed to support these tasks [6–10]. Nevertheless, even the most optimistic among us would likely agree that a scenario where we could simply ask a robot (or a team of robots) to “tidy the room” and expect a satisfactory outcome is still a fairly distant dream. The individual skills needed to complete such a task—pick and place of objects [11–15], opening and closing drawers and cupboards [16–18], navigation in clutter [19,20], etc.—may all be available, but there is still a missing link that’s needed to translate the semantic notion of “tidy” into a potentially extensive and complex sequence of robot-legible actions. The ﬁeld of high-level symbolic planning [21] pushes towards this idea by oﬀering frameworks that operate over predicates (binary symbols of relations or c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 3–10, 2023. https://doi.org/10.1007/978-3-031-25555-7_1

4

J. J. Chung et al.

properties, e.g. “teddy on ﬂoor”) [22–24]. By describing the state of the world as a set of predicates and deﬁning the actions available to the robot in terms of preconditions and eﬀects (e.g. the precondition for a grasp action may be the predicate “object in reach” and its resulting eﬀect is “object in gripper”), such planners can accept commands such as “teddy in cupboard”, to output an action sequence {navigate to teddy, grasp teddy, navigate to cupboard, place teddy} that completes the task. While the logic may be straightforward, even such a simple example hides a surprising amount of complexity. Most notably, the semantic concepts of “on the ﬂoor” or “in the cupboard” lead to the same interpretability problem as in the “tidy” example. It may be easier to conceive of a metric computation (a grounding) to evaluate the “on” and “in” predicates compared to “tidy”, however, such a check still needs to be implemented. Indeed, one is needed for every predicate the robot encounters. Currently available solutions either hand-code these rules, which is untenable in the long run, or learn the groundings [25–28], which overﬁts to a speciﬁc task. Alternatively, end-toend approaches short-circuit symbolic planning altogether by directly learning actions from semantic commands without a (human-legible) intermediate representation [29], which precludes a shared human-robot understanding. Moreover, given the incomplete, fuzzy, representation- and context-dependent nature of most predicates (see Fig. 1 for an example of how even “on” can get messy), there is often no clear or easy way to consistently translate these to the metric representation of the environment in which robots operate. So far we’ve only considered the robot-agnostic elements of predicate understanding: how to determine whether a predicate holds given just the state of the world. A second challenge arises from the description of the robot actions, whose preconditions and eﬀects by necessity include robot-speciﬁc predicates. Take the grasp action as an example, previously we suggested that having the target “in reach” would be a suitable precondition. However, whether or not this predicate holds depends entirely on the robot’s available workspace, which varies from setup to setup. “In reach” also doesn’t cover the full set of conditions needed to determine an object’s graspability, since we would also need to factor in elements such as the robot’s grasping mechanism (ﬁngers, jamming, suction, etc.), its sensing capabilities and its payload capacity. A complete and correct set of symbolic action deﬁnitions would need to be based on robot-speciﬁc object aﬀordances, and these, too, must somehow be obtained. Given the diﬃculty of generating analytical translations between semantic predicates and the general metric space of a robot, it seems that the most promising solutions will likely require some degree of learning. Self-supervised strategies [30] may help to alleviate the need for manually labeled data and these can be of particular value when learning robot-centric aﬀordances. Notably, the ﬁdelity of available physics simulators [31] now lets us learn robot skills which transfer remarkably well to real-world execution [32,33]. Of course, we’re still left with all the robot-agnostic elements, which ultimately require some degree of user input, either via labeling [34,35] or by demonstration [36]. As with any works needing user input, the core challenges are how to most eﬃciently gather this

It’s Just Semantics

(a)

(b)

(c)

5

(d)

Fig. 1. (a)–(b) Two examples of where the predicate “container on table” holds. Indeed, we can generally agree that this predicate holds for any position or orientation of the container so long as it is supported from below by the table surface. (c) Does the predicate “lid on container” hold in this case? Typically when we say “put the lid on the container” we would like the outcome shown in (d). But if we were simply identifying the lid, we may describe it in both (c) and (d) as “the lid on the container”.

limited and precious resource, and how to most eﬀectively make use of it. Works that deal with activity classiﬁcation [37] and behavioral cloning from observation [38] can supplement these eﬀorts by exploiting the huge amount of online image and video data, while active learning [39] can help reduce the required amount of labeled data. However, regardless of the data source, additional eﬀort will invariably be needed to verify the correctness of learned semantic predicates and methods for further knowledge reﬁnement will almost surely be required. This paper presents some of the key challenges in consolidating the semantic nature in which humans interpret the world and the actions that we take within it, with the fundamentally metric representations available to our robots. The motivating thesis of this investigation is that by enabling a mutual understanding of the world (or at least a way to map from a human’s internal representation of it to a robot’s and vice versa), we can interact with robots as smoothly as we would with other humans. Potentially arriving at a point where we can get them to tidy our rooms with a single command.

2

Contextual Scene Understanding

Semantics provides a compressed representation of the world at the level of objects and their relationships. They allow us to reason eﬀortlessly over tasks that would otherwise be extremely tedious to describe within a metric representation. Assuming that we are stuck with robots that must, at the end of the day, operate using metric representations, the question is then, how do we convert our semantic, relational and contextual understanding of the environment into such a metric space? Deep learning-based algorithms for computer vision [40–42] and natural language processing [43,44] have already given us a taste of what’s possible at the conjunction of high-powered compute and enormous amounts of data. While this presents an obvious starting point, we posit here that a robot— an artiﬁcial embodied agent—performing long term reasoning and planning in shared human environments will need to undergo continual online adaptation

6

J. J. Chung et al.

of its world representation. This is not only so that it updates its scene understanding according to newly discovered semantic predicates, but also so that it can render its interactions with the world (reasoning about its own capabilities and the eﬀects of its actions) within the same representation as new objects or new capabilities are introduced. From a symbolic planning perspective, this means that we need eﬃcient strategies for discovering and evaluating predicates, as well as methods for generating agent-speciﬁc action deﬁnitions which can be corrected and completed over time given new interaction experiences. 2.1

Spatial Predicates: Object-to-Object Relationships

Thankfully, a large number of predicates that are needed to describe typical tasks undertaken in human spaces can be evaluated spatially or temporally. These involve prepositions that describe the relationship between two objects (e.g. “on”) or two actions (e.g. “after”). The latter is generally well-handled by symbolic planners that apply temporal logic [45,46], while we can conceivably learn the rules for spatial predicates given suﬃcient training examples of relevant sensory data, such as comparing point clouds or bounding boxes of objects [47]. Unfortunately, and perhaps not surprisingly, this is not the end of the story for spatial predicates. Take for instance the examples shown in Fig. 1. While the container is unquestionably “on” the table in all instances, whether or not the lid is considered “on” the container in Fig. 1c and Fig. 1d may diﬀer depending on the intention of the task. Although it’s possible to exhaustively learn the diﬀerent subtleties of each predicate from scratch, this is not a satisfying solution and may be fatally unrealistic depending on the training data and memory requirements. A more principled approach may be to build levels of speciﬁcity into the representation which would allow directly extending and specializing from established predicates. That way, when a robot is told that it failed to “put the lid on the container” as in Fig. 1c, it doesn’t search across all possible conﬁgurations of the lid and container, but rather searches only within those conﬁgurations for which its original understanding of “on” holds. Automatically learned predicate extensions can also provide an avenue for eﬃcient knowledge consolidation. For instance, a sequence of n objects one “on” the other would need n − 1 “on” predicates to deﬁne. Alternatively, a person would typically describe such an arrangement with a single predicate “stack”. Learning to associate jointly observed sets of predicates can potentially fast-track planning and will in general enable more compact state representations. One consideration to note is that some conditions may not be explicitly observable when simply associating predicate sets. For example, the order of the objects in a stack may or may not strictly matter. While this has little impact on being able to determine if a “stack” predicate holds, it can become very important for a planner that needs to know how to generate a stack from a random set of objects. Put another way, “on” is only a necessary condition for “stack”, additional eﬀort may still be needed to learn the full set of suﬃcient conditions for these higher-level predicates.

It’s Just Semantics

2.2

7

Aﬀordances: Object-to-Agent and Scene-to-Agent Relationships

A second set of predicates that are needed are those that relate speciﬁcally to the agent and the actions that it can execute. High-level symbolic actions are deﬁned by preconditions and eﬀects, the former is the set of predicates that must hold in order for the action to be executable while the latter is the set of predicate changes that the action causes. Generally speaking, action preconditions relate to the agent-speciﬁc aﬀordances of the particular object(s) targeted by that action, for example, recall the discussion on graspability from Sect. 1. One option for learning object aﬀordances is to simply rely on manually labeled data such as [35], however the limitations of this may quickly become apparent in an open world setting. In contrast, methods that can eﬃciently update and add to an existing knowledge base from the agent’s own experiences are considerably more powerful and more appealing, especially given our use case in robotics. Some aﬀordance predicates may be immediately observable even without requiring physical interaction with the object. For example, the geometry of an object can already disqualify it from being considered graspable. Beyond geometry, other remotely observable object properties can provide good proxies that allow aﬀordance generalization over unseen objects. Over time, a robot could learn to associate features, such as surface texture, with how easy an object has been to grasp given a particular end-eﬀector. In an ideal scenario, such associations would remove the need for physical interaction and would allow the agent to directly attach an aﬀordance to a new object just based on remote observation. The challenge is then to identify the salient object features that reveal these correlations and also to decide how optimistically (or conservatively) to generalize aﬀordances to newly encountered objects. Ultimately, however, the only way we can truly verify if an object has a particular aﬀordance is to physically interact with it to see if the associated action succeeds. To further complicate things, there are agent-, object- and scene-level reasons why an action may fail. An object may not be graspable because (i) it’s too heavy for the robot to lift; (ii) its current orientation does not allow grasping; or (iii) other objects in the scene are blocking the robot from grasping it. From the perspective of a symbolic planner, it can be quite useful to distinguish between these diﬀerent cases as they each point to potential recovery mechanisms. In the ﬁrst case, the ability to recover lies solely with the agent (increase the robot payload); in the second case the robot may be able to move the object into a graspable orientation; while in the last case, the robot will need to interact with other objects in the scene to change the graspability predicate of the target object. Uncovering the true reason behind why a particular object aﬀordance doesn’t hold will likely require a combination of knowledge extrapolation (to hypothesize what went wrong) and sophisticated exploration strategies to avoid the computational blowup of an exhaustive search. Predicate extension methods, such as those described in Sect. 2.1, will then also be needed to update the state and action representations to enable successful replanning.

8

2.3

J. J. Chung et al.

Challenges and Beneﬁts of Embodiment

Since aﬀordance predicates are inherently tied to the robot’s capabilities, their evaluation is at least directly measurable by the robot (did the action succeed or not?). Of course, knowing that an aﬀordance holds is not the same as knowing how to eﬀect that aﬀordance. Recent works propose methods that exploit simulation to learn this directly from point cloud data [16,17]. However, these approaches stop short of a full evaluation and only explore object interactions with a disembodied gripper, relying instead on downstream motion planners to deal with kinematic and dynamics constraints that arise from the full realization of the robot. It remains to be seen whether learning a more generalized (robot-agnostic) set of object aﬀordances, which are then reﬁned to each speciﬁc setup, or directly learning a robot-centric representation results in better overall functionality. As is typical for learning algorithms, ease of training, ease of use, generalizability and overall performance will all factor into this comparison, and perhaps unique to robotaﬀordance learning, we would also be very interested in how readily the learned representation can be adapted given live, real-world interaction data.

3

Summary and Outlook

Contextual scene understanding is challenging for a robot but it’s a problem worth tackling as it enables robots to truly reason about the world to solve general, large-scale, long-horizon tasks given simple semantic commands. We’ve identiﬁed several major challenges related to predicate-based world representations and outlined promising directions for investigation. These include ideas for automatic predicate extension via specialization of existing predicates or consolidation of jointly observed predicates, as well as ideas for generalizing aﬀordances to unseen objects and exploiting agent embodiment in robotic domains.

References 1. Cadena, C., et al.: Past, present, and future of simultaneous localization and mapping: toward the robust-perception age. IEEE Trans. Robot. 32(6), 1309–1332 (2016) 2. Van den Berg, J., Lin, M., Manocha, D.: Reciprocal velocity obstacles for realtime multi-agent navigation. In: IEEE International Conference on Robotics and Automation (2008) 3. Gao, Y., Huang, C.M.: Evaluation of socially-aware robot navigation. Front. Robot. AI 8(721317), 420 (2021) 4. Alami, R., et al.: Safe and dependable physical human-robot interaction in anthropic domains: State of the art and challenges. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2006) 5. Zacharaki, A., Kostavelis, I., Gasteratos, A., Dokas, I.: Safety bounds in human robot interaction: a survey. Saf. Sci. 127, 104667 (2020) 6. Florence, P., Manuelli, L., Tedrake, R.: Self-supervised correspondence in visuomotor policy learning. IEEE Robot. Autom. Lett. 5(2), 492–499 (2020) 7. Garg, S., et al.: Semantics for robotic mapping, perception and interaction: a surR Robot. 8(1–2), 1–224 (2020) vey. Found. Trends

It’s Just Semantics

9

8. Narita, G., Seno, T., Ishikawa, T., Kaji, Y.: PanopticFusion: online volumetric semantic mapping at the level of stuﬀ and things. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2019) 9. Grinvald, M., et al.: Volumetric instance-aware semantic mapping and 3D object discovery. IEEE Robot. Autom. Lett. 4(3), 3037–3044 (2019) 10. Kothari, P., Kreiss, S., Alahi, A.: Human trajectory forecasting in crowds: a deep learning perspective. IEEE Trans. Intell. Transp. Syst. 23, 7386–7400 (2021) 11. Bohg, J., Morales, A., Asfour, T., Kragic, D.: Data-driven grasp synthesis - a survey. IEEE Trans. Robot. 30(2), 289–309 (2014) 12. Gualtieri, M., ten Pas, A., Saenko, K., Platt, R.: High precision grasp pose detection in dense clutter. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2016) 13. Mahler, J., et al.: Learning ambidextrous robot grasping policies. Sci. Robot. 4(26), eaau4984 (2019) 14. Morrison, D., Corke, P., Leitner, J.: Learning robust, real-time, reactive robotic grasping. Int. J. Robot. Res. 39(2–3), 183–201 (2020) 15. Breyer, M., Chung, J.J., Ott, L., Siegwart, R., Nieto, J.: Volumetric grasping network: real-time 6 DOF grasp detection in clutter. In: Conference on Robot Learning (2021) 16. Mo, K., Guibas, L.J., Mukadam, M., Gupta, A., Tulsiani, S.: Where2Act: from pixels to actions for articulated 3D objects. In: IEEE/CVF International Conference on Computer Vision (2021) 17. Wu, R., et al.: VAT-Mart: learning visual action trajectory proposals for manipulating 3D articulated objects. In: International Conference on Learning Representations (2022) 18. Xu, Z., Zhanpeng, H., Song, S.: UMPNet: universal manipulation policy network for articulated objects. IEEE Robot. Autom. Lett. 7(2), 2447–2454 (2022) 19. Pierson, A., Vasile, C.I., Gandhi, A., Schwarting, W., Karaman, S., Rus, D.: Dynamic risk density for autonomous navigation in cluttered environments without object detection. In: International Conference on Robotics and Automation (2019) 20. Regier, P.: Robot navigation in cluttered environments. Ph.D. thesis, Rheinische Friedrich-Wilhelms-Universit¨ at Bonn (2022) 21. Karpas, E., Magazzeni, D.: Automated planning for robotics. Annu. Rev. Control Robot. Auton. Syst. 3, 417–439 (2019) 22. Fikes, R.E., Nilsson, N.J.: STRIPS: a new approach to the application of theorem proving to problem solving. Artif. Intell. 2(3–4), 189–208 (1971) 23. McDermott, D., et al.: PDDL: the planning domain deﬁnition language. Technical report, Yale Center for Computational Vision and Control (1998) 24. Garrett, C.R., Lozano-P´erez, T., Kaelbling, L.P.: FFRob: leveraging symbolic planning for eﬃcient task and motion planning. Int. J. Robot. Res. 37(1), 104–136 (2018) 25. Konidaris, G., Kaelbling, L.P., Lozano-Perez, T.: From skills to symbols: learning symbolic representations for abstract high-level planning. J. Artif. Intell. Res. 61, 215–289 (2018) 26. Ames, B., Thackston, A., Konidaris, G.: Learning symbolic representations for planning with parameterized skills. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2018) 27. Silver, T., Chitnis, R., Tenenbaum, J., Kaelbling, L.P., Lozano-Per´ez, T.: Learning symbolic operators for task and motion planning. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2021)

10

J. J. Chung et al.

28. Yuan, W., Paxton, C., Desingh, K., Fox, D.: SORNet: spatial object-centric representations for sequential manipulation. In: Conference on Robot Learning (2022) 29. Shridhar, M., Manuelli, L., Fox, D.: CLIPort: what and where pathways for robotic manipulation. In: Conference on Robot Learning (2022) 30. Nair, A., Bahl, S., Khazatsky, A., Pong, V., Berseth, G., Levine, S.: Contextual imagined goals for self-supervised robotic learning. In: Conference on Robot Learning (2020) 31. Collins, J., Chand, S., Vanderkop, A., Howard, D.: A review of physics simulators for robotic applications. IEEE Access 9, 51416–51431 (2021) 32. Peng, X.B., Andrychowicz, M., Zaremba, W., Abbeel, P.: Sim-to-real transfer of robotic control with dynamics randomization. In: IEEE International Conference on Robotics and Automation, pp. 3803–3810 (2018) 33. Zhao, W., Queralta, J.P., Westerlund, T.: Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In: IEEE Symposium Series on Computational Intelligence (2020) 34. Cohen, V., Burchﬁel, B., Nguyen, T., Gopalan, N., Tellex, S., Konidaris, G.: Grounding language attributes to objects using Bayesian eigenobjects. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2019) 35. Wald, J., Dhamo, H., Navab, N., Tombari, F.: Learning 3D semantic scene graphs from 3D indoor reconstructions. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020) 36. Gopalan, N., Rosen, E., Konidaris, G., Tellex, S.: Simultaneously learning transferable symbols and language groundings from perceptual data for instruction following. In: Proceedings of Robotics: Science and Systems, Corvalis, Oregon, USA, July (2020). https://doi.org/10.15607/RSS.2020.XVI.102 37. Rodr´ıguez-Moreno, I., Mart´ınez-Otzeta, J.M., Sierra, B., Rodriguez, I., Jauregi, E.: Video activity recognition: state-of-the-art. Sensors 19(14), 3160 (2019) 38. Torabi, F., Warnell, G., Stone, P.: Behavioral cloning from observation. In: International Joint Conference on Artiﬁcial Intelligence, pp. 4950–4957 (2018) 39. Bıyık, E., Losey, D.P., Palan, M., Landolﬁ, N.C., Shevchuk, G., Sadigh, D.: Learning reward functions from diverse sources of human feedback: optimally integrating demonstrations and preferences. Int. J. Robot. Res. 41(1), 45–67 (2022) 40. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classiﬁcation with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012) 41. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: uniﬁed, real-time object detection. In: IEEE Conference on Computer Vision and Pattern Recognition (2016) 42. He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN. In: IEEE International Conference on Computer Vision (2017) 43. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems (2017) 44. Brown, T., Mann, B., Ryder, N., Subbiah, M., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems (2020) 45. Belta, C., Bicchi, A., Egerstedt, M., Frazzoli, E., Klavins, E., Pappas, G.J.: Symbolic planning and control of robot motion [Grand challenges of robotics]. IEEE Robot. Autom. Mag. 14(1), 61–70 (2007) 46. Kress-Gazit, H., Fainekos, G.E., Pappas, G.J.: Temporal-logic-based reactive mission and motion planning. IEEE Trans. Robot. 25(6), 1370–1381 (2009) 47. Mo, K., Qin, Y., Xiang, F., Su, H., Guibas, L.: O2O-aﬀord: annotation-free largescale object-object aﬀordance learning. In: Conference on Robot Learning (2022)

Learning Agile, Vision-Based Drone Flight: From Simulation to Reality Davide Scaramuzza(B) and Elia Kaufmann University of Zurich, Z¨ urich, Switzerland [email protected]

Abstract. We present our latest research in learning deep sensorimotor policies for agile, vision-based quadrotor ﬂight. We show methodologies for the successful transfer of such policies from simulation to the real world. In addition, we discuss the open research questions that still need to be answered to improve the agility and robustness of autonomous drones toward human-pilot performance.

1

Introduction

Quadrotors are among the most agile and dynamic machines ever created. However, developing fully autonomous quadrotors that can approach or even outperform the agility of birds or human drone pilots with only onboard sensing and computing is very challenging and still unsolved. Prior work on autonomous quadrotor navigation has approached the challenge of vision-based autonomous navigation by separating the system into a sequence of independent compute modules [1–4]. While such modularization of the system is beneﬁcial in terms of interpretability and allows to easily exchange modules, it results in a substantial increase in latency from perception to action and results in errors being propagated from one module to the next. Furthermore, prior work on autonomous quadrotor navigation heavily relies on control techniques such as PID control [5– 7] or model predictive control (MPC) [8,9]. While these methods have demonstrated impressive feats in controlled environments [5,7,9,10], they require substantial amount of tuning, and are diﬃcult, if not impossible, to scale to complex dynamics models without large penalties in computation time. In this extended abstract, we summarize our latest research in learning deep sensorimotor policies for agile vision-based quadrotor ﬂight. Learning sensorimotor controllers represents a holistic approach that is more resilient to noisy sensory observations and imperfect world models. Training robust policies requires however a large amount of data. Nevertheless, we will show that simulation data, combined with randomization and abstraction of sensory observations, is enough to train policies that generalize to the real world. Such policies enable autonomous quadrotors to ﬂy faster and more agile than what was possible before with only onboard sensing and computation. In addition, we discuss the open research questions that still need to be answered to improve the agility and robustness of autonomous drones toward human-pilot performance. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 11–18, 2023. https://doi.org/10.1007/978-3-031-25555-7_2

12

D. Scaramuzza and E. Kaufmann

Fig. 1. Our drone ﬂying at high speed (10 m/s) through a dense forest using only onboard sensing and computation.

2

High-Speed Flight in the Wild

In [11], we presented an end-to-end approach that can autonomously ﬂy quadrotors through complex natural and human-made environments at high speeds (10 m/s), with purely onboard sensing and computation.1 The key principle is to directly map noisy sensory observations to collision-free trajectories in a receding-horizon fashion. This direct mapping drastically reduces processing latency and increases robustness to noisy and incomplete perception. The sensorimotor mapping is performed by a convolutional network that is trained exclusively in simulation via privileged learning: imitating an expert with access to privileged information. We leverage abstraction of the input data to transfer the policy from simulation to reality [12,13]. To this end, we utilize a stereo matching algorithm to provide depth images as input to the policy. We show that this representation is both rich enough to safely navigate through complex environments and abstract enough to bridge the gap between simulation and real world. By combining abstraction with simulating realistic sensor noise, our approach achieves zero-shot transfer from simulation to challenging real-world environments that were never experienced during training: dense forests, snow-covered terrain, derailed trains, and collapsed buildings. Our work demonstrates that end-to-end policies trained in simulation enable high-speed autonomous ﬂight through challenging environments, outperforming traditional obstacle avoidance pipelines. A qualitative example of ﬂight in the wild is shown in Fig. 1. 1

A video of the results can be found here: https://youtu.be/m89bNn6RFoQ.

Learning Agile, Vision-Based Drone Flight: From Simulation to Reality

13

Fig. 2. Our quadrotor performs a Matty Flip. The drone is controlled by a deep sensorimotor policy and uses only onboard sensing and computation.

3

Acrobatic Flight

Acrobatic ﬂight with quadrotors is extremely challenging. Human drone pilots require many years of practice to safely master maneuvers such as power loops and barrel rolls. For aerial vehicles that rely only on onboard sensing and computation, the high accelerations that are required for acrobatic maneuvers together with the unforgiving requirements on the control stack raise fundamental questions in both perception and control. For this reason, we challenged our drone with the task of performing acrobatic maneuvers [13].2 In order to achieve this task, we trained a neural network to predict actions directly from visual and inertial sensor observations. Training is done by imitating an optimal controller with access to privileged information in the form of the exact robot state. Since such information is not available in the physical world, we trained the neural network to predict actions instead from inertial and visual observations. Similarly to the work described in the previous section, all of the training is done in simulation, without the need of any data from the real world. We achieved this by using abstraction of sensor measurements, which reduces the simulation-to-reality gap compared to feeding raw observations. Both theoretically and experimentally, we have shown that abstraction strongly favours simulation-to-reality transfer. The learned policy allowed our drone to go faster than ever before and successfully ﬂy maneuvers with accelerations of up to 3g, such as the Matty ﬂip illustrated in Fig. 2. 2

A video of the results can be found here: https://youtu.be/2N wKXQ6MXA.

14

D. Scaramuzza and E. Kaufmann

4

Autonomous Drone Racing

Drone racing is an emerging sport where pilots race against each other with remote-controlled quadrotors while being provided a ﬁrst-person-view (FPV) video stream from a camera mounted on the drone. Drone pilots undergo years of training to master the skills involved in racing. In recent years, the task of autonomous drone racing has received substantial attention in the robotics community, which can mainly be attributed to two reasons: (i) The sensorimotor skills required for autonomous racing would also be valuable to autonomous systems in applications such as disaster response or structure inspection, where drones must be able to quickly and safely ﬂy through complex dynamic environments. (ii) The task of autonomous racing provides an ideal test bed to objectively and quantiﬁably compare agile drone ﬂight against human baselines. It also allows us to measure the research progress towards the ambitious goal of achieving super-human performance. One approach to autonomous racing is to ﬂy through the course by tracking a precomputed, potentially time-optimal, trajectory [10]. However, such an approach requires to know the race-track layout in advance, along with highly accurate state estimation, which current methods are still not able to provide. Indeed, visual inertial odometry is subject to drift in estimation over time. SLAM methods can reduce drift by relocalizing in a previously-generated, globally-consistent map. However, enforcing global consistency leads to increased computational demands that strain the limits of on-board processing. Instead of relying on globally consistent state estimates, we deploy a convolutional neural network to identify the next location to ﬂy to, also called waypoints. However, it is not clear a priori what should be the representation of the next waypoint. In our works, we have explored diﬀerent solutions. In our preliminary work, the neural network predicts a ﬁxed distance location from the drone [14]. Training was done by imitation learning on a globally optimal trajectory passing through all the gates. Despite being very eﬃcient and easy to develop, this approach cannot eﬃciently generalize between diﬀerent track layouts, given the fact that the training data depends on a track-dependent global trajectory representing the optimal path through all gates. For this reason, a follow-up version of this work proposed to use as waypoint the location of the next gate [15]. As before, the prediction of the next gate is provided by a neural network. However, in contrast to the previous work, the neural network also predicts a measure of uncertainty. Even though the waypoint representation proposed in [15] allowed for eﬃcient transfer between track layouts, it still required substantial amount of real-world data to train. Collecting such data is generally a very tedious and time consuming process, which represents a limitation of the two previous works. In addition, when something changes in the environment, or the appearance of the gates changes substantially, the data collection process needs to be repeated from scratch. For this reason, in our recent work [16] we have proposed to collect data exclusively in simulation. To enable transfer between the real and the physical world, we randomized all the features which predictions should be robust against,

Learning Agile, Vision-Based Drone Flight: From Simulation to Reality

15

Fig. 3. The perception component of our system, represented by a convolutional neural network (CNN), is trained only with non-photorealistic simulation data.

i.e. illumination, gate shape, ﬂoor texture, and background. A sample of the training data generated by this process, called domain randomization, can be observed in Fig. 3. Our approach was the ﬁrst to demonstrate zero-shot sim-toreal transfer on the task of agile drone ﬂight. A collection of the ideas presented in the above works has been used by our team to successfully compete in the Alpha-Pilot competition [17].3

5

Future Work

Our work shows that neural networks have a strong potential to control agile platforms like quadrotors. In comparison to traditional methods, neural policies are more robust to noise in the observations and can deal with imperfection in sensing and actuation. However, the approaches presented in this extended abstract fall still short of the performance of professional drone pilots. To further push the capabilities of autonomous drones, a specialization to the task is required, potentially through real-world adaptation and online learning. Solving those challenges could potentially bring autonomous drones closer in agility to human pilots and birds. In this context, we present some limitations of the approaches proposed so far and interesting avenues for future work. 5.1

Generalization

One example of such a challenge is generalization: even though this extended abstract presents methods presents promising approaches to push the frontiers of 3

A video of the results can be found here: https://youtu.be/DGjwm5PZQT8.

16

D. Scaramuzza and E. Kaufmann

autonomous, vision-based agile drone navigation in scenarios such as acrobatic ﬂight, agile ﬂight in cluttered environments, and drone racing, it is not clear how these sensorimotor policies can be eﬃciently transferred to novel task formulations, sensor conﬁgurations, or physical platforms. Following the methodologies presented in this paper, transferring to any of these new scenarios would require to retrain the policy in simulation, or perform adaptation using learned residual models. While the former suﬀers from the need to re-identify observation models and dynamics models, the latter is restricted to policy transfer between simulation and reality for the same task. Possible avenues for future work to address this challenge include hierarchical learning [18] or approaches that optimize policy parameters on a learned manifold [19]. 5.2

Continual Learning

The approaches to sensorimotor policy training presented in this paper are static in their nature: after the policy is trained in simulation via imitation learning or reinforcement learning, its parameters are frozen and applied on the real robot. In contrast, natural agents interact fundamentally diﬀerent with their environment; they continually adapt to new experience and improve their performance in a much more dynamic fashion. Designing an artiﬁcial agent with similar capabilities would greatly increase the utility of robots in the real world and is an interesting direction for future work. Recent work has proposed methods to enable artiﬁcial agents to perform few-shot learning of new tasks and scenarios using techniques such as adaptive learning [20,21] or meta learning [22,23]. These methods have shown promising results on simple manipulation and locomotion tasks but remain to be demonstrated on complex navigation tasks such as high-speed drone ﬂight in the real world. 5.3

Autonomous Drone Racing

The approaches presented in this paper addressing the challenge of autonomous drone racing are limited to single-agent time trial races. To achieve true racing behaviour, these approaches need to be extended to a multi-agent setting, which raises novel challenges in perception, planning, and control. Regarding perception, multi-agent drone racing requires to detect opponent agents, which is a challenging task when navigating at high speeds and when onboard observations are subject to motion blur. The planning challenges arise from the need to design game-theoretic strategies [24] that involve maneuvers such as strategic blocking, which are potentially not time-optimal but still dominant approaches when competing in a multi-agent setting. Additionally, the limited ﬁeld of view of the onboard camera renders opponents potentially unobservable, which requires a mental model of the trajectory of opponents. Finally, racing simultaneously with other drones through a race track poses new challenges in modeling and

Learning Agile, Vision-Based Drone Flight: From Simulation to Reality

17

control, as aerodynamic eﬀects induced by other agents need to be accounted for. We envision that many of these challenges can be addressed using deep reinforcement learning in combination with self-play, where agents improve by competing against each other [25,26]. Acknowledgments. This work was supported by the National Centre of Competence in Research (NCCR) Robotics through the Swiss National Science Foundation (SNSF) and the European Union’s Horizon 2020 Research and Innovatifon Program under grant agreement No. 871479 (AERIAL-CORE) and the European Research Council (ERC) under grant agreement No. 864042 (AGILEFLIGHT).

References 1. Cieslewski, T., Kaufmann, E., Scaramuzza, D.: Rapid exploration with multirotors: a frontier selection method for high speed ﬂight. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2135–2142 (2017) 2. Ryll, M., Ware, J., Carter, J., Roy, N.: Eﬃcient trajectory planning for high speed ﬂight in unknown environments. In: International Conference on Robotics and Automation (ICRA), pp. 732–738. IEEE (2019) 3. Mahony, R., Kumar, V., Corke, P.: Multirotor aerial vehicles: modeling, estimation, and control of quadrotor. IEEE Robot. Autom. Mag. 19(3), 20–3 (2012) 4. Shen, S., Mulgaonkar, Y., Michael, N., Kumar, V.: Vision-based state estimation and trajectory control towards high-speed ﬂight with a quadrotor. In: Robotics: Science and Systems (RSS) (2013) 5. Falanga, D., Mueggler, E., Faessler, M., Scaramuzza, D.: Aggressive quadrotor ﬂight through narrow gaps with onboard sensing and computing using active vision. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 5774– 5781 (2017) 6. Faessler, M., Franchi, A., Scaramuzza, D.: Diﬀerential ﬂatness of quadrotor dynamics subject to rotor drag for accurate tracking of high-speed trajectories. IEEE Robot. Autom. Lett. 3(2), 620–626 (2018) 7. Sun, S., Sijbers, L., Wang, X., de Visser, C.: High-speed ﬂight of quadrotor despite loss of single rotor. IEEE Robot. Autom. Lett. 3(4), 3201–3207 (2018) 8. Falanga, D., Foehn, P., Lu, P., Scaramuzza, D.: PAMPC: perception-aware model predictive control for quadrotors. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2018) 9. Nan, F., Sun, S., Foehn, P., Scaramuzza, D.: Nonlinear MPC for quadrotor faulttolerant control. IEEE Robot. Autom. Lett. 7(2), 5047–5054 (2022) 10. Foehn, P., Romero, A., Scaramuzza, D.: Time-optimal planning for quadrotor waypoint ﬂight. Sci. Robot. 6(56), eabh1221 (2021). https://robotics.sciencemag.org/ content/6/56/eabh1221 11. Loquercio, A., Kaufmann, E., Ranftl, R., M¨ uller, M., Koltun, V., Scaramuzza, D.: Learning high-speed ﬂight in the wild. Sci. Robot. 6(59), eabg5810 (2021) 12. M¨ uller, M., Dosovitskiy, A., Ghanem, B., Koltun, V.: Driving policy transfer via modularity and abstraction. In: Conference on Robot Learning, pp. 1–15 (2018) 13. Kaufmann, E. Loquercio, A., Ranftl, R., M¨ uller, M., Koltun, V., Scaramuzza, D.: Deep drone acrobatics. In: Robotics: Science and Systems (RSS) (2020)

18

D. Scaramuzza and E. Kaufmann

14. Kaufmann, E., Loquercio, A., Ranftl, R., Dosovitskiy, A., Koltun, V., Scaramuzza, D.: Deep drone racing: learning agile ﬂight in dynamic environments. In: Conference on Robotics Learning (CoRL) (2018) 15. Kaufmann, E., et al.: Beauty and the beast: optimal methods meet learning for drone racing. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 690–696 (2019). https://doi.org/10.1109/ICRA.2019.8793631 16. Loquercio, A., Kaufmann, E., Ranftl, R., Dosovitskiy, A., Koltun, V., Scaramuzza, D.: Deep drone racing: from simulation to reality with domain randomization. IEEE Trans. Robot. 36(1), 1–14 (2019) 17. Foehn, P., et al.: Alphapilot: autonomous drone racing. In: Robotics: Science and System (RSS) (2022). https://link.springer.com/article/10.1007/s11370-01800271-6 18. Haarnoja, T., Hartikainen, K., Abbeel, P., Levine, S.: Latent space policies for hierarchical reinforcement learning. In: International Conference on Machine Learning, pp. 1851–1860. PMLR (2018) 19. Peng, X. B., Coumans, E., Zhang, T., Lee, T.W., Tan, J., Levine, S.: Learning agile robotic locomotion skills by imitating animals. In: Robotics: Science and Systems (RSS) (2020) 20. Smith, L., Kew, J.C., Peng, X.B., Ha, S., Tan, J., Levine, S.: Legged robots that keep on learning: ﬁne-tuning locomotion policies in the real world. arXiv preprint arXiv:2110.05457 (2021) 21. Kumar, A., Fu, Z., Pathak, D., Malik, J.: RMA: rapid motor adaptation for legged robots. arXiv preprint arXiv:2107.04034 (2021) 22. Andrychowicz, M., et al.: Learning to learn by gradient descent by gradient descent. In: Advances in Neural Information Processing Systems, vol. 29 (2016) 23. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International Conference on Machine Learning, pp. 1126– 1135. PMLR (2017) 24. Spica, R., Falanga, D., Cristofalo, E., Montijano, E., Scaramuzza, D., Schwager, M.: A real-time game theoretic planner for autonomous two-player drone racing. In: Robotics: Science and Systems (RSS) (2018) 25. Baker, B., et al.: Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528 (2019) 26. Silver, D., et al.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017)

Continual SLAM: Beyond Lifelong Simultaneous Localization and Mapping Through Continual Learning Niclas V¨odisch1(B) , Daniele Cattaneo1 , Wolfram Burgard2 , and Abhinav Valada1 1

Department of Computer Science, University of Freiburg, Freiburg, Germany [email protected] 2 Department of Engineering, University of Technology Nuremberg, Nuremberg, Germany

Abstract. Robots operating in the open world encounter various different environments that can substantially differ from each other. This domain gap also poses a challenge for Simultaneous Localization and Mapping (SLAM) being one of the fundamental tasks for navigation. In particular, learning-based SLAM methods are known to generalize poorly to unseen environments hindering their general adoption. In this work, we introduce the novel task of continual SLAM extending the concept of lifelong SLAM from a single dynamically changing environment to sequential deployments in several drastically differing environments. To address this task, we propose CL-SLAM leveraging a dual-network architecture to both adapt to new environments and retain knowledge with respect to previously visited environments. We compare CL-SLAM to learning-based as well as classical SLAM methods and show the advantages of leveraging online data. We extensively evaluate CL-SLAM on three different datasets and demonstrate that it outperforms several baselines inspired by existing continual learning-based visual odometry methods. We make the code of our work publicly available at http://continualslam.cs.uni-freiburg.de.

1 Introduction An essential task for an autonomous robot deployed in the open world without prior knowledge about its environment is to perform Simultaneous Localization and Mapping (SLAM) to facilitate planning and navigation [11, 27]. To address this task, various SLAM algorithms based on different sensors have been proposed, including classical methods [28] and learning-based approaches [3, 18]. Classical methods typically rely on handcrafted low-level features that tend to fail under challenging conditions, e.g., textureless regions. Deep learning-based approaches mitigate such problems due to their ability to learn high-level features. However, they lack the ability to generalize This work was funded by the European Union’s Horizon 2020 research and innovation program under grant agreement No 871449-OpenDR. Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25555-7 3. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 19–35, 2023. https://doi.org/10.1007/978-3-031-25555-7_3

20

N. V¨odisch et al. Oxford, UK

Karlsruhe, DE

Domain Adaptation

...

...

...

Lifelong SLAM

Zurich, CH

Domain Adaptation

Continual SLAM - Online Adaptation

Fig. 1. While lifelong SLAM considers the long-term operation of a robot in a single dynamically changing environment, domain adaptation techniques aim toward transferring knowledge gained in one environment to another environment. The newly defined task of continual SLAM extends both settings by requiring omnidirectional adaptation involving multiple environments. Agents have to both quickly adapt to new environments and effectively recall knowledge from previously visited environments.

to out-of-distribution data, with respect to the training set. For visual SLAM, such outof-distribution data can correspond to images sourced from cities in different countries or under substantially different conditions. In the following, we use the term environment to refer to a bounded geographical area. While different environments can share the same fundamental structure, e.g., urban areas, their specific characteristics prevent the seamless transfer of learned features, resulting in a domain gap between cities [1]. In the context of this work, lifelong SLAM considers the long-term operation of a robot in a single environment (see Fig. 1). Although this environment can be subject to temporal changes, the robot is constrained to stay within the area borders [14], e.g., to obtain continuous map updates [15] within a city. Recent works attempt to relax this assumption by leveraging domain adaptation techniques for deep neural networks, including both regularization [33] and online adaptation of the employed model [19, 20, 23]. While a naive solution for adapting to a new environment is to source additional data, this is not feasible when the goal is to ensure the uninterrupted operation of the robot. Moreover, changes within an environment can be sudden, e.g., rapid weather changes, and data collection and annotation often come at a high cost. Therefore, adaptation methods should be trainable in an unsupervised or self-supervised manner without the need for ground truth data. As illustrated in Fig. 1, the setting addressed in domain adaptation only considers unidirectional knowledge transfer from a single known to a single unknown environment [1] and thus does not represent the open world, where the number of new environments that a robot can encounter is infinite and previously seen environments can be revisited. To address this gap, we take the next step by considering more complex sequences of environments and formulate the novel task of continual SLAM that leverages insights from both continual learning (CL) and lifelong SLAM. We propose a dual-network architecture called CL-SLAM to balance adaptation to new environments and memory retention of preceding environments. To assess

Continual SLAM

21

its efficacy, we define two metrics, adaptation quality and retention quality, and compare CL-SLAM to several baselines inspired by existing CL-based VO methods involving three different environments. We make the code of this work publicly available at http://continual-slam.cs.uni-freiburg.de. The supplementary material can be found at https://arxiv.org/abs/2203.01578.

2 Related Work Visual Odometry/SLAM: Visual odometry (VO) and vision-based SLAM estimate camera motion from a video. Allowing for self-supervised training, monocular VO can be tackled jointly with depth estimation based on photometric consistency. SfMLearner [34] uses an end-to-end approach consisting of two networks to predict depth from a single image and camera motion from two consecutive images. The networks are trained in parallel by synthesizing novel views of the target image. Monodepth2 [6] extends the loss function to account for occluded and static pixels. Other works such as DF-VO [32] eliminate the need for a pose network by leveraging feature matching based on optical flow. While these methods show superior performance [20], computing a gradient of the predicted pose with respect to the input image is not possible using classic point matching algorithms. To reduce drift, DeepSLAM [18] combines unsupervised learning-based VO with a pose graph backend taking global loop closures into account. In this work, we use a trainable pose network with velocity supervision [8] to resolve scale ambiguity. Similar to DeepSLAM, we detect loop closures and perform graph optimization. Continual Learning: Traditionally, a learning-based model is trained for a specific task on a dedicated training set and then evaluated on a hold-out test set sampled from the same distribution. However, in many real-world applications, the data distributions can differ or even change over time. Additionally, the initial task objective might be altered. Continual learning (CL) and lifelong learning [31] address this problem by defining a paradigm where a model is required to continuously readjust to new tasks and/or data distributions without sacrificing the capability to solve previously learned tasks, thus avoiding catastrophic forgetting. Most CL approaches employ one of three strategies. First, experience replay includes rehearsal and generative replay. Rehearsal refers to reusing data samples of previous tasks during adaptation to new tasks, e.g., the replay buffer in CoMoDA [16]. Minimizing the required memory size, the most representative samples should be carefully chosen or replaced by more abstract representations [7]. Similarly, generative replay constructs artificial samples by training generative models [30]. Second, regularization [21] prevents a CL algorithm from overfitting to the new tasks to mitigate forgetting, e.g., knowledge distillation. Third, architectural methods [13] preserve knowledge by adding, duplicating, freezing, or storing parts of the internal model parameters. They further include dual architectures that are inspired by mammalian brains [25], where one model learns the novel task and a second model memorizes previous experience. In this work, we combine architectural and replay strategies by leveraging a dual-network architecture with online adaptation incorporating data rehearsal.

22

N. V¨odisch et al.

Online Adaptation for Visual Odometry and Depth Estimation: Recently, Luo et al. [23] employed a subtask of CL for self-supervised VO and depth estimation, opening a new avenue of research. Online adaptation enables these methods to enhance the trajectory and depth prediction on a test set sourced from a different data distribution than the originally used training set. Both Zhang et al. [33] and CoMoDA [16] primarily target the depth estimation task. While Zhang et al. propose to learn an adapter to map the distribution of the online data to the one of the training data, CoMoDA updates the internal parameters of the depth and pose networks based on online data and a replay buffer. The work in spirit most similar to ours is done by Li et al. [19]. They propose to substitute the standard convolutional layers in the depth and pose networks with convolutional LSTM variants. Then, the model parameters are continuously updated using only the online data. In subsequent work, Li et al. [20] replace the learnable pose network by point matching from optical flow. Note that all existing works purely focus on one-step adaptation, i.e., transferring knowledge gained in one environment to a single new environment. In this paper, we introduce continual SLAM to take the next step by considering more complex deployment scenarios comprising more than two environments and further alternating between them.

3 Continual SLAM Problem Setting: Deploying a SLAM system in the open world substantially differs from an experimental setting, in which parameter initialization and system deployment are often performed in the same environment. To overcome this gap, we propose a new task called Continual SLAM, illustrated in Fig. 1, where the robot is deployed on a sequence of diverse scenes from different environments. Ideally, a method addressing the continual SLAM problem should be able to achieve the following goals: 1) quickly adapt to unseen environments while deployment, 2) leverage knowledge from previously seen environments to speed up the adaptation, and 3) effectively memorize knowledge from previously seen environments to minimize the required adaptation when revisiting them, while mitigating overfitting to any of the environments. Formally, continual SLAM can be defined as a potentially infinite sequence of scenes S = (s1 → s2 → . . . ) from a set of different environments si ∈ {Ea , Eb , . . . }, where s denotes a scene and E denotes an environment. In particular, S can contain multiple scenes from the same environment and the scenes in S can occur in any possible fixed order. A continual SLAM algorithm A can be defined as A : < θi−1 , (s1 , . . . , si ) > → < θi >,

(1)

where (s1 , . . . , si ) refers to the seen scenes in the specified order and θi denotes the corresponding state of the learnable parameters of the algorithm. During deployment, the algorithm A has to update θi−1 based on the newly encountered scene si . For instance, given two environments Ea = {s1a , s2a } and Eb = {s1b }, which comprise two and one scenes, respectively, examples of feasible sequences are S1 = (s1a → s1b → s2a ),

S2 = (s2a → s1a → s1b ),

S3 = (s1b → s2a → s1a ),

(2)

Continual SLAM

23

where the scene subscripts denote the corresponding environment and the superscripts refer to the scene ID in this environment. As described in Sect. 1, the task of continual SLAM is substantially different from lifelong SLAM or unidirectional domain adaptation as previously addressed by Luo et al. [23] and Li et al. [19, 20]. To conclude, we identify the following main challenges: 1) large number of different environments, 2) huge number of chained scenes, 3) scenes can occur in any possible order, and 4) environments can contain multiple scenes. Therefore, following the spirit of continual learning (CL), a continual SLAM algorithm has to balance between shortterm adaptation to the current scene and long-term knowledge retention. This trade-off is also commonly referred to as avoiding catastrophic forgetting with respect to previous tasks without sacrificing performance on the new task at hand. Performance Metrics: To address the aforementioned challenges, we propose two novel metrics, namely adaptation quality (AQ), which measures the short-term adaptation capability when being deployed in a new environment, and retention quality (RQ), which captures the long-term memory retention when revisiting a previously encountered environment. In principle, these metrics can be applied to any given base metric Md that can be mapped to the interval [0, 1], where 0 and 1 are the lowest and highest performances, respectively. The subscript d denotes the given sequence, where the error is computed on the final scene. Base Metrics: For continual SLAM, we leverage the translation error terr (in %) and the rotation error rerr (in ◦ /m), proposed by Geiger et al. [5], that evaluate the error as a function of the trajectory length. To obtain scores in the interval [0, 1], we apply the following remapping: terr rerr , (3) terr = max 0, 1 − , rerr = 1 − 100 180 where we clamp terr to 0 for terr > 100%. The resulting terr and rerr are then used as the base metric M to compute AQtrans /RQtrans and AQrot / RQrot , respectively. Adaptation Quality: The adaptation quality (AQ) measures the ability of a method to effectively adapt to a new environment based on experiences from previously seen environments. It is inspired by the concept of forward transfer (FWT) [22] in traditional CL, which describes how learning a current task influences the performance of a future task. Particularly, positive FWT enables zero-shot learning, i.e., performing well on a future task without explicit training on it. On the other hand, negative FWT refers to sacrificing performance on a future task by learning the current task. In our context, a task refers to performing SLAM in a given environment. Consequently, the AQ is intended to report how well a continual SLAM algorithm is able to minimize negative FWT, e.g., by performing online adaptation. To illustrate the AQ, we consider the simplified example of a set of two environments {Ea , Eb } consisting of different numbers of scenes. We further assume that the algorithm has been initialized in a separate environment Ep . Since the AQ focuses on the cross-environment adaptation, we sample one random scene from each environment sa ∈ Ea and sb ∈ Eb and hold them fixed. Now, we construct the set of all possible deployment sequences D = {(sp → sa ), (sp → sb ), (sp → sa → sb ),

24

N. V¨odisch et al.

(sp → sb → sa )}, where sp ∈ Ep is the data used for initialization. The AQ is then defined as: 1 Md . (4) AQ = |D| d∈D

Retention Quality: To further account for the opposing challenge of the continual SLAM setting, we propose the retention quality (RQ) metric. It measures the ability of an algorithm to preserve long-term knowledge when being redeployed in a previously encountered environment. It is inspired by the concept of backward transfer (BWT) [22] in CL settings, which describes how learning a current task influences the performance on a previously learned task. While positive BWT refers to improving the performance on prior tasks, negative BWT indicates a decrease in the performance of the preceding task. The extreme case of a large negative BWT is often referred to as catastrophic forgetting. Different from classical BWT, we further allow renewed online adaptation when revisiting a previously seen environment, i.e., performing a previous task, as such a setting is more sensible for a robotic setup. It further avoids the necessity to differentiate between new and already seen environments, which would require the concept of environment classification in the open world. To illustrate the RQ, we consider a set of two environments {Ea , Eb } consisting of different numbers of scenes. We further assume that the algorithm has been initialized on data sp of a separate environment Ep . We sample two random scenes from each environment, i.e., s1a , s2a , s1b , and s2b . To evaluate the RQ, we need to construct a deployment sequence S that consists of alternating scenes from the two considered environments. In this example, we consider the following fixed sequence: S = (sp → s1a → s1b → s2a → s2b ).

(5)

We then consider all the subsequences D of S in which the last scene comes from an environment already visited prior to a deployment in a scene of a different environment. In this example, D = (sp → s1a → s1b → s2a ), (sp → s1a → s1b → s2a → s2b ) . The RQ is then defined as the sum over all differences of the base metric in a known environment before and after deployment in a new environment, divided by the size of D. For instance, given the sequence in Eq. 5: 1 Msps1as1bs2a − Msps1as2a + Msps1as1bs2as2b − Msps1as1bs2b . RQ = 2 (6)

4 Technical Approach Framework Overview: The core of CL-SLAM is the dual-network architecture of the visual odometry (VO) model that consists of an expert that produces myopic online odometry estimates and a generalizer that focuses on the long-term learning across environments (see Fig. 2). We train both networks in a self-supervised manner where the weights of the expert are updated only based on online data, whereas the weights of the generalizer are updated based on a combination of data from both the online stream and a replay buffer. We use the VO estimates of the expert to construct a pose graph (see

Continual SLAM

25

Time ...

Generalizer

Update weights

Expert

Update weights

DepthNet

PoseNet

Replay buffer

Fig. 2. Online adaptation scheme of our proposed CL-SLAM that is constructed as a dual-network architecture including a generalizer (left) and an expert (right). While the expert focuses on the short-term adaptation to the current scene, the generalizer avoids catastrophic forgetting by employing a replay buffer comprising samples from the past and the present. Note that both subnetworks contain a single PoseNet, shown twice to reflect pose estimation at different steps. The predicted odometry Ot−1t is sent to the SLAM framework as shown in Fig. 3.

Fig. 3). To reduce drift, we detect global loop closures and add them to the graph, which is then optimized. Finally, we can create a dense 3D map using the depth predicted by the expert and the optimized path. Visual Odometry: We generate VO estimates following the commonly used approach of using a trainable pose network [2, 6, 8, 18] for self-supervised depth estimation with a stream of monocular images. The basic idea behind this approach is to synthesize a novel view of an input image using image warping as reviewed in the supplementary material. In this work, we use Monodepth2 [6] to jointly predict the depth map of an image and the camera motion from the previous timestep to the current. To recover metric scaling of both depth and the odometry estimates, we adapt the original loss function with a velocity supervision term as proposed by Guizilini et al. [8]. As scalar velocity measurements are commonly available in robotic systems, e.g., by wheel odometry, this does not pose an additional burden. Our total loss is composed of the photometric reprojection loss Lpr , the image smoothness loss Lsm , and the velocity supervision loss Lvel : (7) L = Lpr + γLsm + λLvel . Following the common methodology, we compute the loss based on an image triplet {It−2 , It−1 , It } using depth and odometry predictions Dt−1 , Ot−2t−1 , and Ot−1t . We provide more details on the individual losses in the supplementary material. Loop Closure Detection and Pose Graph Optimization: In order to reduce drift over time, we include global loop closure detection and pose graph optimization (see Fig. 3). We perform place recognition using a pre-trained and frozen CNN, referred to as LoopNet. In particular, we map every frame to a feature vector using MobileNetV3

26

N. V¨odisch et al.

Time Expert ...

Dense 3D map

DepthNet PoseNet LoopNet

Loop closure detection

Pose graph optimization

Fig. 3. Full SLAM framework of our proposed CL-SLAM. Global loop closures are detected with a pre-trained CNN. Visual odometry estimates between both consecutive frames and loop closure frames are generated by the PoseNet and added to a pose graph, which is optimized upon the detection of a new loop closure. Finally, a dense 3D map can be created using the predicted depth and the optimized path.

small [10], trained on ImageNet, and store them in a dedicated memory. Then, we compute the cosine similarity of the current feature map with all preceding feature maps: simcos = cos(fcurrent , fprevious ).

(8)

If simcos is above a given threshold, we use the PoseNet to compute the transformation between the corresponding images. During deployment, we continuously build a pose graph [17] consisting of both local and global connections, i.e., consecutive VO estimates and loop closures. Whenever a new loop closure is detected, the pose graph is optimized. Online Adaptation: In this section, we describe the dual-network architecture of the VO predictor in CL-SLAM that effectively addresses the trade-off between short-term adaptation and long-term memory retention, a problem also known as catastrophic forgetting. Subsequently, we detail the training scheme including the utilized replay buffer. Architecture: The dual-network architecture consists of two instances of both the DepthNet and the PoseNet. In the following, we refer to these instances as expert and generalizer. We build upon the architecture of Monodepth2 [6]. The DepthNet has an encoderdecoder topology, comprising a ResNet-18 [9] encoder and a CNN-based decoder with skip connections, and predicts disparity values for each pixel in the input image. The PoseNet consists of a similar structure using a separate ResNet-18 encoder followed by additional convolutional layers to generate the final output that represents translation and rotation between two input images. Further implementation details are provided in Sect. 5. Training Scheme: Before deployment, i.e., performing continual adaptation, we pretrain the DepthNet and the PoseNet using the standard self-supervised training procedure based on the loss functions described in Sect. 4. When deployed in a new environment, we continuously update the weights of both the expert and the generalizer in an online manner, following a similar scheme as Kuznietsov et al. [16]: (1) Create an image triplet composed of the latest frame It and the two previous frames It−1 and It−2 . Similarly, batch the corresponding velocity measurements.

Continual SLAM

27

(2) Estimate the camera motion between both pairs of subsequent images, i.e., Ot−2t−1 and Ot−1t with the PoseNet. (3) Generate the depth estimate Dt−1 of the previous image with the DepthNet. (4) Compute the loss according to Eq. 7 and use backpropagation to update the weights of the DepthNet and PoseNet. (5) Loop over steps (2) to (4) for c iterations. (6) Repeat the previous steps for the next image triplet. Upon deployment, both the expert and the generalizer are initialized with the same set of parameter weights, initially obtained from pre-training and later replaced by the memory of the generalizer. As illustrated in Fig. 2, the weights of the expert are updated according to the aforementioned algorithm. Additionally, every new frame from the online image stream is added to a replay buffer along with the corresponding velocity reading. Using only the online images, the expert will quickly adapt to the current environment. This behavior can be described as a desired form of overfitting for a myopic increase in performance. On the other hand, the generalizer acts as the long-term memory of CL-SLAM circumventing the problem of catastrophic forgetting in continual learning settings. Here, in step (1), we augment the online data by adding image triplets from the replay buffer to rehearse experiences made in the past, as depicted in Fig. 2. After deployment, the weights of the stored parameters used for initialization are replaced by the weights of the generalizer, thus preserving the continuous learning process of CL-SLAM. The weights of the expert are then discarded.

5 Experimental Evaluation Implementation Details: We adopt the Monodepth2 [6] architecture using separate ResNet-18 [9] encoders for our DepthNet and PoseNet. We implement CL-SLAM in PyTorch [29] and train on a single NVIDIA TITAN X GPU. We pre-train both subnetworks in a self-supervised manner on the Cityscapes dataset [4] for 25 epochs with a batch size of 18. We employ the Adam optimizer with β1 = 0.9, β2 = 0.999 and an initial learning rate of 10−4 , which is reduced to 10−5 after 15 epochs. Further, we resize all images during both pre-training and adaptation to 192 × 640 pixels. Additionally, during the pre-training phase, we mask all potentially dynamic objects using bounding boxes generated by YOLOv5m [12], which was trained on the COCO dataset. We observe that on Cityscapes this procedure yields a smaller validation loss than without masking. We set the minimum predictable depth to 0.1 m without specifying an upper bound. To balance the separate terms in the loss, we set the disparity smoothness weight γ = 0.001 and the velocity loss weight λ = 0.05. During adaptation, we utilize the same hyperparameters as listed above. Inspired by the findings of McCraith et al. [26], we freeze the weights of the encoders. Based on the ablation study in Sect. 5.2, we set the number of update cycles c = 5. To enhance the unsupervised guidance, we use the velocity readings to skip new incoming images if the driven distance is less than 0.2 m. We construct the training batch for the generalizer by concatenating the online data with a randomly sampled image triplet of each environment except for the current environment as this is already represented by the online data. Finally, we add the online data to the replay buffer.

28

N. V¨odisch et al. Table 1. Path accuracy on the KITTI dataset

Online adaptation to KITTI KITTI CL-SLAM sequence terr

rerr

CL-SLAM (w/o loops) terr

rerr

Trained on KITTI seq. {0, 1, 2, 8, 9}

No training

DeepSLAM [18]

VO+vel [6, 8] (w/o loops)

ORB-SLAM

terr

terr

terr

rerr

rerr

rerr

4 – – 4.37 0.51 5.22 2.27 10.72 1.69 0.62 0.11 4.30 1.01 4.41 1.33 4.04 1.40 34.55 11.88 2.51 0.25 5 2.53 0.63 3.07 0.73 5.99 1.54 15.20 5.62 7.80 0.35 6 2.10 0.83 3.74 1.91 4.88 2.14 12.77 6.80 1.53 0.35 7 – – 2.22 0.34 10.77 4.45 55.27 9.50 2.96 0.52 10 Translation error terr in [%] and rotation error rerr in [◦ /100 m]. Sequences 4 and 10 do not contain loops. CL-SLAM is pre-trained on the Cityscapes dataset. The paths computed by ORB-SLAM use median scaling [34] as they are not metric scale. The smallest errors among the learning-based methods are shown in bold.

Datasets: To simulate scenes from a diverse set of environments, we employ our method on three relevant datasets, namely Cityscapes [4], Oxford RobotCar [24], and KITTI [5], posing the additional challenge of adapting to changing camera characteristics. Cityscapes: The Cityscapes Dataset [4] includes images and vehicle metadata recorded in 50 cities across Germany and bordering regions. Due to the unsupervised training scheme of our VO method, we can leverage the included 30-frame snippets to pre-train our networks despite the lack of ground truth poses. Oxford RobotCar: The Oxford RobotCar Dataset [24] focuses on repeated data recordings of a consistent route, captured over the period of one year in Oxford, UK. Besides RGB images, it also contains GNSS and IMU data, which we use for velocity supervision. To compute the trajectory error, we leverage the released RTK ground truth positions. KITTI: The KITTI Dataset [5] provides various sensor recordings taken in Karlsruhe, Germany. We utilize the training data from the odometry benchmark, which includes images and ground truth poses for multiple routes. We further leverage the corresponding IMU data from the released raw dataset to obtain the velocity of the vehicle. 5.1

Evaluation of Pose Accuracy of CL-SLAM

Before analyzing how CL-SLAM addresses the task of continual SLAM, we compare its performance to existing SLAM framework. In particular, in Table 1 we report the translation and rotation errors on sequences 4, 5, 6, 7, and 10 of the KITTI Odometry dataset [5] following Li et al. [18]. Since the IMU data of sequence 3 has not been released, we omit this sequence. We compare CL-SLAM to two learning-based and one feature-based approach. DeepSLAM [18] uses a similar unsupervised learning-based approach consisting of VO and graph optimization but does not perform online adaptation. VO+vel refers to Monodepth2 [6] with velocity supervision [8], i.e., it corresponds to the base VO estimator of CL-SLAM without adaptation and loop closure detection. Both learning-based methods produce metric scale paths and are trained on the

Continual SLAM

29

Table 2. Translation and rotation error for computing the AQ and RQ metrics

sequences 0, 1, 2, 8, and 9. Further, we report the results of monocular ORB-SLAM [28] after median scaling [34]. CL-SLAM outperforms DeepSLAM on the majority of sequences highlighting the advantage of online adaptation. Note that CL-SLAM was not trained on KITTI data but was only exposed to Cityscapes before deployment. To show the effect of global loop closure detection, we report the error on sequences 5 to 7 both with and without graph optimization enabled. Note that sequences 4 and 10 do not contain loops. Compared to ORB-SLAM, CL-SLAM suffers from a higher rotation error but can improve the translation error in sequences 6 and 10. The overall results indicate that general SLAM methods would benefit from leveraging online information to enhance performance. 5.2 Evaluation of Continual SLAM Experimental Setup:In order to quantitatively evaluate the performance of our proposed approach, we compute both the adaptation quality (AQ) and the retention quality (RQ) by deploying CL-SLAM and the baseline methods on a fixed sequence of scenes. In particular, we use the official training split of the Cityscapes dataset to initialize the DepthNet and PoseNet, using the parameters detailed in Sect. 5. The pre-training step is followed by a total of four scenes of the Oxford RobotCar dataset and the KITTI dataset. (ct → k1 → r1 → k2 → r2 ),

(9)

where ct refers to the Cityscapes training set. Following the setup of Li et al. [19], we set k1 and k2 to be sequences 9 and 10 of the KITTI Odometry dataset. Note that we omit loop closure detection for this evaluation to prevent graph optimization from masking the effect of the respective adaptation technique. From the Oxford RobotCar dataset, we select the recording of August 12, 2015, at 15:04:18 GMT due to sunny weather and good GNSS signal reception. In detail, we set r1 to be the scene between frames 750 and 4,750 taking every second frame to increase the driven distance between two consecutive frames. Analogously, we set r2 to

N. V¨odisch et al.

Table 3. Comparison of the adaptation quality (AQ)

Smoothed translation error [m]

30

1.0 0.8 0.6 0.4 0.2 0.0 0

10

20

30

40

50

Frame number fixed : RobotCar fixed : KITTI

fixed : Cityscapes

CL-SLAM: Cityscapes

Fig. 4. The translation error on the initial frames of KITTI sequence 4. Bfixed is trained on the different environments indicating the domain gap between them. CL-SLAM overcomes this issue by performing online adaptation.

be the scene between frames 22,100 and 26,100. We use a scene length of 2,000 frames in order to be similar to the length of KITTI sequences: 1,584 frames for k1 and 1,196 for k2 . Baselines: We compare CL-SLAM to three baselines that are inspired by previous works towards online adaptation to a different environment compared to the environment used during training. As noted in Sect. 3, continual SLAM differentiates from such a setting in the sense that it considers a sequence of different environments. First, Bexpert imitates the strategy employed by Li et al. [19], using a single set of network weights that is continuously updated based on the current data. This corresponds to only using the expert network in our architecture without resetting the weights. Second, Bgeneral follows CoMoDA [16] leveraging a replay buffer built from previously seen environments. This method corresponds to only using the generalizer network. Finally, we compute the error without performing any adaptation, i.e., Bfixed utilizes network weights fixed after the pre-training stage. To further illustrate forward and backward transfer and to close the gap to classical CL, we provide results on an additional baseline Boffline in the supplementary material. This baseline is initialized with the same network weights as CL-SLAM but does not perform online adaptation to avoid masking backward transfer. In reality, it resembles data collection followed by offline training after every new environment. Adapting to New Environments: In the initial part of the evaluation sequence (Eq. 9), the algorithm has to adapt to unseen environments. In accordance to the definition of the AQ in Sect. 3, we construct four sequences listed in the upper four rows of Table 2. Next, we deploy CL-SLAM and the baselines, initialized with the same set of model weights pre-trained on Cityscapes, on each of these sequences and compute the translation and rotation errors. Note that we do not apply median scaling since the PoseNet in our work produces metric estimates due to the velocity supervision term. Further note that for the first deployment after pre-training, Bexpert corresponds to CL-SLAM. We observe that Bexpert yields smaller errors than Bgeneral . This indicates the importance of online

Continual SLAM

(a)

31

(b)

500 Relative translation error [m]

400

z [m]

300 200 100 0

0.5 0.4 0.3 0.2 0.1 0.0

0

200

400

600

0

25

x [m] fixed

50

75

100

125

150

Frame number expert

general

CL-SLAM

Ground truth

Fig. 5. (a) Comparison of the trajectory on k2 after previous deployment on k1 and r1 predicted by CL-SLAM and the baseline methods. The hexagon indicates the starting point. (b) Relative translation error of the first 150 frames along k2 . Compared to Bexpert , CL-SLAM reduces the error more quickly due to initialization with the weights of its generalizer network.

adaptation without diluting the updates with data from unrelated environments, if a high performance on the current deployment is the desideratum, and, thus, supports using the expert network in our approach. To compute the AQ score, after remapping using Eq. 3 we sum the errors and divide by the number of sequences: AQ =

1 (Mctk1 + Mctr1 + Mctr1k1 + Mctk1r1 ) . 4

(10)

Comparing the AQ (see Table 3) for all experiments further endorses the previous findings in a single metric. Notably, continual adaptation is strictly necessary to obtain any meaningful trajectory. Finally, we discuss the effect of consecutive deployments to different environments. In Fig. 4, we plot the translation error of the VO estimates on KITTI sequence 4 without online adaptation, separately trained on the considered datasets, and with adaptation, pre-trained on Cityscapes. As expected, without adaptation, the error is substantially higher if the system was trained on a different dataset showing the domain gap between the environments. By leveraging online adaptation, CL-SLAM reduces the initial error and yields even smaller errors than training on KITTI without further adaptation. Having established the existence of a domain gap, we analyze how the deployment to the current environment effects the future deployment to another environment, resembling the concept of forward transfer (FTW) in continual learning (CL). In detail, Table 2 reveals that the performances of all adaptation-based methods decrease when deploying them to an intermediate environment, e.g., (ct → k1 ) versus (ct → r1 → k1 ), where the effect is most pronounced for Bgeneral . In CL, such behavior is referred to as negative FWT. Remembering Previous Environments: In the subsequent phase of the evaluation sequence (Eq. 9), the algorithm is redeployed in a new scene taken from a previously encountered environment. In accordance to the definition of the RQ in Sect. 3, we

32

N. V¨odisch et al.

construct four sequences listed in the lower four rows of Table 2. Note that the first two sequences are part of the original evaluation sequence (Eq. 9) and the other two sequences are used as a reference to measure the effect of mixed environments. Following the same procedure as in the previous section, we compute the translation and rotation errors. The resulting scores (see Table 2) demonstrate the benefit of employing a replay buffer to leverage previously learned knowledge, Bgeneral yields smaller errors than Bexpert on the majority of sequences. To compute the RQ, we follow Eq. 6: 1

(Mctk1r1k2 − Mctk1k2 ) + (Mctk1r1k2r2 − Mctk1r1r2 ) . 2 (11) Comparing the RQ scores in Table 4 clearly shows that the drop in performance when mixing environments is less pronounced for Bgeneral . Our proposed CL-SLAM leverages this advantage due to its generalizer, while the expert still focuses on the current scene, achieving the highest RQ across the board. To bridge the gap to classical CL, we also qualitatively compare the consecutive deployment to scenes from the same environment with introducing an intermediate scene from another environment, e.g., (ct → k1 → k2 ) versus (ct → k1 → r1 → k2 ). In CL, an increase/decrease in performance is known as positive/negative backward transfer (BWT). Whereas we observe positive BWT for Bgeneral and CL-SLAM on the KITTI dataset, the sequence with final deployment on RobotCar suffers from negative BWT. A possible explanation for this inconsistent behavior is structural differences between the sequences of the same dataset inducing small domain gaps within a dataset that require a potentially more fine-grained scene classification. However, by always performing online adaptation independent of previous deployments, CL-SLAM circumvents such issues. In Fig. 5a, we visualize the generated trajectories in k2 given previous deployment in k1 and r1 from our method and the evaluated baselines. Although Bexpert can reproduce the general shape of the trajectory, it requires a warm-up time causing an initial RQ =

Table 4. Comparison of the Retention Quality (RQ)

Table 5. Ablation study on the number of adaptation cycles

Continual SLAM

33

drift, visible up to frame 40 in Fig. 5b. On the other hand, Bgeneral can leverage the experience from k1 due to the rehearsal of the KITTI data from its replay buffer during the previous deployment in r1 . By following this idea, our proposed CL-SLAM combines the advantages of both baseline strategies. Number of Update Cycles: We perform a brief ablation study on the number of update cycles performed during online adaptation, i.e., how often steps (2) to (4) are repeated for a given batch of data (see Sect. 4). For this, we deploy CL-SLAM to both KITTI sequences k1 and k2 and compute the translation and rotation error. As shown in Table 5, using five update cycles yields the most accurate trajectory while resulting in a 75% reduction in speed compared to a single cycle. However, please note that in this work, we do not focus on adaptation speed but on showing the efficacy of the proposed dualnetwork approach to balance the common continual learning trade-off between quick adaptation and memory retention.

6 Conclusion In this paper, we introduced the task of continual SLAM, which requires the SLAM algorithm to continuously adapt to new environments while retaining the knowledge learned in previously visited environments. To evaluate the capability of a given model to meet these opposing objectives, we defined two new metrics based on the commonly used translation and rotation errors, namely the adaptation quality and the retention quality. As a potential solution, we propose CL-SLAM, a deep learning-based visual SLAM approach that predicts metric scale trajectories from monocular videos and detects global loop closures. To balance short-term adaptation and long-term memory retention, CL-SLAM is designed as a dual-network architecture comprising an expert and a generalizer, which leverages experience replay. Through extensive experimental evaluations, we demonstrated the efficacy of our method compared to baselines using previously proposed continual learning strategies for online adaptation of visual odometry. Future work will focus on transferring the proposed design scheme to more advanced visual odometry methods, e.g., using point matching via optical flow. We further plan to address the currently infinite replay buffer to mitigate the scaling problem, e.g., by storing more abstract representations or keeping only the most representative images.

References 1. Beˇsi´c, B., Gosala, N., Cattaneo, D., Valada, A.: Unsupervised domain adaptation for lidar panoptic segmentation. IEEE Robot. Autom. Lett. 7(2), 3404–3411 (2022) 2. Beˇsi´c, B., Valada, A.: Dynamic object removal and spatio-temporal RGB-D inpainting via geometry-aware adversarial learning. IEEE Trans. Intell. Veh. 7(2), 170–185 (2022) 3. Cattaneo, D., Vaghi, M., Valada, A.: Lcdnet: deep loop closure detection and point cloud registration for lidar slam. IEEE Trans. Rob. 38(4), 1–20 (2022) 4. Cordts, M., et al.: The cityscapes dataset for semantic urban scene understanding. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223 (2016)

34

N. V¨odisch et al.

5. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: IEEE Conference on Computer Vision and Pattern Recognition (2012) 6. Godard, C., Aodha, O.M., Firman, M., Brostow, G.: Digging into self-supervised monocular depth estimation. In: International Conference on Computer Vision, pp. 3827–3837 (2019) 7. Gopalakrishnan, S., Singh, P.R., Fayek, H., Ambikapathi, A.: Knowledge capture and replay for continual learning. In: IEEE Winter Conference on Applications of Computer Vision (2022) 8. Guizilini, V., Ambru, R., Pillai, S., Raventos, A., Gaidon, A.: 3D packing for self-supervised monocular depth estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (2020) 9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) 10. Howard, A., et al.: Searching for mobilenetv3. In: International Conference on Computer Vision, pp. 1314–1324 (2019) 11. Hurtado, J.V., Londo˜no, L., Valada, A.: From learning to relearning: a framework for diminishing bias in social robot navigation. Frontiers Robot. AI 8, 69 (2021) 12. Jocher, G.: Yolov5 (2022). https://github.com/ultralytics/yolov5 13. Kemker, R., Kanan, C.: Fearnet: brain-inspired model for incremental learning. In: International Conference on Learning Representations (2018) 14. Kretzschmar, H., Grisetti, G., Stachniss, C.: Lifelong map learning for graph-based slam in static environments. KI - K¨unstliche Intelligenz 24(3), 199–206 (2010) 15. Kurz, G., Holoch, M., Biber, P.: Geometry-based graph pruning for lifelong SLAM. In: International Conference on Intelligent Robots and Systems, pp. 3313–3320 (2021) 16. Kuznietsov, Y., Proesmans, M., Van Gool, L.: CoMoDA: continuous monocular depth adaptation using past experiences. In: IEEE Winter Conference on Applications of Computer Vision (2021) 17. K¨ummerle, R., Grisetti, G., Strasdat, H., Konolige, K., Burgard, W.: G2o: A general framework for graph optimization. In: International Conference on Robotics and Automation, pp. 3607–3613 (2011) 18. Li, R., Wang, S., Gu, D.: DeepSLAM: a robust monocular SLAM system with unsupervised deep learning. IEEE Trans. Industr. Electron. 68(4), 3577–3587 (2021) 19. Li, S., Wang, X., Cao, Y., Xue, F., Yan, Z., Zha, H.: Self-supervised deep visual odometry with online adaptation. In: IEEE Conference on Computer Vision and Pattern Recognition (2020) 20. Li, S., Wu, X., Cao, Y., Zha, H.: Generalizing to the open world: deep visual odometry with online adaptation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 13179–13188 (2021) 21. Li, Z., Hoiem, D.: Learning without forgetting. IEEE Trans. Pattern Anal. Mach. Intell. 40(12), 2935–2947 (2018) 22. Lopez-Paz, D., Ranzato, M.A.: Gradient episodic memory for continual learning. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 23. Luo, H., Gao, Y., Wu, Y., Liao, C., Yang, X., Cheng, K.T.: Real-time dense monocular SLAM with online adapted depth prediction network. IEEE Trans. Multimedia 21(2), 470– 483 (2019) 24. Maddern, W., Pascoe, G., Linegar, C., Newman, P.: 1 year, 1000 km: the Oxford RobotCar dataset. Int. J. Rob. Res. 36(1), 3–15 (2017) 25. McClelland, J.L., McNaughton, B.L., O’Reilly, R.C.: Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102(3), 419 (1995) 26. McCraith, R., Neumann, L., Zisserman, A., Vedaldi, A.: Monocular depth estimation with self-supervised instance adaptation. arXiv preprint arXiv:2004.05821 (2020)

Continual SLAM

35

27. Mittal, M., Mohan, R., Burgard, W., Valada, A.: Vision-based autonomous UAV navigation and landing for urban search and rescue. In: Asfour, T., Yoshida, E., Park, J., Christensen, H., Khatib, O. (eds.) Robotics Research, vol. 20, pp. 575–592. Springer, Cham (2022). https:// doi.org/10.1007/978-3-030-95459-8 35 28. Mur-Artal, R., Montiel, J.M.M., Tard´os, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Rob. 31(5), 1147–1163 (2015) 29. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems (2019) 30. Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual learning with deep generative replay. In: Advances in Neural Information Processing Systems, vol. 30 (2017) 31. Thrun, S.: Is learning the n-th thing any easier than learning the first? In: Advances in Neural Information Processing Systems, vol. 8 (1995) 32. Zhan, H., Weerasekera, C.S., Bian, J.W., Reid, I.: Visual odometry revisited: what should be learnt? In: International Conference on Robotics and Automation, pp. 4203–4210 (2020) 33. Zhang, Z., Lathuili`ere, S., Ricci, E., Sebe, N., Yang, J.: Online depth learning against forgetting in monocular videos. In: IEEE Conference on Computer Vision and Pattern Recognition (2020) 34. Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and egomotion from video. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 6612–6619 (2017)

Eﬃciently Learning Single-Arm Fling Motions to Smooth Garments Lawrence Yunliang Chen1(B) , Huang Huang1 , Ellen Novoseller1 , Daniel Seita2 , Jeﬀrey Ichnowski1 , Michael Laskey3 , Richard Cheng4 , Thomas Kollar4 , and Ken Goldberg1 1

The AUTOLab, University of California, Berkeley, Berkeley, CA 94720, USA {yunliang.chen,huangr}@berkeley.edu 2 Carnegie Mellon University, Pittsburgh, PA 15213, USA 3 Electric Sheep Robotics, San Francisco, CA 94131, USA 4 Toyota Research Institute, Los Altos, CA 94022, USA https://autolab.berkeley.edu/ Abstract. Recent work has shown that 2-arm “ﬂing” motions can be eﬀective for garment smoothing. We consider single-arm ﬂing motions. Unlike 2-arm ﬂing motions, which require little robot trajectory parameter tuning, single-arm ﬂing motions are very sensitive to trajectory parameters. We consider a single 6-DOF robot arm that learns ﬂing trajectories to achieve high garment coverage. Given a garment grasp point, the robot explores diﬀerent parameterized ﬂing trajectories in physical experiments. To improve learning eﬃciency, we propose a coarse-to-ﬁne learning method that ﬁrst uses a multi-armed bandit (MAB) framework to eﬃciently ﬁnd a candidate ﬂing action, which it then reﬁnes via a continuous optimization method. Further, we propose novel training and execution-time stopping criteria based on ﬂing outcome uncertainty; the training-time stopping criterion increases data eﬃciency while the execution-time stopping criteria leverage repeated ﬂing actions to increase performance. Compared to baselines, the proposed method signiﬁcantly accelerates learning. Moreover, with prior experience on similar garments collected through self-supervision, the MAB learning time for a new garment is reduced by up to 87%. We evaluate on 36 real garments: towels, T-shirts, long-sleeve shirts, dresses, sweat pants, and jeans. Results suggest that using prior experience, a robot requires under 30 min to learn a ﬂing action for a novel garment that achieves 60–94% coverage. Supplementary material can be found at https://sites.google. com/view/single-arm-ﬂing. Keywords: Deformable object manipulation

1

· Garment smoothing

Introduction

Garment smoothing [1] is useful in daily life, industry, and clothing stores as a ﬁrst step toward garment folding [2–4], and hanging, as well as assistive L. Y. Chen and H. Huang—Equal contribution Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25555-7 4. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 36–51, 2023. https://doi.org/10.1007/978-3-031-25555-7_4

Eﬃciently Learning Single-Arm Fling Motions to Smooth Garments

37

Fig. 1. Single-arm dynamic garment flinging. A UR5 robot ﬂings a blue T-shirt. Top row: the robot holding the garment (1), lifts (2) and shakes it vertically (3 left) and horizontally (3 right) to loosen the garment using gravity. Bottom row: The robot ﬂings the garment using the learned parameterized trajectory (Sect. 4) to smooth it on the surface. This cycle repeats until a stopping criterion is satisﬁed.

dressing [5,6]. However, almost all prior research in robot garment manipulation has focused on quasistatic pick-and-place motions [7–13], which are relatively slow and limited by the robot workspace. Yet, robots can perform dynamic actions such as casting, throwing, vaulting [14–16], and air blowing [17] to eﬃciently place objects such as ropes, garments, and bags in desired conﬁgurations, even when much of the object lies outside the robot workspace. Dynamic manipulation of deformable objects is challenging [18–20] as such objects are partially observable, have inﬁnite-dimensional conﬁguration spaces, and have uncertain and complex dynamics due to deformability, elasticity, and friction. To address these modeling challenges, this work explores a learning-based approach for garment smoothing via high-speed dynamic actions that use inertia to move garments beyond the robot workspace. In a pioneering paper, Ha and Song [16] proposed FlingBot, which used dynamic 2-arm “ﬂing” motions to smooth garments. Using scripted motions at constant speed and air resistance, they achieved over 90% coverage for T-shirts. However, dual-arm manipulators may be too costly and bulky for home use. Many mobile manipulators, such as the HSR [21] and the Fetch [22], only have 1 arm. In this work, we show that a robot can use 1-arm ﬂing motions to achieve a coverage of 60–94 % for 6 garment types. Single-arm ﬂings are more diﬃcult than 2-arm ﬂings, since they cannot straighten one garment edge prior to ﬂinging. We propose a single-arm dynamic smoothing pipeline with novel parameterized primitives including shaking and ﬂinging, and we propose an eﬃcient method to learn these parameters. We begin with the robot gripping a ﬁxed location on a crumpled garment. The robot lifts the garment and shakes it horizontally and vertically to loosen and reduce wrinkles. The robot then executes a learned ﬂing motion. It retries the ﬂing until a stopping rule determines that the coverage is suﬃcient. We propose a self-

38

L. Y. Chen et al.

supervised learning method that uses the multi-armed bandit (MAB) framework with Thompson Sampling (TS) to eﬃciently ﬁnd a good candidate ﬂing motion and then reﬁnes the ﬂing parameters using the cross-entropy method (CEM). The robot learns through real-world experiments. While prior work [15,16] shows that simulation can facilitate learning for robot manipulation, state-ofthe-art deformable object simulators have a large reality gap [23], and dynamic garment ﬂings may require modeling additional eﬀects such as material properties and air resistance. Consequently, we directly optimize ﬂinging through real-world online learning and improve data eﬃciency by introducing a stopping rule for online training to threshold the expected performance improvement. Unlike quasistatic actions, each ﬂing results in signiﬁcant variance in coverage due to the inherent aleatoric uncertainty, i.e., non-repeatable variation in outcomes. Thus, we propose a second stopping rule for execution time that stops the robot ﬂinging when the coverage exceeds a learned threshold or maximum number of attempts. This stopping rule diﬀers from the training-time stopping rule, which increases data eﬃciency. This paper makes 4 contributions: 1. A formulation of single-arm garment ﬂinging and a robot system that uses self-supervised learning in real to eﬃciently smooth garments. 2. An online learning framework that operates under large aleatoric uncertainty and enables the robot to quickly adapt to novel garments by leveraging a learned garment category prior and by optimizing ﬂing parameters in a coarseto-ﬁne manner. 3. A training-time stopping rule based on epistemic uncertainty and executiontime stopping rules that exploit the aleatoric uncertainty of dynamic actions to increase expected execution-time ﬂing coverage. 4. Experiments with 36 real garments (30 for training, 6 for testing) suggesting that the robot can learn eﬀective single-arm ﬂing actions achieving average coverage of 60–94 % on novel garments in 30 min.

2

Related Work

Deformable object manipulation has a long history in robotics research [18,24, 25]. This paper focuses speciﬁcally on manipulation of garments. Much prior research on cloth manipulation uses geometric methods for perception and planning [26–30]. While these works show promising results, they make strong assumptions on garment conﬁguration. To simplify planning, many of these works “funnel” garments into known states, for instance iteratively re-grasping the lowest hanging corner or using gravity for vertical smoothing [2,31]. Learning Garment Manipulation. The complexities of crumpled garments make representing their states diﬃcult [32]. Researchers have thus bypassed precise state estimation by using data-driven methods such as imitation learning [33] and reinforcement learning. Prior learning-based approaches for garments mainly

Eﬃciently Learning Single-Arm Fling Motions to Smooth Garments

39

use quasistatic actions (e.g., pick-and-place) for smoothing [7,9–13,34] and folding [8,35,36]. While attaining promising results, they may require many short actions and are limited by the robot’s reach. In contrast, dynamic actions can manipulate garments with fewer actions, including out-of-reach portions. Dynamic Manipulation. In dynamic manipulation, robots leverage highspeed actions [19] such as tossing [20], swinging [37], and blowing [17]. There has been considerable progress on deformable dynamic manipulation, for instance high-speed cable knotting [38,39] and learning-based approaches for high-speed manipulations of cables with a ﬁxed endpoint [14] and a free endpoint [15,40,41]. Work on dynamic garment manipulation includes Jangir et al. [42] and Hietala et al. [43], which both leverage simulators, with the latter transferring to real; in contrast, we do not use simulation due to the reality gap and challenges in Sim2Real transfer [23]. While there have been great strides in deformable simulation with quasistatic actions, simulating the dynamic movement of deformable objects including the eﬀect of aerodynamics remains challenging [44]. Recently, Ha and Song presented FlingBot, which learns to perform a ﬁxedparameter dynamic ﬂing motion to smooth garments using two UR5 arms. FlingBot is initially trained in SoftGym simulation [45] and ﬁne-tuned in real, and learns grasp points for both arms. The results suggest signiﬁcant eﬃciency beneﬁts in attaining high coverage compared to quasistatic pick-and-place actions. In contrast, we consider using a single-arm manipulator and directly learn actions in real. While it is much harder to smooth a garment with only one arm—since two can eﬀectively stretch one side of the garment so that the ﬂing must only straighten the other side—single-arm ﬂing motions can reduce the hardware cost or increase eﬃciency via multiple arms running in parallel.

3

Problem Statement

Workspace Assumptions. We deﬁne a Cartesian coordinate frame containing a single-arm UR5 robot and a manipulation surface parallel to the xy-plane (Fig. 3), faced by an overhead RGB camera. We estimate garment coverage via the overhead RGB images with a manually-measured HSV threshold for each garment, and using the OpenCV package. As in some prior work on dynamic robot manipulation [14,15,43], we assume that the robot gripper holds the garment at a ﬁxed point. During experiments, the initial grasp is set by a human. We select the ﬁxed grasping point by studying the performance of various grasp points with a T-shirt (see Sect. 6.3). The back collar (Fig. 5) yields the most promising results. Thus, in all subsequent experiments, the robot grasps the back collar of shirts and dresses. For towels, the robot grasps the center top point, while for jeans and pants, the robot grasps the center of the waistband’s backside, as these locations are most analogous to the back collar of a T-shirt.

40

L. Y. Chen et al.

Task Objective. The objective is to eﬃciently learn single-arm robot ﬂing motions that maximally increase the coverage of an arbitrarily-crumpled garment. We assume access to training and testing sets of garments, denoted Gtrain and Gtest . The ﬂing motions are learned on the training set. Once training concludes, the robot executes the learned ﬂing motions at execution time. During execution, repeated ﬂinging is allowed up to a repetition limit. We evaluate performance via the average garment coverage achieved by the learned ﬂing action, similar to prior work on dynamic garment manipulation [16] and garment manipulation more generally [12,13]. We also aim to reduce the number of ﬂinging trials needed during training, where each trial corresponds to the execution of a ﬂing cycle as described in Sect. 4. For each test garment, we record the number of ﬂings the algorithm takes to learn the best ﬂing action.

4

Motion Primitives and Pipeline

For each selected ﬂing action, we execute one or more ﬂing cycles to smooth the garment. A ﬂing cycle consists of a reset motion, two shaking motions, and a dynamic arm motion. The gripper is closed during this process. Reset Procedure. To learn eﬀective ﬂing motions for diﬀerent initial garment states, the self-supervised learning procedure performs a reset motion at the start of each ﬂing cycle to randomize the garment’s initial state. The reset has two parts: 1) the robot crumples the garment by lifting to a ﬁxed joint conﬁguration qup and then placing it down at qdown ; and 2) the robot returns to qup to help smooth the garment using gravity [2]. While the garment is not in an identical state after each reset (see website), and the folds can assume diﬀerent shapes and directions, the reset places the garment in a stable position with its center of mass below the gripper, increasing repeatability of subsequent dynamic actions. Shaking Motions. Diﬀerent garment layers can often stick together due to friction, making it more diﬃcult for a ﬂing action to expand the garment. In addition, garments often exhibit twists while hanging, resulting in correlated initial states between trials. Thus, immediately following the reset, we use two shaking motions—horizontal and vertical, shown in Fig. 1—to loosen parts of garments that may be stuck together and to reduce inter-trial correlation. Both shaking motions have a predeﬁned amplitude, and each is repeated 3 times. The period T is chosen to be as brief as possible subject to robot motion control constraints. Vertical shaking plays the more important role of loosening the garments, while horizontal shaking mainly helps to reduce any twists. Fling Parameterization. Inspired by observing humans performing single-arm ﬂings, we parameterize the ﬂing motion using a trajectory as shown in Fig. 3. We control the robot’s movement through both the spatial position and angle of joint 3. Joint 3 starts at point 1 (P1), pulls back to point 2 (P2) to create more

Eﬃciently Learning Single-Arm Fling Motions to Smooth Garments

41

space for the actual ﬂing motion, accelerates from P2 to point 3 (P3) to perform the ﬂing, and ﬁnally lays down the garment by moving to point 4 (P4). All four points lie on the yz-plane (i.e., x = 0). We accelerate learning by reducing the parameter space dimension. During preliminary experiments, varying P2 and P4 did not signiﬁcantly aﬀect results, and hence we ﬁx those points. The ﬂing motion is therefore parameterized by 7 learnable fling parameters, 23 34 23 34 , vmax , P3,y , P3,z , θ, vθ , aθ ) ∈ R7 , where vmax and vmax are maximum joint (vmax velocities for all joints (Fig. 3) while traveling from P2 to P3 and from P3 to P4, respectively; θ and (vθ , aθ ) are the angle, velocity and acceleration of joint 3 at P3; and (0, P3,y , P3,z ) gives the (x, y, z)-coordinates of P3. In FlingBot [16], contrastingly, the authors use a universal scripted ﬂing motion for all garments and instead focus on learning the grasp points for their two arms. We use Ruckig [46], a time-optimal trajectory generation algorithm, to compute a trajectory passing through all waypoints with the desired velocities and accelerations at Points 1–4, while limiting velocity, acceleration, and jerk.

5

Self-Supervised Learning Pipeline for Flinging

To accomplish the task objective, we propose a self-supervised learning procedure to learn the ﬂing parameters for a given garment, optionally using prior experience with similar garments to accelerate learning. 5.1

Learning Procedure of the Fling Action

Due to the high aleatoric uncertainty and continuous parameter space, we propose to explore the ﬂing parameter space in a coarse-to-ﬁne manner. First, we use a multi-armed bandit (MAB) [47] algorithm on discretized bins to quickly identify the best region of the parameter space. Then, we ﬁne-tune the parameters with the cross-entropy method (CEM) [48]. The MAB framework is well-suited for eﬃciently identifying well-performing actions under the high aleatoric uncertainty in ﬂing outcomes, while CEM does not require assuming a speciﬁc form for the underlying distribution. For MAB, we form a set of actions A = {a1 , . . . , aK } by dividing the ﬂinging parameter space into a uniform grid consisting of K grid cells, where each action ak ∈ R7 is a grid cell center and parameterizes a ﬂinging motion. Each parameter’s range is deﬁned based on the robot’s physical limits and human observation (see Sect. 4). During each learning iteration t, the MAB algorithm samples an action a(t) ∈ A to execute in the next ﬂing cycle and observes a stochastic reward rt ∈ [0, 1] measuring post-ﬂing garment coverage, computed using the overhead camera and expressed as a percentage of the garment’s full surface area in its fully-smoothed state. After T steps, the observed actions and rewards form a dataset, DT := {(a(1) , r(1) ), . . . , (a(T ) , r(T ) )}. Learning continues for a predeﬁned iteration limit or until satisfying a stopping rule based on the estimated expected improvement. After the MAB learning procedure, we reﬁne the MAB-chosen action using CEM, a continuous optimization algorithm,

42

L. Y. Chen et al.

where we constrain CEM within the ﬂing parameter grid cell corresponding to the MAB-chosen action. At each iteration, CEM samples a batch of actions from a normal distribution, which are then executed by the robot. The top few actions with the highest coverage (termed the “elites”) are used to update the mean and variance of the CEM sampling distribution for the next CEM iteration. Multi-armed Bandit Algorithm. We use Thompson Sampling (TS) [49] as the MAB algorithm. TS maintains a Bayesian posterior over the reward of each action a ∈ A: p(r | a, Dt ). We use a Gaussian distribution to model both the prior and posterior reward of each action, and initialize the prior to N (0.5, 1) for all actions. On each iteration t, TS draws a reward sample r˜i ∼ p(r | ai , Dt ) for each action ai ∈ A, i ∈ {1, . . . , K}. Then, TS selects the action corresponding ri | i = 1, . . . , K}. to the highest sampled reward: a(t) = aj , j = argmaxi {˜ Stopping Rule During Learning. Physical robot learning can be expensive due to the high execution time and potential risk of collision, making data eﬃciency important. During MAB learning, we balance eﬃciency with performance by stopping online learning when the estimated expected improvement over the highest posterior mean reward of any action falls below a given threshold [50]. Otherwise, the algorithm continues learning until an iteration limit. We calculate the expected improvement analytically [50] as follows, where φ(·) and Φ(·) are the PDF and CDF of the standard normal distribution, respectively, and μ(ai ) and σ(ai ) are the posterior mean and standard deviation of action ai : μ(ai ) − maxi {μ(ai )} . EI(ai ) = μ(ai ) − max{μ(ai )} Φ(Z) + σ(ai )φ(Z), Z = i σ(ai ) 5.2

Adapting to New Garments

To eﬃciently adapt to novel garments, we propose a learning algorithm that leverages prior experience from previously-seen garments. The algorithm consists of a training stage followed by an online learning stage. During training, we collect data from a training dataset Gtrain of m garments from all categories by executing the MAB algorithm and recording the ﬁnal converged distributions of each action’s coverage after MAB learning. During the online learning stage, we use the collected distribution data from garments in Gtrain to set an informed prior for the MAB, rather than using the uninformed initial prior of N (0.5, 1) for all actions. We then apply the MAB with the informed prior for a new test garment from test dataset Gtest . We compare two methods for setting the prior mean and standard deviation based on prior experience: All Garment and Category. All Garment uses the empirical mean and standard deviation of coverage across all training garments for those corresponding actions. Category uses garments of only the same category as the test garment.

Eﬃciently Learning Single-Arm Fling Motions to Smooth Garments

5.3

43

Execution-Time Stopping Rules

Given a novel garment, we execute the best action abest := argmaxa∈A μ(a) identiﬁed by the online learning procedure in Sect. 5.2; this is the action with the highest posterior mean coverage. However, due to aleatoric uncertainty arising from the garment state and air friction, there is signiﬁcant variance in the outcome. To mitigate this uncertainty, the robot repeatedly ﬂings the garment until the current coverage is above a learned threshold. We propose 3 methods for deciding when to stop. All methods use the model posterior over abest ’s cover age, learned during training: p(r | abest , Dtrain ) = N μ(abest ), σ 2 (abest ) , where Dtrain is the dataset collected during training. In this section, stopping rule refers exclusively to execution-time. The ﬁrst two methods set a hard stopping threshold: if the coverage is above a threshold, the robot stops. As diﬀerent garments may have diﬀerent performance levels, for Stopping Rule 1, we set the threshold based on a z-score determined from abest ’s posterior: μ(abest ) + zσ(abest ) for some z > 0. For Stopping Rule 2, we set the threshold based on the one-step expected improvement calculated from the posterior p(r | abest , Dtrain ) and stop if it falls below a given threshold. Stopping Rule 3 accounts for a maximum ﬂinging budget B. We decide to stop by estimating the expected improvement in coverage achievable during the remaining budget. We estimate this quantity using posterior sampling. Let rjexec be the coverage at step j of execution. We draw L sets of samples exec exec , . . . , r˜B,i ∼ p(r | abest , Dtrain ), i ∈ {1, . . . , L}, and approximate the r˜j+1,i L exec exec rj+1,i , . . . , r˜B,i } − rjexec . expected improvement as: EI(abest ) ≈ L1 i=1 max{˜ One could straightforwardly penalize EI(abest ) based on the cost of performing each ﬂing action. Without such a penalty, however, this more-optimistic stopping rule will tend to ﬂing more times than the one-step lookahead rules.

6

Experiments

We use a UR5 robot arm to perform experiments across a total of 36 garments from 6 categories: T-shirts, long-sleeve shirts, towels, dresses, sweat pants, and jeans, with 30 garments in Gtrain and 6 garments in Gtest . See the project website for examples of garments in Gtest at various smoothness levels. We calculate coverage via color thresholding on RGB images from a Logitech Brio 4K webcam. At the start of each experiment batch with a given garment, the human supervisor manually ﬁxes the grasp location. The gripper never opens during the experi23 34 ∈ [2 m/s, 3 m/s], vmax ∈ ments. We use the following parameter ranges: vmax ◦ [1 m/s, 3 m/s], P3,y ∈ [0.55 m, 0.7 m], P3,z ∈ [0.4 m, 0.55 m], θ ∈ [−40 , 20◦ ], vθ ∈ [−1 m/s, 1 m/s], aθ ∈ [−20 m/s2 , 20 m/s2 ]. Each ﬂing cycle takes ∼45 s. To evaluate problem diﬃculty and our algorithm’s eﬀectiveness, we compare with ﬁve human subjects. Results suggest that the robot achieves similar performance to humans. Details of this experiment are shown in the project website.

44

L. Y. Chen et al.

Fig. 2. Physical experiments showing the coverage of a T-shirt (fully-smoothed = 100%) vs. number of ﬂing trials for MAB with CEM reﬁnement (MAB+CEM) compared with baselines. Left: Coverage achieved over time for each optimization algorithm, smoothed by moving window averaging (window size = 10). For CEM and BO with 3 repetitions per sampled action, the shaded region shows the standard deviation among the 3 trials for each action. MAB+CEM converges more quickly than either continuous optimization algorithm alone. Right: The post-convergence coverage distribution of MAB+CEM signiﬁcantly outperforms random action selection.

6.1

Comparison of Learning Methods

We ﬁrst study diﬀerent learning methods. We use a T-shirt and compare the learning performance of MAB+CEM with 3 baselines on that T-shirt: CEM, Bayesian optimization (BO) [50], and random actions. CEM and BO both perform continuous optimization without MAB, and BO uses a Gaussian Process model and the expected improvement acquisition function. The random baseline samples a random action from the continuous parameter range in each trial. For MAB, we deﬁne the action space A via a uniform grid discretization of the trajectory parameters. To keep |A| small, we ﬁx θ, vθ , and aθ at the centers 23 34 , vmax , of their ranges and vary the other 4 parameters, since we notice that vmax P3,y , and P3,z most strongly aﬀect ﬂing performance. We divide this 4-parameter space into a 24 -cell grid and use the resulting grid cell centers as bandit actions. For BO and CEM, we optimize over the entire continuous parameter range. Since BO and CEM are noise sensitive, we repeat each action 3 times, averaging the coverage values. A CEM iteration samples 5 actions and forms the elite from the top 3. We run BO for 70 iterations (70 × 3 = 210 trials) and run CEM for 14 iterations (14 × 5 × 3 = 210 trials). For MAB+CEM, we run MAB for 50 iterations and then CEM for 10 iterations (50 + 10 × 5 × 3 = 200 trials). We also compare BO and CEM without repetition, for 210 and 42 iterations respectively. Results in Fig. 2 suggest that MAB+CEM outperforms CEM and BO with repetition. Due to the large aleatoric uncertainty in each ﬂing trial, as can be seen from the shaded region and the box plot, BO and CEM without repetition perform the worst and are either unstable or fail to converge to a good action.

Eﬃciently Learning Single-Arm Fling Motions to Smooth Garments

45

Flinging Trial Coverage (7-dim vs 9-dim Action Space) 0.9

CEM (7-dim)

BO (7-dim)

MAB + CEM (7-dim)

CEM (9-dim)

BO (9-dim)

MAB + CEM (9-dim)

Coverage

0.8 0.7 0.6 0.5 0.4 0.3

0

25

50

75

100

125

150

175

200

Flinging Trial

Fig. 3. 7-Dim trajectory parameterization. The robot end eﬀector begins at point P1, moves to P2, moves (ﬂings) to P3 at a high speed, and travels downwards to P4, bringing the garment to rest. θ is the angle of joint 3 at P3; (vθ , aθ ) are the velocity and acceleration of joint 3 at P3.

Fig. 4. We conduct physical experiments to compare optimizing in 7 and 9-dimensional parameter spaces, where in the latter, the acceleration parameters are also learnable. Each action is repeated 3 times for CEM and BO. We ﬁnd that 7 parameters are suﬃcient. Table 1. Average coverage at each grasp point by the median and best actions among the 16 bandit actions used for MAB learning.

Fig. 5. 8 T-shirt grasp points.

6.2

Grasp point

Median coverage

Max coverage

Bottom center (1-layer) Side seam Bottom corner Sleeve joint Shoulder Bottom center (2-layer) Sleeve hem Back collar

44% 47% 49% 50% 52% 55% 55% 61%

54% 57% 66% 65% 66% 74% 70% 74%

Comparison of Diﬀerent Trajectory Parameterization Methods

To study the trajectory parameterization, we also compare the learning performance of the 7D parameterization with a 9D parameterization on the same T-shirt used in Sect. 6.1. The 9D parameter space consists of the original 7 34 parameters and two acceleration parameters: a23 max and amax , corresponding to the maximum acceleration for all joints while traveling from P2 to P3 and from P3 to P4, respectively. Results in Fig. 4 indicate that the 7D and 9D parameterizations exhibit similar performance for MAB+CEM, while the 9D parameterization makes BO less stable. MAB+CEM outperforms either CEM or BO alone in both parameterizations, suggesting generalizability of our proposed coarse-toﬁne method.

46

L. Y. Chen et al.

6.3

Results for Diﬀerent Grasp Points

The choice of grasping point can aﬀect single-arm ﬂing performance. We study the eﬀectiveness of single-arm ﬂing actions on a T-shirt with various grasp points: the back collar, sleeve hem, sleeve joint, shoulder, side seam, bottom center (1 layer and 2 layers), and bottom corner (Fig. 5). We apply the MAB algorithm for each grasp point. Table 1 reports the average coverage of the median and best actions, and indicates that the back collar is the best grasp point, followed by the sleeve hem and bottom center (2-layer). Thus, we focus on grasping the back collar, while selecting analogous grasp points for other garment types (Sect. 3). 6.4

Training and Testing Across Diﬀerent Garments

We use MAB+CEM to learn ﬂing actions for diﬀerent garments. For MAB, we set the threshold for the expected improvement (EI) to be 1.5 % because empirically, we ﬁnd that the EI starts to decrease signiﬁcantly more slowly past 1 %. We then reﬁne via 2 iterations of CEM (2 × 5 × 3 = 30 trials). For each garment category, the training set Gtrain includes 5 garments. For each garment in Gtrain , we run MAB with 16 actions for Ntrain = 50 trials. We do not perform the CEM step, as the goal is to form a prior on those 16 actions which we use to accelerate the learning on test garments, instead of ﬁnding the best action on each training garment. MAB Action Coverages among Training Garments for Each Category

1.0

Action 0 Action 1 Action 3

0.9

Action 7 Action 5 Action 12

Coverage

0.8 0.7 0.6 0.5 0.4

Jeans

Towels

Athletic Pants

Dresses

T-shirts

Long-sleeve shirts

Fig. 6. From physical experiments, average coverage achieved for 6 of the 16 bandit actions among all garments in the training set Gtrain , separated by garment category (mean ± 1 std). Each category contains 5 garments. The standard deviations indicate performance variability among diﬀerent garments within a category. While some actions tend to outperform others, the best actions diﬀer for diﬀerent garment categories.

Figure 6 shows the average coverage for 6 of the 16 bandit actions, separated by category. We see that within each garment type, diﬀerent actions lead to signiﬁcant diﬀerences in coverage. Moreover, the relative performance of various

Eﬃciently Learning Single-Arm Fling Motions to Smooth Garments

47

Table 2. Physical experiments with 36 garments demonstrating the eﬃciency of learning ﬂing actions on test garments using category priors. Under each MAB prior, we report the number (#) of bandit trials that MAB takes to reach a stopping threshold (EI < 1.5%), and columns 4, 5, and 9, 10 are the average coverage (mean ± std) achieved by the best MAB-identiﬁed action (prior to CEM) and by MAB+CEM, where we execute each action 20 times. We report identical coverage numbers for MAB+CEM instances in which MAB identiﬁes the same best action, since the CEM procedure is the same. We compare the Uninformed prior (Uninf.) to the All-Garment (All) and Category (Cat.) priors, for which the decrease in MAB trials indicates high garment category informativeness. The p-values are from t-tests comparing whether the mean coverage of the action identiﬁed by MAB+CEM signiﬁcantly improves on the MABidentiﬁed action; p-values compare the diﬀerence in sample means with standard errors, which account for the sample size (20). Thus, while there is large aleatoric uncertainty, as seen from the large standard deviations among the 20 repetitions, the single-digit percentage increase in mean can be statistically signiﬁcant. Prior Type

#

Uninf Jeans

24 94±4% 94±1% 0.509 Towel

MAB +CEM p-val Type

#

32 93±6% 93±7% 0.594

MAB +CEM p-val

All

9 92±1% 93±2% 0.003

16 93±6% 93±7% 0.594

Cat

3 94±4% 94±1% 0.509

12 90±6% 88±4% 0.759

Uninf Dress

40 78±2% 78±2% 0.743 Sweat Pants

30 75±5% 78±4% 0.011

All

30 79±3% 82±3% 0.003

20 77±3% 80±3% 0.016

Cat

6 80±1% 81±2% 0.035

13 77±3% 80±3% 0.016

Uninf T-shirt 32 72±6% 73±3% 0.257 Long-sleeve Shirt 48 57±5% 60±5% 0.060 All

19 73±6% 75±4% 0.115

37 54±6% 61±5% 4e-4

Cat

7 73±6% 75±4% 0.115

14 57±4% 60±4% 0.018

bandit actions is diﬀerent in each category, suggesting that diﬀerent actions are better suited for diﬀerent garment categories. Using the data from Gtrain to inform the MAB prior, we evaluate performance on the garments in Gtest , which contains one new garment from each category. We compare MAB performance with 3 priors: a) the original prior (i.e., “Uninformed”), b) a prior calculated using all the garments in Gtrain (“AllGarment”), and c) a prior calculated only using garments in Gtrain of the same category as the test garment (“Category”). The informed priors (b and c) set the prior mean and standard deviation for each bandit action using the empirical means and standard deviations from training. Table 2 reports results over 20 repetitions. While the diﬀerent priors do not signiﬁcantly aﬀect the ﬁnal coverage achieved by the best-identiﬁed action, the results suggest that the “Category” prior can signiﬁcantly accelerate learning. In particular, with a more informative prior based on Gtrain , a new garment’s EI falls below the termination threshold in 57–87 % fewer MAB iterations than when starting from an uninformed prior. From Table 2, we also see that CEM reﬁnement can usually increase performance by several percentage points over the best MAB-identiﬁed action, with the exception of the jeans and towel, where MAB already identiﬁes high-coverage actions. The p-values test whether the improvements in mean coverage are sta-

L. Y. Chen et al. Stopping Rule 1

Stopping Rule 2 10

8 6 Jeans Towel Dress Sweat Pants T-shirt Long-Sleeve Shirt

4 2

0.0

0.5

1.0

1.5

2.0

z-value

2.5

3.0

Number of Trials Required

Number of Trials Required

10

Stopping Rule 3 Towel Towel Dress Sweat Pants T-shirt Long-Sleeve Shirt

8 6 4 2

0.00

0.02

0.04

0.06

EI Threshold

0.08

0.10

10

Number of Trials Required

48

Jeans Towel Dress Sweat Pants T-shirt Long-Sleeve Shirt

8 6 4 2

0.00

0.02

0.04

0.06

0.08

0.10

EI Threshold

Fig. 7. Number of trials under each execution-time stopping rule as a function of the termination threshold. Left: stopping when the coverage exceeds one standard deviation above the mean. Middle: stopping when the one-step expected improvement falls below a threshold. Right: stopping when the expected improvement over the remaining trial budget falls below a threshold, using a total budget of 10 trials. With repeated ﬂing trials and optimal stopping, the robot can achieve better-than-average results during execution, especially for garments with large aleatoric uncertainty.

tistically signiﬁcant, given the large aleatoric uncertainty of each individual ﬂing trial (whose coverages have standard deviation up to 7 %). We conclude that within 10 MAB iterations (10 trials) and 2 CEM iterations (30 trials), the robot can eﬃciently learn a ﬂing action that achieves 60–94 % coverage for a new garment in under 30 min. For comparison, using two arms, Flingbot [16] achieves 79–93 % coverage on towels and 93–94 % coverage on T-shirts. 6.5

Stopping Time During Execution

To compare the 3 stopping rules proposed in Sect. 5.3 and study the trade-oﬀ between the number of ﬂinging trials and performance, we apply each stopping rule with a range of thresholds by post-processing the execution-time results of the best action found for each garment. For each method, we bootstrap the experiment results and empirically estimate the stopping-time distribution under various stopping thresholds. From the results in Fig. 7, we can see that it takes on average 4 ﬂings to achieve a coverage that is 1 standard deviation above the mean or that results in a one-step EI below 1 % absolute coverage. Stopping based on EI with a budget leads to more optimistic EI and signiﬁcantly more trials. In practice, a heuristically-estimated cost for each additional ﬂing can counter the tendency of this method to ﬂing too many times.

7

Conclusion

We consider single-arm dynamic ﬂinging to smooth garments. We learn ﬂing motions directly in the real world to address complex, diﬃcult-to-model garments such as long-sleeve shirts and dresses. We propose stopping rules during training and execution time. Results suggest that we can learn eﬀective ﬂing actions that achieve up to 94 % average coverage in under 30 min. In future work, we will enhance the current system with quasistatic pick-and-place actions to ﬁne-tune smoothing [9,12] and explore evaluation metrics that judge the quality of dynamic motions by the ease of subsequent quasistatic ﬁne-tuning.

Eﬃciently Learning Single-Arm Fling Motions to Smooth Garments

49

Acknowledgements. This research was performed at the AUTOLAB at UC Berkeley in aﬃliation with the Berkeley AI Research Lab, the CITRIS “People and Robots” (CPAR) Initiative, and the Real-Time Intelligent Secure Execution (RISE) Lab. The authors were supported in part by donations from Toyota Research Institute, Google, Siemens, Autodesk, and by equipment grants from PhotoNeo, NVidia, and Intuitive Surgical.

References 1. Sun, L., Aragon-Camarasa, G., Rogers, S., Siebert, J.P.: Accurate garment surface analysis using an active stereo robot head with application to dual-arm ﬂattening. In: IEEE International Conference on Robotics and Automation (2015) 2. Maitin-Shepard, J., Cusumano-Towner, M., Lei, J., Abbeel, P.: Cloth grasp point detection based on multiple-view geometric cues with application to robotic towel folding. In: IEEE International Conference on Robotics and Automation (2010) 3. Miller, S., Van Den Berg, J., Fritz, M., Darrell, T., Goldberg, K., Abbeel, P.: A geometric approach to robotic laundry folding. Int. J. Rob. Res. 31, 249–267 (2012) 4. Balaguer, B., Carpin, S.: Combining Imitation and Reinforcement Learning to Fold Deformable Planar Objects. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2011) 5. Erickson, Z., Gangaram, V., Kapusta, A., Liu, C.K., Kemp, C.C.: Assistive gym: a physics simulation framework for assistive robotics. In: IEEE International Conference on Robotics and Automation (2020) 6. Erickson, Z., Clever, H.M., Turk, G., Liu, C.K., Kemp, C.C.: Deep haptic model predictive control for robot-assisted dressing. In: IEEE International Conference on Robotics and Automation (2018) 7. Lin, X., Wang, Y., Huang, Z., Held, D.: Learning visible connectivity dynamics for cloth smoothing. In: Conference on Robot Learning (2021) 8. Weng, T., Bajracharya, S.M., Wang, Y., Agrawal, K., Held, D.: FabricFlowNet: bimanual cloth manipulation with a ﬂow-based policy. In: Conference on Robot Learning (2021) 9. Yilin, W., Yan, W., Kurutach, T., Pinto, L., Abbeel, P.: Learning to manipulate deformable objects without demonstrations. In: Robotics Science and Systems (2020) 10. Yan, W., Vangipuram, A., Abbeel, P., Pinto, L.: Learning predictive representations for deformable objects using contrastive estimation. In: Conference on Robot Learning (2020) 11. Lippi, M., et al.: Latent space roadmap for visual action planning of deformable and rigid object manipulation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2020) 12. Seita, D., et al.: Deep imitation learning of sequential fabric smoothing from an algorithmic supervisor. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2020) 13. Seita, D., et al.: Learning to rearrange deformable cables, fabrics, and bags with goal-conditioned transporter networks. In: IEEE International Conference on Robotics and Automation (2021) 14. Zhang, H., Ichnowski, J., Seita, D., Wang, J., Huang, H., Goldberg, K.: Robots of the lost arc: self-supervised learning to dynamically manipulate ﬁxed-endpoint ropes and cables. In: IEEE International Conference on Robotics and Automation (2021)

50

L. Y. Chen et al.

15. Lim, V., et al.: Real2Sim2Real: self-supervised learning of physical single-step dynamic actions for planar robot casting. In: IEEE International Conference on Robotics and Automation (2022). https://doi.org/10.1109/ICRA46639.2022. 9811651 16. Ha, H., Song, S.: FlingBot: the unreasonable eﬀectiveness of dynamic manipulation for cloth unfolding. In: Conference on Robot Learning (2021) 17. Zhenjia, X., Chi, C., Burchﬁel, B., Cousineau, E., Feng, S., Song, S.: Dextairity: deformable manipulation can be a breeze. In: Robotics Science and Systems (2022) 18. Sanchez, J., Corrales, J.A., Bouzgarrou, B.C., Mezouar, Y.: Robotic manipulation and sensing of deformable objects in domestic and industrial applications: a survey. Int. J. Rob. Res. 37, 688–716 (2018) 19. Lynch, K.M., Mason, M.T.: Dynamic nonprehensile manipulation: controllability, planning, and experiments. Int. J. Rob. Res. 18(1), 64–92 (1999) 20. Zeng, A., Song, S., Lee, J., Rodriguez, A., Funkhouser, T.: TossingBot: learning to throw arbitrary objects with residual physics. In: Robotics Science and Systems (2019) 21. Hashimoto, K., Saito, F., Yamamoto, T., Ikeda, K.: A ﬁeld study of the human support robot in the home environment. In: IEEE Workshop on Advanced Robotics and its Social Impacts (2013) 22. Wise, M., Ferguson, M., King, D., Diehr, E., Dymesich, D.: Fetch & freight: standard platforms for service robot applications. In: IJCAI Workshop on Autonomous Mobile Service Robots (2016) 23. Jakobi, N., Husbands, P., Harvey, I.: Noise and the reality gap: the use of simulation in evolutionary robotics. In: Mor´ an, F., Moreno, A., Merelo, J.J., Chac´ on, P. (eds.) ECAL 1995. LNCS, vol. 929, pp. 704–720. Springer, Heidelberg (1995). https:// doi.org/10.1007/3-540-59496-5 337 24. Borras, J., Alenya, G., Torras, C.: A grasping-centered analysis for cloth manipulation. arXiv preprint arXiv:1906.08202 (2019) 25. Zhu, J., et al.: Challenges and outlook in robotic manipulation of deformable objects. arXiv preprint arXiv:2105.01767 (2021) 26. Bai, Y., Wenhao, Y., Liu, C.K.: Dexterous manipulation of cloth. In: European Association for Computer Graphics (2016) 27. Li, Y., Xu, D., Yue, Y., Wang, Y., Chang, S. F., Grinspun, E.: Regrasping and unfolding of garments using predictive thin shell modeling. In: IEEE International Conference on Robotics and Automation (2015) 28. Doumanoglou, A., Kargakos, A., Kim, T.K., Malassiotis, S.: Autonomous active recognition and unfolding of clothes using random decision forests and probabilistic planning. In: IEEE International Conference on Robotics and Automation (2014) 29. Kita, Y., Ueshiba, T., Neo, E.S., Kita, N.: Clothes state recognition using 3D observed data. In: IEEE International Conference on Robotics and Automation (2009) 30. Kita, Y., Ueshiba, T., Neo, E.S., Kita, N.: A method for handling a speciﬁc part of clothing by dual arms. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2009) 31. Cusumano-Towner, M., Singh, A., Miller, S., O’Brien, J.F., Abbeel, P.: Bringing clothing into desired conﬁgurations with limited perception. In: IEEE International Conference on Robotics and Automation (2011) 32. Chi, C., Song, S.: GarmentNets: category-level pose estimation for garments via canonical space shape completion. In: The IEEE International Conference on Computer Vision (2021)

Eﬃciently Learning Single-Arm Fling Motions to Smooth Garments

51

33. Argall, B.D., Chernova, S., Veloso, M., Browning, B.: A survey of robot learning from demonstration. Rob. Auton. Syst. 57, 469–483 (2009) 34. Seita, D., et al.: Deep transfer learning of pick points on fabric for robot bedmaking. In: Asfour, T., Yoshida, E., Park, J., Christensen, H., Khatib, O. (eds.) Robotics Research. ISRR 2019. Springer Proceedings in Advanced Robotics, vol. 20, pp. 275–290. Springer, Cham (2019). https://doi.org/10.1007/978-3-03095459-8 17 35. Hoque, R., et al.: VisuoSpatial foresight for multi-step, multi-task fabric manipulation. In: Robotics Science and Systems (2020) 36. Lee, R., Ward, D., Dasagi, V., Cosgun, A., Leitner, J., Corke, P.: Learning arbitrary-goal fabric folding with one hour of real robot experience. In: Conference on Robot Learning (2020) 37. Wang, C., Wang, S., Romero, B., Veiga, F., Adelson, E.: SwingBot: learning physical features from in-hand tactile exploration for dynamic swing-up manipulation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2020) 38. Yamakawa, Y., Namiki, A., Ishikawa, M.: Motion planning for dynamic knotting of a ﬂexible rope with a high-speed robot arm. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2010) 39. Yamakawa, Y., Namiki, A., Ishikawa, M.: Dynamic high-speed knotting of a rope by a manipulator. Int. J. Adv. Rob. Syst. 10, 361 (2013) 40. Zimmermann, S., Poranne, R., Coros, S.: Dynamic manipulation of deformable objects with implicit integration. IEEE Robo. Autom. Lett. 17, 4209–4216 (2021) 41. Chi, C., Burchﬁel, B., Cousineau, E., Feng, S., Song, S.: Iterative residual policy for goal-conditioned dynamic manipulation of deformable objects. In: Robotics Science and Systems (2022) 42. Jangir, R., Alenya, G., Torras, C.: Dynamic cloth manipulation with deep reinforcement learning. In: IEEE International Conference on Robotics and Automation (2020) 43. Hietala, J., Blanco-Mulero, D., Alcan, G., Kyrki, V.: Closing the sim2real gap in dynamic cloth manipulation. arXiv preprint arXiv:2109.04771 (2021) 44. Narang, Y., Li, Y., Macklin, M., McConachie, D., Wu, J.: RSS 2021 workshop on deformable object simulation in robotics (2021). http://sites.google.com/nvidia. com/do-sim. Accessed 12 Sep 2021 45. Lin, X., Wang, Y., Olkin, J., Held, D.: SoftGym: benchmarking deep reinforcement learning for deformable object manipulation. In: Conference on Robot Learning (2020) 46. Berscheid, L., Kr¨ oger, T.: Jerk-limited real-time trajectory generation with arbitrary target states. In: Robotics Science and Systems (2021) 47. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47(2), 235–256 (2002). https://doi.org/10.1023/A: 1013689704352 48. Rubinstein, R., Kroese, D.: The Cross-Entropy Method: A Uniﬁed Approach to Combinatorial Optimization, Monte-Carlo Simulation, and Machine Learning. Springer-Verlag, Cham (2004) 49. Thompson, W.R.: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3/4), 285–294 (1933) 50. Snoek, J., Larochelle, H., Adams, R.P.: Practical Bayesian optimization of machine learning algorithms. In: Neural Information Processing Systems (2012)

Learning Long-Horizon Robot Exploration Strategies for Multi-object Search in Continuous Action Spaces Fabian Schmalstieg(B) , Daniel Honerkamp, Tim Welschehold, and Abhinav Valada Department of Computer Science, University of Freiburg, Freiburg im Breisgau, Germany [email protected]

Abstract. Recent advances in vision-based navigation and exploration have shown impressive capabilities in photorealistic indoor environments. However, these methods still struggle with long-horizon tasks and require large amounts of data to generalize to unseen environments. In this work, we present a novel reinforcement learning approach for multi-object search that combines short-term and long-term reasoning in a single model while avoiding the complexities arising from hierarchical structures. In contrast to existing multi-object search methods that act in granular discrete action spaces, our approach achieves exceptional performance in continuous action spaces. We perform extensive experiments and show that it generalizes to unseen apartment environments with limited data. Furthermore, we demonstrate zero-shot transfer of the learned policies to an office environment in real world experiments.

1 Introduction Exploration and navigation of unmapped 3D environments is an important task for a wide range of applications across both service and industrial robotics. Research in Embodied AI has made substantial progress in integrating high-dimensional observations in a range of navigation and exploration tasks [4, 8, 24]. Recent work has introduced several tasks around multi-object search and exploration [11, 27]. These tasks are particularly challenging in unmapped environments, as they require balancing longterm reasoning about where to go with short-term control and collision avoidance. Without prior knowledge of a floor plan, there is often no obvious optimal policy. Furthermore, the combination of complex observation space and long horizons remains an open problem with success rates quickly decreasing as the distance to the goal or the number of objects in the task increases. F. Schmalstieg and D. Honerkamp—These authors contributed equally. This work was funded by the European Union’s Horizon 2020 research and innovation program under grant agreement No 871449-OpenDR.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25555-7 5. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 52–66, 2023. https://doi.org/10.1007/978-3-031-25555-7_5

Learning Long-Horizon Robot Exploration

53

Fig. 1. Starting in an unexplored map and given a set of target objects, the robot faces the complex decision on how to most efficiently find these objects. Our approach continuously builds a semantic map of the environment and learns to combine long-term reasoning with short-term decision making into a single policy by predicting the direction of the path towards the closest target object, shown in green. Note that the agent does not receive the path consisting of the waypoints or the location of the objects.

Previous work has in particular focused on constructing rich map representations and memories for these agents [11, 26, 27] and demonstrated their benefits while acting in granular discrete action spaces. The long-horizon nature of these tasks poses a significant challenge for learning-based methods, as they struggle to learn longer-term reasoning and to explore efficiently over long horizons. This problem is strongly exacerbated in fine-grained continuous action spaces. A common strategy to mitigate this challenge is to instead learn high-level waypoints which are then fed to a low-level controller [5, 7]. While this can simplify the learning problem by acting in a lifted MDP with a much shorter horizon length, it limits the ability of the agent to simultaneously learn to control other aspects such as its camera or arms at a higher control frequency. During inference, high-level actions can be taken at arbitrary frequencies by executing them in a model predictive control style manner but the agent is bound to a low control frequency during training. In this work, we present a novel approach to reason about long-horizon exploration while learning to directly act in continuous action spaces at high control frequencies and demonstrate its effectiveness in multi-object search tasks. Our approach learns to predict the direction of the path towards the closest target object. It then learns a policy that observes this prediction, enabling it to express long-term intentions while taking short-term actions based on the full context. As a consequence, the policy can incorporate expected inaccuracies and uncertainties in the predictions and balance strictly

54

F. Schmalstieg et al.

following its intentions with short-term exploration and collision avoidance. Figure 1 illustrates the task and our approach. To jointly learn this behavior with a single model, we introduce a learning curriculum that balances groundtruth and learned intentions. In contrast to many learning-based navigation approaches building on complex features from color or depth images that make generalization to unseen environments hard and data intensive [15], we learn purely from structured, semantic map representations that have more universal features and are therefore more independent of the specific training environment and show only minimal sim-to-real gap. We perform extensive experimental evaluations and demonstrate that our proposed approach achieves strong generalization to unseen environments from a very small number of training scenes and directly transfers to the real world. Lastly, the combination of these simple inputs and expressive navigation intentions ensures good interpretability of the agent’s decisions and failures. To the best of our knowledge, this is the first real-world demonstration of learning-based multi-object search tasks. To summarise, this paper makes the following contributions: • We present a novel search agent that unifies long-horizon decision making and frequent low-level control into a single time-scale and model. • We propose the first multi-object search task in continuous action space. • We demonstrate that our approach is capable to learn from very limited training data and achieves strong generalization performance in unseen apartment environments and zero-shot transfer to the real world. • We make the code publicly available at http://multi-object-search.cs.uni-freiburg. de.

2 Related Work Embodied AI Tasks: Navigation in unmapped 3D environments has attracted a lot of attention in recent work on embodied AI, covering a range of tasks. In PointGoal navigation [4, 24], the agent at each step receives the displacement vector to the goal that it has to reach. Whereas, in AudioGoal navigation [6, 29], the agent at each step receives an audio signal emitted by a target object. Conversely, in ObjectGoal navigation [5, 10, 22, 31], the agent receives an object category that it has to navigate to. Extending the ObjectGoal task, Beeching et al. [2] propose an ordered k-item task in a Viz Doom environment, in which the agent has to find k items in a fixed order. Similarly in the MultiOn task [27] the agent has to find k objects in realistic 3D environments. Fang et al. [11] propose an object search task in which the agent has to find k items in arbitrary order. Here the target object locations are defined by the given dataset, i.e. are in a fixed location in each apartment. In contrast, we use a random distribution over the target locations, strongly increasing the diversity. While all of these works focus on discrete actions (move forward, turn left, turn right, stop), we show that our approach can directly learn in the much larger continuous action space and demonstrate the direct transfer to the real world.

Learning Long-Horizon Robot Exploration

55

Object Search and Exploration: Approaches for multi-object search and navigation tasks fall into two categories: implicit memory agents and agents that explicitly construct a map of the environment as a memory. Agents without an explicit map include direct visual end-to-end learning from RGB-D images as well as FRMQN [21] and SMT [11] which store an embedding of each observation in memory, then retrieve relevant information with an attention mechanism. On the other side of the spectrum, we have agents that project the RGB-D inputs into a global map, building up an explicit memory. They then commonly extend this representation with semantic object annotations [2, 13]. Wani et al. [27] provide a comprehensive comparison of these methods. While they train separate agents for each number of target objects, we train a single agent that generalizes to different numbers of target objects. SGoLAM [16] combine a mapping and a goal detection module. Then either employ frontier exploration if no goal object is in sight or if a goal is detected, they use a D*-planner to move closer to the goal. They achieve strong results without any learning component. Closely related to object search, previous work has also focused on pure exploration of realistic apartments. This includes frontier based approaches [28] as well as reinforcement learning with the aim to maximise coverage [8]. Learning Long Horizon Goals: Multi-object search and exploration with an embodied agent combine long-horizon thinking with short-horizon control to navigate and avoid collisions. This can pose a challenge for learning-based approaches. This has previously been mitigated by learning higher-level actions at a lower control frequency such as learning to set waypoints or to directly predict task goals [20], which then get passed down to a lower-level planner for navigation [5, 7]. While this shortens the horizon of the MDP the agent is acting in, it makes it difficult to learn to simultaneously control other aspects at a lower time scale, such as controlling a camera joint or a manipulator arm. Our approach directly learns at a high control frequency and as such can directly be extended to integrate such aspects. In our experiments, we furthermore demonstrate that direct prediction of the goal locations does not generalize well for multi-object search. The long horizon problem is further exacerbated in continuous control tasks. While most existing work focuses on granular discrete actions [5, 11, 16, 27], our approach succeeds in a continuous action space. Mapping: Spatial maps built with Simultaneous Localization and Mapping (SLAM) have been used for tasks such as exploration [3, 30] and FPS games [2]. Both occupancy and semantic maps have commonly been used in embodied AI tasks [16, 27] and several works have presented approaches to build such maps in complex environments [4, 5]. In our approach, we assume access to a method to build such maps.

3 Learning Long-Horizon Exploration In this section, we first define the multi-object search task and formulate it as a reinforcement-learning problem in Sect. 3.1. We then introduce our approach for learning a novel predictive task and an effective method to jointly learn low- and high-level reasoning in Sect. 3.2.

56

3.1

F. Schmalstieg et al.

Problem Statement

In each episode, the agent receives a list of up to k object categories, drawn randomly from a total set of c categories. It then has to search and navigate to these objects which are spawned randomly within an unmapped environment. An object is considered found if the agent has seen it and navigates up to a vicinity of 1.3 m of the target object. In contrast to MultiOn [27], we require a single agent to learn to find variable numbers of target objects and focus on unordered search, meaning the agent can find these objects in any order it likes. On one hand, this provides the agent with more freedom. On the other hand, the optimal shortest-path policy is non-trivial, even in a mapped environment, making this a very challenging task to solve optimally. This can be formulated as a goal-conditional Partially Observable Markov Decision Process (POMDP) M = (S, A, O, T (s |s, a), P (o|s), R(s, a, g)), where S, A and O are the state, action and observation spaces, T (s |s, a) and P (o|s) describe the transition and observation probabilities and R(s, a, g) is the reward function. At each step, the agent receives a visual observation o from an RGB-D and semantic camera together with a binary vector g indicating which objects it must find. Its aim is to learn a polT icy π(a|o, g) that maximises the discounted, expected return Eπ [ t=1 γ t R(st , at , g)], where γ is the discount factor. The agent acts in the continuous action space of a LoCoBot robot, controlling the linear and angular velocities of its differential drive. During training, these actions are executed at a control frequency of 10 Hz. The agent receives a sparse reward of 10 whenever it finds a target object. It furthermore receives a dense time penalty of −0.0025 per step, a distance reward for getting closer to the next target object, and a collision penalty of −0.1. An episode ends after successfully finding all objects, exceeding 600 collisions or exceeding 3500 steps. 3.2

Technical Approach

We propose a reinforcement learning approach that consists of three components: a mapping module, a predictive module to learn long-horizon intentions, and a reinforcement learning policy. The full approach is depicted in Fig. 2. Mapping: The mapping module uses the 128 × 128 pixels depth and semantic camera to project the points into a local top-down map. With this and its current localization, it then updates an internal global map which is further annotated with the agent’s trace and encoded into a standard RGB image. In each step, the agent then extracts an egocentric map from it and passes two representations of this map to the encoder: a coarse map of dimension 224 × 224 × 3 at a resolution of 6.6 cm and a fine-grained map of dimension 84 × 84 × 3 at a resolution of 3.3 cm. Meaning they cover 14.8 m × 14.8 m and 2.77 m × 2.77 m, respectively. After the agent has segmented an object correctly, it updates the object’s annotation to a fixed color-coding to mark the corresponding object as “found”. The coarse map is then encoded into a 256-dimensional feature vector and the fine map into a 128-dimensional feature vector using a convolutional neural network, before being concatenated into the feature embedding ft . The coarse map is passed through a ResNet-18 [12] pre-trained on ImageNet [9], while the local map is encoded by a much simpler three-layer CNN with 32, 64, and 64 channels and strides 4, 2, and 1.

Learning Long-Horizon Robot Exploration

57

Fig. 2. Our proposed model architecture. The mapping module aggregates depth and semantic information into a global map. The predictive module learns long-horizon relationships which are then interpreted by a reinforcement learning policy. During training, the agent receives a state vector with either the groundtruth direction to the closest object αt or its prediction αˆt . At test time it always receives its prediction. It furthermore receives its previous predictions α ˆ t−18:t−2 , the variances of its x- and y-position v1 and v2 , the circular variance of its predictions v3 , a collision flag d1 , the sum over the last 16 collisions d2 , its previous action at , and a binary vector g indicating the objects the agents have to find.

Learning Long-Horizon Reasoning: While directly learning low-level actions has shown success in granular discrete environments [11, 27], this does not scale to continuous environments as we show in Sect. 4. We hypothesize that this is due to being unable to explore the vast state-action space efficiently. To learn long-horizon reasoning within a single model, we introduce a predictive task to express the agent’s navigation intentions. In particular, the agent learns to predict the direction of the shortest path to the currently closest target object. It does so by estimating the angle to a waypoint generated by an A∗ planner in a distance of roughly 0.4 m–0.55 m from the agent (varying due to the discretized grid of the planner) as illustrated in Fig. 1. To avoid waypoints that are too close to walls or obstacles for the LoCoBot’s base of radius 0.18 m to reach safely, we generate the plan in a map inflated by 0.2 m. The prediction is parameterized as a neural network α ˆ = fpred (ft , srobot,t , g) that takes a shared feature encoding ft from the map encoding, the robot state srobot,t without the groundtruth vector αt nor the prediction α ˆ t and the goal objects g and predicts a vector of probabilities over the discretized angles to this waypoint. In particular, we discretize the angle into 12 bins and normalize the outputs with a softmax function. This task is used both in the form of an auxiliary task to shape representation learning, propagating gradients into a shared map encoder, as well as a recursive input to the agent, allowing it to guide itself as the agent observes the previous step’s predictions both as an overlay in the ego-centric map and the robot state. The predictive module minimizes the cross-entropy between the predictions and a one-hot encoding of the groundtruth angle α given by Lpred =

N 1 αi log α ˆi. N i=1

(1)

58

F. Schmalstieg et al.

The discrete distribution enables the agent to cover multi-modal hypotheses, for example when standing on a floor with several unexplored directions. As a result, the prediction can vary more smoothly over time in contrast to commonly used uni-modal distributions such as a Gaussian, which can fluctuate rapidly if the most likely direction changes to another door. Policy: The policy receives the concatenated features ft of the flattened map encodings, the robot state, and a history of its predictions. The policy is then learned with Proximal Policy Optimization (PPO) [25]. The actor and critic are parameterized by a two-layer MLP network with 64 hidden units each and a gaussian policy. The total loss L given by (2) L = Lppo + λLpred , is jointly minimized with the Adam optimizer [17]. Training: During training we have the choice to provide the policy either with the groundtruth direction or its prediction: Providing it with the groundtruth vector results in strong coupling between the agent’s navigation and the maximum value of the vector, which points to the waypoint corresponding to the closest next object. Hence, when deploying the agent with the auxiliary prediction, the agent blindly follows its predictions. Instead, we want to learn a policy that can, on one hand, react to suboptimal intentions and on the other hand can learn more optimal paths than simply finding the closest object. For such behavior, it is desirable to train the policy directly with its predictions. To avoid instabilities from initially very suboptimal predictions, we introduce a learning curriculum that balances observing the groundtruth and the agent’s predictions. The curriculum starts with a probability of 16% to observe an entire episode with the prediction and otherwise receives the groundtruth. Once the agent has exceeded a success rate of 50%, we linearly increase this probability by 2% every 4 episodes up to a maximum of 72%. This curriculum enables the robot to react to prediction errors and correct its navigation accordingly. To avoid learning a simple mapping from groundtruth to prediction during training, this α is never observed by the predictive module and only passed to the policy. In specific situations where it is hard to assess a global strategy, the agent sometimes predicts alternating auxiliary angles. We hypothesize that the reason for these highly alternating predictions is that the training episodes run purely with predictions, suffer from suboptimal predictions and navigation and will therefore accumulate more errors contributing to the overall loss. We test this by disabling the gradient flow from the predictive head back into the rest of the network for the episodes with predictions. We observe that this leads to a lower prediction loss. But at the same time the predictions can no longer shape the policy’s representations and thus, the agent at times struggles with “imperfect” episodes, when deployed. This is reflected, in a tremendous drop in the success rate when executing prediction episodes. We conclude that the gradient flow from all episodes is a crucial component.

Learning Long-Horizon Robot Exploration

59

With the focus on structured map inputs and the learning curriculum enables, we are able to train the agent in the comparatively small number of 4,500,000 steps.

4 Experimental Evaluations To demonstrate the effectiveness of our approach, we perform extensive evaluations in simulation and the real world. With these experiments we aim to answer the following questions: • Does the learned behavior generalize to unseen environments? • Does the agent learn to efficiently use the long-term prediction? Does this lead to more efficient exploration than using alternative approaches? • Does the learned behavior generalize to the real world? 4.1 Experimental Setup We train a LoCoBot robot in the iGibson environment [19]. The LoCoBot has a differential drive and is equipped with an RGB-D camera with a field of view of 79◦ and a maximum depth of 5.6 m. The action space consists of the linear and angular velocities for the base. We construct tasks of finding 1–6 target objects, matching the hardest setting in previous work [27]. We use the same eight training scenes as the iGibson challenge1 and use the remaining seven unseen apartments for evaluation. As larger datasets such as Matterport3D or Gibson currently do not support semantic camera observations, we leave evaluations on these to future work. The PPO agent is based on an open-source implementation [23]. Evaluation Metrics: We focus on two metrics: the ability to find all target objects, defined by the success rate, and the optimality of the search paths, measured by the success weighted by Path Length (SPL) [1]. We evaluate each scene for 75 episodes, which results in a total of 600 episodes for the train and 525 for the test set for each approach. For the learning based approaches we evaluate the best performing checkpoint on the training scenes. We report the mean over two training seeds for our approach. Baselines: To test the effectiveness of learning long-range navigation intentions, we compare our approach against a range of action parametrizations as well as a recent non-learning based state-of-the-art approach. Map-only represents the standard end-to-end reinforcement learning approach and conceptually matches competitive baselines from previous work. It receives the same map and robot state inputs as our agent and acts directly in the action space of the differential drive, but does not learn to predict long-horizon intentions. Goal-prediction Instead of predicting the direction to the next waypoint towards the target, we also evaluate directly predicting the location of the next target object. The agent learns to predict the angle and distance of the next closest target object relative to its current base frame (Table 1). 1

http://svl.stanford.edu/igibson/challenge2020.html.

60

F. Schmalstieg et al.

Table 1. Hyperparameters used for training. One sensitive parameter is ppo epoch in combination with the clip parameter. Setting the parameter too high causes some behaviour which is similar to catastrophic forgetting. Parameter

Value Parameter

Value

Clip param

0.1

γ

0.99

Ppo epoch

4

lr

0.0001

Num mini batch 64

Max grad norm 0.5

Value loss coef

0.5

Optimizer

Entropy coef

0.005

Adam

SGoLAM [16] combines non-learning based approaches to achieve very strong performance on the CVPR 2021 MultiOn challenge. It explores the map with frontier exploration until it localizes a target object, then switches to a planner to navigate to the target. We reimplement the author’s approach for continuous action spaces, closely following the original implementation where possible. While the original implementation relies on two threshold values, namely and δ, for goal localization, we directly use the semantic camera which finds objects more reliably. Due to this, our implementation improves the performance of SGoLAM. 4.2

Simulation Experiments

To test the ability to learn long-horizon exploration, we first evaluate the approaches on the seen apartments. Table 2 reports the results for all approaches across different numbers of target objects. While the map-only approach that purely learns raw continuous actions finds around three quarters of the single objects, the success rates quickly deteriorate with more target objects. We hypothesize that this is due to the agent getting lost in the vast continuous action space, being unable to meaningfully explore the space. Directly predicting the goal locations slightly improves the performance, but still fails in the majority of cases with five or more objects. This indicates that the agent is not able to meaningfully predict goal locations, as there are a large number of valid hypotheses while the environment is still largely unexplored. SGoLAM achieves a better performance, yielding an overall success rate of 65.3%. Note that this approach does not rely on a learning component, therefore it has not encountered any of these apartments before. In contrast to the other learning-based approaches, our agent is able to consistently solve this task even for six target objects. In terms of path optimality measured by the SPL in Table 2, we observe a similar case across the approaches. While outperforming the other approaches, both ours and SGoLAM achieve low absolute values. However, note that a perfect SPL would require to directly follow the shortest path to all objects and as such is not achievable without knowledge of the object positions. Additionally, one needs to bear in mind that the SPL metric is not guaranteed to find the most efficient path as it computes the path in a greedy manner.

Learning Long-Horizon Robot Exploration

61

Table 2. Evaluation of seen environments, reporting the success rate (top) and SPL (bottom). Model

1-obj 2-obj 3-obj 4-obj 5-obj 6-obj Avg 1-6

Success Map-only Goal prediction SGoLAM Ours

75.0 78.1 82.0 95.4

70.6 72.3 77.9 90.7

54.7 58.1 62.8 89.0

53.3 54.4 60.5 87.6

40.3 46.7 58.0 85.1

37.2 43.0 51.0 83.4

55.1 58.7 65.3 88.5

SPL

33.5 31.4 41.6 46.4

31.0 29.8 34.5 40.9

26.8 25.9 32.0 42.9

24.9 23.1 33.7 44.2

20.8 18.3 36.6 49.2

21.3 22.3 38.1 53.1

26.3 25.1 36.0 46.1

Map-only Goal prediction SGoLAM Ours

Table 3. Evaluation of unseen environments, reporting the success rate (top) and SPL (bottom). Model

1-obj 2-obj 3-obj 4-obj 5-obj 6-obj Avg 1–6

Success Map-only Goal prediction SGoLAM Ours

74.2 74.7 89.8 93.1

64.5 69.8 85.3 89.4

61.0 66.7 79.9 86.4

60.7 61.9 75.0 82.1

57.9 56.1 74.3 82.5

32.5 44.0 71.1 81.1

58.4 62.2 79.2 85.7

SPL

30.1 29.9 47.7 45.3

27.0 21.9 44.0 40.0

28.6 20.1 43.7 38.5

29.8 21.0 46.2 43.2

29.4 19.7 47.5 46.8

18.4 21.4 49.8 49.6

27.2 22.3 46.4 43.9

Map-only Goal prediction SGoLAM Ours

Subsequently, we evaluate all the agents in unseen environments and present the results in Table 3. Interestingly, all learning-based approaches achieve similar performance as on the seen apartments, indicating that the semantic map modality generalizes well to unseen apartments. This is particularly impressive as it only has access to eight training scenes. On the other hand, SGoLAM performs better than in the train split, indicating that the test split might be less challenging. While the gap between the best baseline, SGoLAM, and our approach shrinks, it remains considerable with an average difference in success rates of 6.5%. In terms of SPL it even performs slightly better. This may be due to the path planner, which, once an object is in sight, executes the optimal path to this object. While SGoLAM achieves very strong performance in apartments with a lot of open areas, its performance drops severely in more complex apartments with many corridors, rooms next to each other, and generally long-drawn layouts. In contrast, our approach maintains a more even performance across the different apartment layouts. Qualitatively, we find that the agent learns efficient navigation behaviors. Figure 3 and the accompanying video show example trajectories in unseen apartments. The agent efficiently looks around rooms and learns maneuvers such as a three-point turn. Where

62

F. Schmalstieg et al. Table 4. Real world multi-object search experiments on the HSR Robot. Model

2-obj 3-obj 4-obj 6-obj Total

Success

6

6

5

5

22

Collision

4

3

5

4

16

Timeout

0

Total episodes 10

1

0

1

2

10

10

10

40

Fig. 3. Example trajectories of our agent (top) and SGoLAM (bottom) in unseen apartments. The maps show the following categories: black: unexplored, blue: free space, green: walls, red: agent trace, grey: (found) target objects, other colors: miscellaneous objects.

confident, it reliably follows its own long-horizon predictions, while deviating if it points into walls, if the predictions are low confidence or if it is possible to explore a lot of additional space with little effort. While SGoLAM randomly picks points on the frontier, often resulting in multiple map crossings and getting lost in very small unexplored spaces, our approach continuously explores and learns to leave out small spots that are unlikely to contain a target object. We further observe an inverse development between the SPL and the success rate with regard to the number of objects for both our approach and SGoLAM. This increase in the SPL most likely stems from a higher number of routes that can be taken which are close to the optimal path. With fewer objects in the scene, large parts of the exploration increase the SPL without finding an object. Nevertheless, this exploration is essential as there is no prior knowledge of the object locations. 4.3

Real World Experiments

To test the transfer to the real world, we transfer the policy onto a Toyota HSR robot. While the HSR has an omnidirectional drive, we restrict its motion to match that of

Learning Long-Horizon Robot Exploration

63

Fig. 4. From top left to bottom right: Floorplan of the real world environment. The HSR robot in the office environment. Example episodes in the real world environment, the (found) target objects are shown in grey. Bottom right: failure case, the agent moves repeatedly back and forth between two rooms until it reaches the timeout.

the LoCoBot’s differential drive in simulation. We run the experiments in our office building, representing a common office environment. We use rooms covering roughly a size of 320 square meters, excluding small corners that cannot be navigated safely. We assume to have access to an accurate semantic camera. For this, we use the robot’s depth camera to construct a map of the explored environment and overlay the explored parts with a previously semantically annotated map. This prerecorded map is created with hector slam [18]. We then map all occupied space to the wall category and add semantic categories for the target objects. We find that the policy generalizes well to

64

F. Schmalstieg et al.

omitting more fine-grained, expensive labelling of other classes. The same map we also use for localization. In each episode, we randomly spawn 1–6 virtual target objects in this map by adding them to the semantic overlay. The actions are computed on the onboard CPU and executed at roughly 7 Hz. We define a maximum episode length of 6 min, roughly matching simulation. The robot is equipped with bumper sensors in its base that stop it upon any collision, in which case we deem the episode unsuccessful. Figure 4 shows the real-world setup. We make the following adaptations for the real world: To minimize collisions, we inflate the map by 5 cm and scale the actions of the agent by a factor of 0.55. Secondly, we reduce the temperature of the softmax activation of the long-horizon predictions to 0.1. We find that this increases the agent’s confidence in its own predictions and leads to more target-driven exploration. We evaluate the agent for a total of 40 episodes, spread across different numbers of target objects. The results from this experiment are presented in Table 4. We observe that the agent successfully transfers to the real world, bridging differences in the robot’s motion model, sensors, and environment layouts. Overall it solves 55% of all episodes successfully with almost no decrease as the number of target objects increases. We find two main difficulties in the real world: while generally navigating smoothly, the agent occasionally collides with door frames or small objects such as posts. This may be caused by the mismatch in the robot’s motions and controllers as well as due to the training environments consisting of mostly clean edges and little unstructured clutter. Secondly, in a very small number of episodes, the agent gets stuck moving back and forth between close spots, as the long-horizon prediction keeps changing back and forth. These results suggest a potential to further increase the success in the real world by finetuning the learned agent on the real robot, in particular to further reduce the number of collisions. Example trajectories from this experiment are shown in Fig. 4 and in the accompanying video.

5 Conclusion In this paper, we proposed a novel reinforcement learning approach for object search that unifies short- and long-horizon reasoning into a single model. To this end, we introduced a multi-object search task in continuous action space and formulated an explicit prediction task which allows the agent to guide itself over long-horizons. We demonstrated that our approach significantly outperforms other learning-based methods which struggle to perform efficient long-term exploration in this continuous space. By focusing on structured semantic map inputs, our approach learns complex exploration behaviors in comparably few steps and generalizes effectively to unseen apartments. It furthermore decouples perception and decision making, mitigating sim-to-real gap and allowing the approach to be easily updated with new arising methods in the quickly progressing field of semantic segmentation. We successfully transferred the approach to the real world and find that the agent indeed bridges the sim-real gap and exhibits the potential for further improvement if given the chance to adapt to the motion model of the real robot. In the future, we aim to further exploit the ability of the approach to learn different actions at a high control frequency. Particular, we will investigate the ability to incorporate control of the head camera which should further improve the agent’s success rates.

Learning Long-Horizon Robot Exploration

65

Furthermore, we are interested in the application to mobile manipulation in which the agent has to simultaneously navigate and control its arms [14]. A third promising direction is the ability of learning-based approaches to incorporate data-driven knowledge such as correlations between semantic classes in real-world environments. Training on much larger environments will provide exciting avenues to exploit this.

References 1. Anderson, P., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018) 2. Beeching, E., Debangoye, J., Simonin, O., Wolf, C.: Deep reinforcement learning on a budget: 3D control and reasoning without a supercomputer. In: 25th International Conference on Pattern Recognition (ICPR), pp. 158–165 (2021) 3. Cattaneo, D., Vaghi, M., Valada, A.: LCDNet: deep loop closure detection and point cloud registration for LiDAR SLAM. IEEE Trans. Robot. 38, 2074–2093 (2022) 4. Chaplot, D.S., Gandhi, D., Gupta, S., Gupta, A., Salakhutdinov, R.: Learning to explore using active neural slam. In: International Conference on Learning Representations (ICLR) (2020) 5. Chaplot, D.S., Gandhi, D.P., Gupta, A., Salakhutdinov, R.R.: Object goal navigation using goal-oriented semantic exploration. In: Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), vol. 33, pp. 4247–4258 (2020) 6. Chen, C., et al.: SoundSpaces: audio-visual navigation in 3D environments. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12351, pp. 17–36. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58539-6 2 7. Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S.K., Grauman, K.: Learning to set waypoints for audio-visual navigation. In: International Conference on Learning Representations (ICLR) (2020) 8. Chen, T., Gupta, S., Gupta, A.: Learning exploration policies for navigation. In: International Conference on Learning Representations (ICLR) (2019) 9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255. IEEE (2009) 10. Druon, R., Yoshiyasu, Y., Kanezaki, A., Watt, A.: Visual object search by learning spatial context. IEEE Robot. Autom. Lett. 5(2), 1279–1286 (2020) 11. Fang, K., Toshev, A., Fei-Fei, L., Savarese, S.: Scene memory transformer for embodied agents in long-horizon tasks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 538–547 (2019) 12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016) 13. Henriques, J.F., Vedaldi, A.: MapNet: an allocentric spatial memory for mapping environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8476–8484 (2018) 14. Honerkamp, D., Welschehold, T., Valada, A.: Learning kinematic feasibility for mobile manipulation through deep reinforcement learning. IEEE Robot. Autom. Lett. 6(4), 6289– 6296 (2021) 15. Hurtado, J.V., Londo˜no, L., Valada, A.: From learning to relearning: a framework for diminishing bias in social robot navigation. Front. Robot. AI 8, 69 (2021) 16. Kim, J., Lee, E.S., Lee, M., Zhang, D., Kim, Y.M.: SGoLAM: simultaneous goal localization and mapping for multi-object goal navigation. arXiv preprint arXiv:2110.07171 (2021)

66

F. Schmalstieg et al.

17. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 18. Kohlbrecher, S., Meyer, J., von Stryk, O., Klingauf, U.: A flexible and scalable slam system with full 3D motion estimation. In: Proceedings of the IEEE International Symposium on Safety, Security and Rescue Robotics (SSRR). IEEE (2011) 19. Li, C., et al.: iGibson 2.0: object-centric simulation for robot learning of everyday household tasks. In: 2021 Conference on Robot Learning (CoRL), Proceedings of Machine Learning Research, vol. 164, pp. 455–465. PMLR (2021). https://proceedings.mlr.press/v164/li22b. html 20. Min, S.Y., Chaplot, D.S., Ravikumar, P.K., Bisk, Y., Salakhutdinov, R.: FILM: following instructions in language with modular methods. In: International Conference on Learning Representations (ICLR) (2022) 21. Oh, J., Chockalingam, V., Lee, H., et al.: Control of memory, active perception, and action in minecraft. In: International Conference on Machine Learning, pp. 2790–2799 (2016) 22. Qiu, Y., Pal, A., Christensen, H.I.: Learning hierarchical relationships for object-goal navigation. In: 2020 Conference on Robot Learning (CoRL) (2020) 23. Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., Dormann, N.: Stablebaselines3: reliable reinforcement learning implementations. J. Mach. Learn. Res. 22(268), 1–8 (2021) 24. Savva, M., et al.: Habitat: a platform for embodied AI research. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9339–9347 (2019) 25. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) 26. V¨odisch, N., Cattaneo, D., Burgard, W., Valada, A.: Continual SLAM: beyond lifelong simultaneous localization and mapping through continual learning. arXiv preprint arXiv:2203.01578 (2022) 27. Wani, S., Patel, S., Jain, U., Chang, A., Savva, M.: MultiON: benchmarking semantic map memory using multi-object navigation. In: Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), vol. 33, pp. 9700–9712 (2020) 28. Yamauchi, B.: A frontier-based approach for autonomous exploration. In: Proceedings 1997 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA), pp. 146–151 (1997) 29. Younes, A., Honerkamp, D., Welschehold, T., Valada, A.: Catch me if you hear me: audiovisual navigation in complex unmapped environments with moving sounds. arXiv preprint arXiv:2111.14843 (2021) 30. Zhang, J., Tai, L., Boedecker, J., Burgard, W., Liu, M.: Neural SLAM: learning to explore with external memory. arXiv preprint arXiv:1706.09520 (2017) 31. Zhu, Y., et al.: Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: International Conference on Robotics and Automation, pp. 3357–3364 (2017)

Visual Foresight with a Local Dynamics Model Colin Kohler(B) and Robert Platt Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA {kohler.c,r.platt}@northeastern.edu Abstract. Model-free policy learning has been shown to be capable of learning manipulation policies which can solve long-time horizon tasks using single-step manipulation primitives. However, training these policies is a time-consuming process requiring large amounts of data. We propose the Local Dynamics Model (LDM) which eﬃciently learns the state-transition function for these manipulation primitives. By combining the LDM with model-free policy learning, we can learn policies which can solve complex manipulation tasks using one-step lookahead planning. We show that the LDM is both more sample-eﬃcient and outperforms other model architectures. When combined with planning, we can outperform other model-based and model-free policies on several challenging manipulation tasks in simulation. Keywords: Spatial action space · Visual dynamics model Reinforcement learning · Robotic manipulation

1

·

Introduction

Real-world robotic manipulation tasks require a robot to execute complex motion plans while interacting with numerous objects within cluttered environments. Due to the diﬃculty in learning good policies for these tasks, a common approach is to simplify policy learning by expressing the problem using more abstract (higher level) actions such as end-to-end collision-free motions combined with some motion primitive such as pick, place, push, etc. This is often called the spatial action space and is used in several works including [24,31,32,35]. By leveraging these open-loop manipulation primitives, model-free policy learning learns faster and can ﬁnd better policies. However, a key challenge with this approach is that a large number of actions need to be considered at each timestep leading to diﬃculties in learning within a large SE(2) workspace or an SE(3) workspace of any size. Due to these challenges, model-based policy learning presents an attractive alternative because it has the potential to improve sample eﬃciency [6,12,28]. Applying model-based methods to robotics, however, has been shown to be diﬃcult and often requires reducing the high-dimensional states provided by sensors to low-dimensional latent spaces. While these methods have been successfully applied to a variety of robotic manipulation tasks [16,29] they also require a large amount of training data (on the order of 10,000 to 100,000 examples). c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 67–82, 2023. https://doi.org/10.1007/978-3-031-25555-7_6

68

C. Kohler and R. Platt

This paper proposes the Local Dynamics Model (LDM) which learns the state-transition function for the pick and place primitives within the spatial action space. Unlike previous work which learns a dynamics model in latent space, LDM exploits the encoding of actions into image-space native to the spatial action space to instead learn an image-to-image transition function. Within this image space, we leverage both the localized eﬀect of pick-and-place actions and the spatial equivariance property of top-down manipulation to dramatically improve the sample eﬃciency of our method. Due to this eﬃciency, the dynamics model quickly learns useful predictions allowing us to perform policy learning with a dynamics model which is trained from scratch alongside the policy. We demonstrate this through our use of a one-step lookahead planner which uses the state value function in combination with the LDM to solve many diﬀerent complex manipulation tasks in simulation. We make the following contributions. First, we propose the Local Dynamics Model, a novel approach to deterministic modelling of environmental dynamics by restructuring the transition function. We demonstrate that the LDM is capable of eﬃciently learning both pick and place manipulation primitives. Second, we introduce a online policy learning method which leverages the LDM to solve challenging manipulation tasks. Our experimental results show that our method outperforms other model-based and model-free. Our code is available at https:// github.com/ColinKohler/LocalDynamicsModel.

2

Related Work

Robotic Manipulation: Broadly speaking, there are two common approaches to learning manipulation policies: open-loop control and close-loop control. In closed-loop control, the agent controls the delta pose of the end-eﬀector enabling ﬁne-tune control of the manipulator. This end-to-end approach has been shown to be advantageous when examining contact-rich domains [11,13,17]. In contrast, agents in open-loop control apply predeﬁned action primitives, such as pick, place, or push, to speciﬁed poses within the workspace. This tends to provide more data-eﬃcient learning but comes at the cost of less powerful policies [20]. Spatial Action Space: The spatial action space is an open-loop control approach to policy learning for robotic manipulation. Within this domain, it is common to combine planar manipulation with a fully-convolutional neural network (FCN) which is used as a grasp quality metric [20] or, more generally, a action-value metric [31,35]. This approach has been adapted to a number of diﬀerent manipulation tasks covering a variety of action primitives [2,18,34]. Dynamics Modelling: Model-Based RL improves data-eﬃciency by incorporating a learned dynamics model into policy training. Model-based RL has been successfully applied to variety of non-robotic tasks [3,6,12] but has seen more mixed success in robotics tasks. While model-based RL has been shown to work well in

Visual Foresight with a Local Dynamics Model

69

robotics tasks with low-dimensional state-spaces [16,29], the high-dimensionality of visual state-spaces more commonly seen in robotic manipulation tends to harm performance. More modern approaches learn a mapping from image-space to some underlying latent space and learn a dynamics model which learns the transition function between these latent states [15,21]. More recent work has examined image-to-image dynamics models similar to the video prediction models in computer vision [5]. However, these works typically deal with short-time horizon physics such as poking objects [1] or throwing objects [36]. Paxton et al. [23] and Hoque et al. [10] learn visual dynamics models for pick and place primitives but require a large amount of data and time to learn an accurate model. Our work is most closely related to [2] and [33]. In [2], Berscheid et al., learn a visual transition model using a GAN architecture but only learn pick and push primitives while still requiring a large amount of data. Additionally, they only examine a simple bin picking task in their experiments resulting in a transition model with relatively simple dynamics to learn. Wu et al. [33] learn a visual foresight model tailored to a suction cup gripper and use it to solve various block rearrangement tasks using transporter networks. In contrast, we achieve similar sample eﬃciency using a more complicated parallel jaw gripper across a much more diverse set of objects and tasks using online policy learning.

3

Problem Statement

Manipulation as an MDP in a Spatial Action Space: This paper focuses on robotic manipulation problems expressed as a Markov decision process in a spatial action space, M = (S, A, T, R, γ), where state is a top down image of the workspace paired with an image of the object currently held in the gripper (the in-hand image) and action is a subset of SE(2). Speciﬁcally, state is a pair of c-channel images, s = (sscene , shand ) ∈ Sscene × Shand , where sscene ∈ Sscene ⊆ Rc×h×w is a c × h × w image of the scene and shand ∈ Shand ⊆ Rc×d×d is a c×d×d image patch that describes the conFig. 1. The manipulation scene tents of the hand (Fig. 2). At each time step, sscene is set equal to a newly acquired top-down image of the scene. shand is set to the oriented d × d image patch corresponding to the pose of the last successful pick. If no successful pick has occurred or the hand is empty, then shand is set to be the zero image. Action a ∈ A ⊆ SE(2) is a target pose for an end eﬀector motion to be performed at the current timestep. If action

70

C. Kohler and R. Platt

Fig. 2. MDP state. (a) The manipulation scene. (b) The top-down image of the workspace, sscene . (c) The in-hand image, shand .

a executes when the hand is holding an object (when the in-hand image is not zero), then a is interpreted as a place action, i.e. move and then open the ﬁngers. Otherwise, a is interpreted as a pick, i.e. move and close the ﬁngers. Here, A = Apos × S 1 ⊆ SE(2) spans the robot workspace and Apos ⊆ R2 denotes the position component of that workspace. State and action are related to each other in that each action corresponds to the pixel in the state that is beneath the end eﬀector target pose speciﬁed by the action. We assume we have access to a function h : A → Z2 that maps an action to the pixel corresponding to its position component. Assumptions: The following assumptions can simplify policy learning and are often reasonable in robotics settings. First, we assume that we can model transitions with a deterministic function. While manipulation domains can be stochastic, we note that high value transitions are often nearly deterministic, e.g. a high value place action often leads to a desired next state nearly deterministicly. As a result, planning with a deterministic model is often reasonable. Assumption 1 (Deterministic Transitions). The transition function is deterministic and can therefore be modeled by the function s = f (s, a), i.e. the dynamics model. The second assumption concerns symmetry with respect to translations and rotations of states and actions. Given a transformation g ∈ SE(2), g(s) denotes the state s = (sscene , shand ) where sscene has been rotated and translated by g and shand is unchanged. Similarly, g(a) denotes the action a rotated and translated by g. Assumption 2 (SE(2) Symmetric Transitions). The transition function is invariant to translations and rotations. That is, for any translation and rotation g ∈ SE(2), T (s, a, s ) = T (g(s), g(a), g(s )) for all s, a, s ∈ S × A × S. The last assumption concerns the eﬀect of an action on state. Let R ⊆ Apos be a region of R2 . Given a state s = (sscene , shand ), let sscene = mask(s, R) ∈ Sscene

Visual Foresight with a Local Dynamics Model

71

denote the scene image that is equal to sscene except that all pixels inside R have been masked to zero. In the following, we will be exclusively interested in image masks involving the region Ba , deﬁned as follows: Deﬁnition 1 (Local Region). For an action a = (apos , aθ ) ∈ SE(2), let Ba ⊆ Apos denote the square region with a ﬁxed side length d (a hyperparameter) that is centered at apos and oriented by aθ . We are now able to state the ﬁnal assumption: Assumption 3 (Local Eﬀects). An action a ∈ A does not aﬀect parts of the scene outside of Ba . That is, given any transition s = f (s, a), it is the case that mask(s, Ba ) = mask(s , Ba ). The bottle arrangement task (Figs. 1, 2) is an example of a robotic manipulation domain that satisﬁes the assumptions above. First, notice that high value actions in this domain lead to deterministic pick and place outcomes, i.e. picking up the bottle and placing it with a low probability of knocking it over. Second, notice that transitions are rotationally and translationally symmetric in this problem. Finally, notice that interactions between the hand and the world have local eﬀects. If the hand grasps or knocks over a bottle, that interaction typically aﬀects only objects nearby the interaction.

4

Method

In this section, we ﬁrst introduce the Local Dynamics Model (LDM) detailing its properties and model architecture. We then discuss how we combine the LDM with an action proposal method to perform policy learning through one-step lookahead planning. 4.1

Structuring the Transition Model

We simplify the problem of learning the transition function f : S × A → S by encoding Assumptions 2 and 3 as constraints on the model as follows. First, given a state s = (sscene , shand ), we partition the scene image sscene into a region that is invariant under a, sˇa = mask(s, Ba ), and a region that changes under a, sˆa = crop(s, Ba ). Here, crop(s, R) ∈ Rc×d×d denotes the c-channel d × d image patch cropped from sscene corresponding to region R ⊆ Apos , resized to a d × d image. Using this notation, we can reconstruct the original scene image by combining sˆa and sˇa : sscene = insert(ˆ sa , Ba ) + sˇa ,

(1)

where sˆa = crop(s, Ba ) and insert(ˆ sa , Ba ) inserts the crop into region Ba and sets the pixels outside Ba to zero.

72

C. Kohler and R. Platt

Fig. 3. Local dynamics model. Pick (top) and place (bottom) action primitives. In order to predict the next scene image sscene , we learn a model f¯ that predicts how the scene will change within Ba , a neighborhood around action a. The output of this model is inserted into the original scene image.

4.2

Local Dynamics Model

Instead of learning f directly, we will learn a function f¯ : Rc×d×d → Rc×d×d that maps the image patch sˆa onto a new patch sˆa . Whereas f models the dynamics of the entire scene, f¯ only models changes in the scene within the local region Ba . We refer to f¯ as the local dynamics model (LDM). Given such a model, we can deﬁne a function fscene as: fscene (s, a) = insert(f¯(ˆ sa ), Ba ) + sˇa ,

(2)

where sˆa = crop(s, Ba ). We can reconstruct f as f (s, a) = (fscene (s, a), shand ) where shand denotes the in-hand image obtained using the rules described in Sect. 3. Figure 3 illustrates this process for picking and placing in a block arrangement task. Notice that the model in Eq. 2, fscene , satisﬁes both Assumptions 2 and 3. The fact that it satisﬁes Assumption 3 is easy to see as the local dynamics model f¯ only models changes in the scene within the local region Ba . It also satisﬁes Assumption 2 because sˆa is invariant under transformations g ∈ SE(2) of s and a: sˆa = crop(s, Ba ) = crop(g(s), Bg(a) ), where g(s) rotates state s and g(a) rotates action a. As a result, Eq. 2 is constrained to be equivariant in the sense that g(fscene (s, a)) = fscene (g(s), g(a)).

Visual Foresight with a Local Dynamics Model

73

Fig. 4. LDM architecture. Model architecture used in f¯, the local dynamics model. Each blue box represents a 3 × 3 ResNet Block.

Model Architecture: We model the local dynamics model, f¯, using the UNet model architecture shown in Fig. 4 with four convolution and four deconvolution layers. It takes as input the image patch sˆa ∈ Rc×d×d , and outputs a patch from the predicted next state, sˆa ∈ Rc×d×d . The size of this image patch must be large enough to capture the eﬀects of pick and place actions, but small enough to ignore objects not aﬀected by the current interaction. In our experiments, we set d = 64 pixels which corresponds to roughly 20 cm in the workspace. Loss Function: f¯ is trained using a reconstruction loss, i.e. a loss which measures the diﬀerence between a predicted new state image patch and a ground truth image patch. Typically, this is accomplished using a pixel-wise L2 loss [9]. However, we instead model pixel values as a multinomial probability distribution over 21 diﬀerent possible values for each pixel (in our case, these are depth values since we use depth images). This enables us to use a cross entropy loss, which has been shown to have better performance relative to an L2 loss [30]. We were able to improve performance even further by using a focal loss rather than a vanilla cross entropy loss [19]. This alleviates the large class imbalance issues that arise from most pixels in sˆa having the same value and focuses learning on parts of the pixel space with the most challenging dynamics. 4.3

Policy Learning

While there are a variety of ways to improve policy learning using a dynamics model, here we take a relatively simple one-step lookahead approach. We learn the state value function Vψ (s), and use it in combination with the dynamics ˆ a) = Vψ (f (s, a)). A key challenge here model to estimate the Q function, Q(s, ˆ a) or arg maxa∈A Q(s, ˆ a) over is that it is expensive to evaluate maxa∈A Q(s, large action spaces (such as the spatial action space) because the forward model

74

C. Kohler and R. Platt

must be queried separately for each action. We combat this problem by learning an approximate Q function that is computationally cheap to query and use it to reduce the set of actions over which we maximize. Speciﬁcally, we learn a function Qθ using model-free Q-learning: Qθ (s, a) ← r + γ maxa ∈A Qθ (s , a ). Then, we deﬁne a policy πθ (a|s) = σA (Qθ (s, a)), where σA denotes the softmax function over the action space A with an implicit temperature parameter τ . We sample a small set of high quality actions A¯N ⊆ A by drawing N action samples ˆ a) ≈ maxa∈A¯ Q(s, ˆ a). from πθ (a|s). Now, we can approximate maxa∈A Q(s, N ˆ The target for learning Vψ is now Vψ (s) ← r + maxa∈A¯N Q(s, a). The policy ˆ a)). We schedule exploration under which our agent acts is π(a|s) = σA¯N (Q(s, by decreasing the softmax temperature parameter over the course of learning. We model Qθ using a fully-convolutional neural network which takes as input the top-down heightmap sscene and outputs a 2-channel action-value map (Qpick , Qplace ) ∈ R2×r×h×w where Qpick correlates with picking success and Qplace to placing success. The orientation of the action is represented by discretizing the space of SO(2) rotations into r values and rotating s by each θ value. Vψ is modeled as standard convolutional neural network which takes the state s as input and outputs the value of that state. We use two target networks parameterized by θ− and ψ − which are updated to the current weights θ and ψ every t steps to stabilize training. 4.4

Sampling Diverse Actions ˆ a) and π(a|s) = σA¯ (Q(s, ˆ a)), it is important to When evaluating maxa∈A¯N Q(s, N ¯ sample a diverse set of actions AN . The problem is that σ(Qθ , ·)) can sometimes be a low entropy probability distribution with a small number of high-liklihood peaks. If we draw N independent samples directly from this distribution, we are likely to obtain multiple near-duplicate samples. This is unhelpful since we only need one sample from each mode in order to evaluate it using Vψ (f (s, a)). A simple solution would be to sample without replacement. Unfortunately, as these peaks can include a number of actions, we would have to draw a large number of samples in order to ensure this diversity. To address this problem, we use an inhibition technique similar to non-maximum suppression where we reduce the distribution from which future samples are drawn in a small region around each previously drawn sample. Speciﬁcally, we draw a sequence of samples, a1 , . . . , aN . The ﬁrst sample is drawn from the unmodiﬁed distribution Qθ (s, ·). Each sucj cessive sample j = N is drawn from a distribution Qθ (s, ·) − β i=1 N (ai , σ 2 ), where N denotes the standard normal distribution in R3 , and β and σ 2 are constants. Here, we have approximated SE(2) as a vector space R3 in order to apply the Gaussian. Over the course of training, we slowly reduce β as the optimal policy is learned.

5

Experiments

We performed a series of experiments to test our method. First, we investigate the eﬀectiveness of the Local Dynamics Model (LDM) by training the model

Visual Foresight with a Local Dynamics Model

75

in isolation on pre-generated oﬄine data. Second, we demonstrate that we can learn eﬀective policies across a number of complex robotic manipulation tasks. Network Architecture: A classiﬁcation UNet with bottleneck Resnet blocks [7] is used as the architecture of the LDM. A similar network architecture is used for the Q-value model, Qθ , with the exception of using basic Resnet blocks. The state value model, Vψ , is a simple CNN with basic Resnet blocks and two fullyconnected layers. The exact details for the number of layers and hidden units can be found in our Github repository. Implementation Details: The workspace has a size of 0.4 m × 0.4 m and sscene covers the workspace with a heightmap of size of 128 × 128 pixels. We use 8 discrete rotations equally spaced from 0 to π. The target network is synchronized every 100 steps. We used the Adam optimizer [14], and the best learning rate and its decay were chosen to be 10−3 and 0.95 respectively. The learning rate is multiplied by the decay every 2000 steps. We use the prioritized replay buﬀer [27] with prioritized replay exponent α = 0.6 and prioritized importance sampling exponent B0 = 0.0 annealed to 1 over training. The expert transitions are given a priority bonus of d = 1 as in Hester et al. [8]. The buﬀer has a size of 10000 episodes. Our implementation is based on PyTorch [22].

(a) Block Stacking

(b) House Building

(c) Bottle Arrange

(d) Bin Packing

Fig. 5. Tasks. The window in the top-left corner shows the goal state.

Task Descriptions: For all experiments, both the training and testing is preformed in the PyBullet simulator [4]. In the block stacking domain, three cubes are placed randomly within the workspace and the agent is tasked with placing these blocks into a stable stack. In the house building domain, two cubes and one triangle are placed randomly within the workspace and the agent is tasked with placing the triangle on top of the two cube blocks. In the bottle arrangement domain, the agent needs to gather six bottles in a tray. These three environments have spare rewards (+1 at goal and 0 otherwise). In the bin packing domain, the agent must compactly pack eight blocks into a bin while minimizing the height of the pack. This environment uses a richer reward function and provides a positive reward with magnitude inversely proportional to the highest point in the pile after packing all objects. Example initial and goal conﬁgurations for these domains can be seen in Fig. 5.

76

5.1

C. Kohler and R. Platt

Accuracy of the Local Dynamics Model

Table 1. Dynamics model performance. Final performance for the 4 domains on the diﬀerent dynamics models. The results show the mean and standard deviation averaged over 3 random seeds. L1 denotes the L1-pixelwise diﬀerence between the predicted observation and the true observation. Lower is better. SR denotes the success rate (%) for the action primitive. Higher is better. Method

Block stacking L1 SR

House building L1 SR

Naive

30.3 ± 1.7

38 ± 5.6

30.4 ± 1.5

LDM(128) 14.5 ± 2.2

70 ± 1.8

10.9 ± 0.24 70.7 ± 0.6

39.1 ± 1.9

LDM(64)

8.76 ± 0.1 83.4 ± 0.6 5.88 ± 0.2 77.9 ± 1.1

Method

Bottle arrangement L1 SR

Naive

48.9 ± 0.9

43.8 ± 4.4

LDM(128) 43.6 ± 0.79 58.6 ± 1.3 LDM(64)

32.5 ± 1.8 66 ± 1.9

Bin packing L1 SR 77.2 ± 0.71 35.4 ± 0.8 93.3 ± 1.9

60.4 ± 2.1

48.8 ± 0.9 63.9 ± 0.5

Experiment: We generate 5k steps of noisy expert data for each of the domains in Fig. 5 by rolling out a hand coded stochastic policy. For the block stacking and house building domains we train the models for 5k iterations of optimization. For the bottle arrangement and bin packing domains we train the models for 10k iterations. Metrics: We examine two metrics of model accuracy: 1) the L1-pixelwise difference between the predicted observation and the true observation and 2) the success rate of the action primitives. A pick action is deﬁned as a success if the model correctly predicts if the object will be picked up or not. Similarly, a place action is deﬁned as a success provided the model correctly predicts the pose of the object after placement within a small tolerance (20 mm, 5◦ ). The L1 difference provides a low level comparison of the models whereas the success rate provides a higher level view which is more important for planning and policy learning. Baselines: We compare the performance of three dynamics models. 1. LDM(64): Local Dynamics Model with a crop size of 64 pixels. 2. LDM(128): Local Dynamics Model with a crop size of 128 pixels. 3. Naive: UNet forward model with 128×128 input and output size. The action is encoded by concatenating a binary mask of the action position onto the state s.

Visual Foresight with a Local Dynamics Model

77

Results: In Table 1, we summarize the accuracy of the models in the four domains on a held-out test set. While both LDM(64) and LDM(128) are able to generate realistic images in non-cluttered domains, we ﬁnd that deﬁning a small localized area of aﬀect to be vital in cluttered domains such as bin packing. The most common failure mode occurs when the model overestimates the stability of object placements. For Fig. 6. Sample eﬃciency. Action primitive success example, it has diﬃculties rate for bin packing. Results averaged over three ranin determining the inﬂec- dom seeds. Shading denotes standard error. tion point when stacking blocks which will lead to the stack falling over. Equally important to the ﬁnal performance of the models is how eﬃciently they learn. In Fig. 6, the action primitive success rate is shown over training for the bin packing environment. The sample eﬃciency of LDM(64) makes it much more useful for policy learning as the faster the dynamics model learns the faster the policy will learn. 5.2

Policy Learning

Here, we evaluate our ability to use the local dynamics model to learn policies that solve the robotic manipulation tasks illustrated in Fig. 5. In each of these domains, the robot must execute a series of pick and place actions in order to arrange a collection of objects as speciﬁed by the task. These are sparse reward tasks where the agent gets a non-zero reward only upon reaching a goal state. As such, we initialize the replay buﬀer for all agents with 100 expert demonstration episodes in order to facilitate exbyploration. Baselines: We compare our approach with the following baselines. 1. FC-DQN: Model-free policy learning using a fully-convolutional neural network to predict the q-values for each action in the spatial-action space. Rotations are encoded by rotating the input and output for each θ thereby encoding SE(2) equivariance in a similar manner to the LDM [35]. 2. Random Shooing (RS): RS samples K candidate action sequences from a uniform distribution and evaluates each candidate using the dynamics module. The optimal action sequence is chosen as the one with the highest return [25,26]. Due to the size of the action space, we restrict action sampling to only sample actions which are nearby or on obejcts within the workspace.

78

C. Kohler and R. Platt

3. Dyna-Q: FC-DQN model trained Dyna-style where training iterates between two steps. First, data is gathered using the current policy and used to learn the dynamics model. Second, the policy is improved using synthetic data generated by the dynamics model. At test time only the policy is used [28]. For fairness, all algorithms use the same model architecture. For RS and Dyna-Q, an extra head is added onto the state value model after the feature extraction layers in order to predict the reward for that state. When a model is not used, such as the value model for RS, they are not trained during that run. The forward model is not pretrained in any of the algorithms considered. All algorithms begin training the forward model online using the on-policy data contained in the replay buﬀer – the same data used to train the policy.

Fig. 7. Simulation experiment evaluation. The evaluation performance of greedypolicy in terms of task success rate. The evaluation is performed every 500 training steps. Results averaged over 3 random seeds. Shading denotes standard error.

Results: The results are summarized in Fig. 7. They show that our method (shown in blue) is more sample eﬃcient than FC-DQN in all domains except bin packing. We attribute the under-performance in bin packing to the diﬃcult transition function that the state prediction model must learn due to the varied geometry of the blocks interacting with each other. LDM signiﬁcantly out-preforms the model-based baselines in all domains. RS preforms poorly even with a high quality state prediction model due to the low probability of randomly

Visual Foresight with a Local Dynamics Model

79

Table 2. Generalization experiment. We show the task success rate (%) of zeroshot generalization evaluated on 100 episodes. Higher is better. Method

Block stacking Bottle arrangement 4 Block 5 Block 5 Bottle 6 Bottle

RS

48

23

8

4

FC-DQN 98

89

82

48

LDM

84

86

65

99

sampling a good trajectory in large actions spaces. Dyna-Q performs similarly poorly due to the minute diﬀerences between the simulated experiences and the real experiences cause the policy learned to preform worse on real data. 5.3

Generalization

One advantageous property of model-based RL, is its ability to generalize to unseen environments provided the underlying dynamics of the environments remains similar. In order to test how well LDM generalizes, we trained LDM, FC-DQN, and RS on the block stacking and bottle arrangement domains on a reduced number of objects and evaluated them with an increased number of objects, i.e. zero-shot generalization. These models are trained using online policy learning as in Sect. 5.2. Speciﬁcally, we trained our models on 3 block stacking and evaluated them on 4 and 5 block stacking. Similarly, we trained our models on 4 bottle arrangement and evaluated them on 5 and 6 bottle arrangement. As shown in Table 2, LDM is more eﬀective for zero-shot generalization when compared to both the model-free (FC-DQN) and model-based (RS) baselines.

6

Limitations and Future Work

This work has several limitations and directions for future research. The most glaring of these is our use of a single-step lookahead planner for policy learning. One large advantage of model-based methods is their ability to plan multiple steps ahead to ﬁnd the most optimal solution. For instance in bin packing, our single-step planner will occasionally greedily select a poor action which results in the ﬁnal pack being taller whereas a multi-step planner would be able to avoid this action by examining the future consequences. In terms of the LDM, we believe their are two interesting avenues for future work. First, due to our modeling of the pixels as probability distributions, we can easily estimate the uncertainty of the LDM’s predictions by calculating the pixelwise entropy of the model output. This could prove useful when planning by allowing us to avoid taking actions which the LDM is more uncertain about leading to more robust solutions. Similarly, using a transition model which models a distribution, such as a VAE or GAN, would allow us to remove Assumption 1 improving LDM’s ability to predict complex object dynamics. Secondly, although we encode SE(2)

80

C. Kohler and R. Platt

equivariance into the LDM by restructuring the dynamics function, we could also explore the use of equivriatant CNNs in the LDM architecture. These equviariant CNNs have been shown to greatly improve sample eﬃciency across a wide number of domains and have recently started being applied to robotic manipulation tasks similar to those we present in this work [32]. We view this work as orthogonal to [32] as the primary advantage of our method, and the reason behind its increased sample eﬃciency, lies with its ability to consider multiple promising actions instead of greedy action selection. Furthermore, replacing the conventional CNNs in our models with equivariant CNNs should result in a increased sample eﬃciency for all models, including the LDM.

7

Conclusion

In this paper, we propose the Local Dynamics Model (LDM) approach to forward modeling which learns the state-transition function for pick and place manipulation primitives. The LDM is able to eﬃciently learn the dynamics of many diﬀerent objects faster and more accurately compared to similar methods. This sample eﬃciency is achieved by restructuring the transition function to make the LDM invariant to both objects outside the region near the action and to transformations in SE(2). We show that the LDM can be used to solve a number of complex manipulation tasks through the use of a single-step lookahead planning method. Through the combination of the LDM with our planning method which samples a diverse set of actions, our proposed method is able to outperform the model-free and model-based baselines examined in this work. Acknowledgments. This work is supported in part by NSF 1724257, NSF 1724191, NSF 1763878, NSF 1750649, and NASA 80NSSC19K1474.

References 1. Agrawal, P., Nair, A., Abbeel, P., Malik, J., Levine, S.: Learning to poke by poking: experiential learning of intuitive physics. arXiv preprint arXiv:1606.07419 (2016) 2. Berscheid, L., Meißner, P., Kr¨ oger, T.: Learning a generative transition model for uncertainty-aware robotic manipulation. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4670–4677. IEEE (2021) 3. Chua, K., Calandra, R., McAllister, R., Levine, S.: Deep reinforcement learning in a handful of trials using probabilistic dynamics models. CoRR, abs/1805.12114 (2018). http://arxiv.org/abs/1805.12114 4. Coumans, E., Bai, Y.: PyBullet, a python module for physics simulation for games, robotics and machine learning (2016–2021). http://pybullet.org 5. Finn, C., Levine, S.: Deep visual foresight for planning robot motion. CoRR, abs/1610.00696 (2016). http://arxiv.org/abs/1610.00696 6. Gu, S., Lillicrap,T., Sutskever, I., Levine, S.: Continuous deep Q-learning with model-based acceleration. In: International Conference on Machine Learning, pp. 2829–2838. PMLR (2016) 7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR, abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385

Visual Foresight with a Local Dynamics Model

81

8. Hester, T., et al.: Learning from demonstrations for real world reinforcement learning. CoRR, abs/1704.03732 (2017). http://arxiv.org/abs/1704.03732 9. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006) 10. Hoque, R., et al.: Visuospatial foresight for multi-step, multi-task fabric manipulation. arXiv preprint arXiv:2003.09044 (2020) 11. Jang, E., Vijayanarasimhan, S., Pastor, P., Ibarz, J., Levine, S.: End-to-end learning of semantic grasping (2017) 12. Kaiser, L., et al.: Model-based reinforcement learning for Atari (2019). https:// arxiv.org/abs/1903.00374 13. Kalashnikov, D., et al.: QT-Opt: scalable deep reinforcement learning for visionbased robotic manipulation (2018) 14. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 15. Kossen, J., Stelzner, K., Hussing, M., Voelcker, C., Kersting, K.: Structured objectaware physics prediction for video modeling and planning (2019). https://arxiv. org/abs/1910.02425 16. Lenz, I., Knepper, R.A., Saxena, A.: DeepMPC: learning deep latent features for model predictive control. In: Robotics: Science and Systems. Rome, Italy (2015) 17. Levine, S., Pastor, P., Krizhevsky, A., Quillen, D.: Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection (2016) 18. Liang, H., Lou, X., Choi, C.: Knowledge induced deep q-network for a slide-to-wall object grasping. arXiv preprint arXiv:1910.03781, pp. 1–7 (2019) 19. Lin, T.-Y., Goyal, P., Girshick, R., He, K., Dollar, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017) 20. Mahler, J., et al.: Learning ambidextrous robot grasping policies. Sci. Robot. 4(26), eaau4984 (2019) 21. Minderer, M., Sun, C., Villegas, R., Cole, F., Murphy, K.P., Lee, H.: Unsupervised learning of object structure and dynamics from videos. Adv. Neural Inf. Process. Syst. 32 (2019) 22. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32 (2019) 23. Paxton, C., Barnoy, Y., Katyal, K., Arora, R., Hager, G.D.: Visual robot task planning. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 8832–8838. IEEE (2019) 24. Platt, R., Kohler, C., Gualtieri, M.: Deictic image mapping: an abstraction for learning pose invariant manipulation policies. In: Proceedings of the AAAI Conference on Artiﬁcial Intelligence, vol. 33, pp. 8042–8049 (2019) 25. Richards, A.G.: Robust constrained model predictive control. Ph.D. thesis, Massachusetts Institute of Technology (2005) 26. Ross, S., Gordon, G., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: Proceedings of the Fourteenth International Conference on Artiﬁcial Intelligence and Statistics, pp. 627–635. JMLR Workshop and Conference Proceedings (2011) 27. Schaul, T., Quan, J., Antonoglou, I., Silverm D.: Prioritized experience replay (2016) 28. Sutton, R.S.: Dyna, an integrated architecture for learning, planning, and reacting. ACM Sigart Bull. 2(4), 160–163 (1991) 29. Tassa, Y., et al.: Deepmind control suite (2018). https://arxiv.org/abs/1801.00690

82

C. Kohler and R. Platt

30. Van Oord, A., Kalchbrenner, N., Kavukcuoglu, K.: Pixel recurrent neural networks. In: International Conference on Machine Learning, pp. 1747–1756. PMLR (2016) 31. Wang, D., Kohler, C., Platt, R.: Policy learning in SE (3) action spaces. In: Proceedings of the Conference on Robot Learning (2020) 32. Wang, D., Walters, R., Zhu, X., Platt, R.: Equivariant q learning in spatial action spaces. In: Conference on Robot Learning, pp. 1713–1723. PMLR (2022) 33. Wu, H., Ye, J., Meng, X., Paxton, C., Chirikjian, G.: Transporters with visual foresight for solving unseen rearrangement tasks. arXiv preprint arXiv:2202.10765 (2022) 34. Wu, J., et al.: Spatial action maps for mobile manipulation. arXiv preprint arXiv:2004.09141 (2020) 35. Zeng, A., Song, S., Welker, S., Lee, J., Rodriguez, A., Funkhouser, T.: Learning synergies between pushing and grasping with self-supervised deep reinforcement learning (2018) 36. Zeng, A., Song, S., Lee, J., Rodriguez, A., Funkhouser, T.: TossingBot: learning to throw arbitrary objects with residual physics. IEEE Trans. Rob. 36(4), 1307–1319 (2020)

Robot Vision

Monocular Camera and Single-Beam Sonar-Based Underwater Collision-Free Navigation with Domain Randomization Pengzhi Yang1 , Haowen Liu2(B) , Monika Roznere2 , and Alberto Quattrini Li2 1

University of Electronic Science and Technology of China, Chengdu 610056, Sichuan, China [email protected] 2 Dartmouth College, Hanover, NH 03755, USA {haowen.liu.gr,monika.roznere.gr,alberto.quattrini.li}@dartmouth.edu Abstract. Underwater navigation presents several challenges, including unstructured unknown environments, lack of reliable localization systems (e.g., GPS), and poor visibility. Furthermore, good-quality obstacle detection sensors for underwater robots are scant and costly; and many sensors like RGB-D cameras and LiDAR only work in-air. To enable reliable mapless underwater navigation despite these challenges, we propose a low-cost end-to-end navigation system, based on a monocular camera and a ﬁxed single-beam echo-sounder, that eﬃciently navigates an underwater robot to waypoints while avoiding nearby obstacles. Our proposed method is based on Proximal Policy Optimization (PPO), which takes as input current relative goal information, estimated depth images, echosounder readings, and previous executed actions, and outputs 3D robot actions in a normalized scale. End-to-end training was done in simulation, where we adopted domain randomization (varying underwater conditions and visibility) to learn a robust policy against noise and changes in visibility conditions. The experiments in simulation and real-world demonstrated that our proposed method is successful and resilient in navigating a low-cost underwater robot in unknown underwater environments. The implementation is made publicly available at https://github. com/dartmouthrobotics/deeprl-uw-robot-navigation. Keywords: Monocular camera and sonar-based 3D underwater navigation · Low-cost AUV · Deep reinforcement learning · Domain randomization

1

Introduction

This paper presents an integrated deep-learning-based system, contingent on monocular images and ﬁxed single-beam echo-sounder (SBES) measurements, for navigating an underwater robot in unknown 3D environments with obstacles. P. Yang and H. Liu—Authors contributed equally. Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25555-7 7. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 85–101, 2023. https://doi.org/10.1007/978-3-031-25555-7_7

86

P. Yang et al.

Obstacle avoidance is fundamental for Autonomous Underwater Vehicles (AUVs) to safely explore the largely unmapped underwater realms (e.g., coral reefs, shipwrecks). However, the underwater environment itself poses unique challenges in regards to safe navigation, which is still an open problem for AUVs [1]. There are limited sensors and positioning systems (e.g., GPS) that accurately measure the surroundings and operate underwater, thus preventing the use of well-established navigation methods [2] that were originally designed for ground vehicles with sensors like LiDAR. In addition, the sensor conﬁgurations in lowcost AUVs, equipped with monocular camera, inexpensive IMU, compass, and ﬁxed SBES, bear their own individual drawbacks, such as no scale information and drifting/uncertain measurements. These challenges make the classic methods for obstacle avoidance and navigation in unknown environments – i.e., those which (1) estimate the geometry of the space using sensors with direct [3] or indirect [4,5] state estimation methods and (2) apply speciﬁc behaviors or planning in the partial map (e.g., Vector Field Histogram [6], Dynamic Window Approach [7]) – not directly applicable in underwater scenarios. With recent advances in deep reinforcement learning (DRL) [8,9], several end-to-end deep neural network based methods have emerged, from raw images to control outputs. These endto-end methods – typically tasked to endlessly navigate or reach a visual target – demonstrated good performance for ground robots in unknown environments [10]. Comparatively, underwater domains bring problems to learningbased vision navigation due to a more complex image formation model that results in, e.g., backscattering and light atten- Fig. 1. How to guide an underwater robot to 3D uation (Fig. 1). waypoints given only monocular images, ﬁxed echoThis paper proposes a goal- sounder range measurements, and a localization oriented end-to-end DRL nav- system, but no map, while also avoiding obstacles? igation approach, given that classical planning methods are not straightforward to apply as they require accurate maps, which are diﬃcult to obtain due to the underwater perception challenges described above. In particular, we design the ﬁrst multi-modal end-toend underwater navigation system in unstructured 3D environments for which no map is available, based on Proximal Policy Optimization (PPO) [11], which allows for continuous action space. The provided inputs are goal positions, estimated depth images, and range measurements from the ﬁxed SBES. Monocular camera and ﬁxed SBES keep the AUV’s cost low, while exploiting and complementing the individual sensor’ strengths – i.e., large ﬁeld of view from the

Collision-Free Navigation

87

monocular camera that can provide relative scene depth and the absolute range measurement from the SBES. We also propose a method to mitigate the simto-real gap problem by leveraging domain randomization into our system. We generated realistic simulated environments with diﬀerent underwater visibility and randomized training environments, enhancing the model robustness to the changing visual conditions in real underwater domain. Extensive experimental analysis with tests and ablation studies of the proposed navigation system were conducted both in simulation and real-world. Results demonstrated high safety and eﬃciency compared to traditional navigation baselines and other sensor/model conﬁgurations, as well as reliable transferability to new environments.

2

Related Work

Obstacle avoidance and navigation without a prior map has been studied starting with wheeled mobile robots equipped with bumpers and sonar sensors [12] and later branching oﬀ into diﬀerent environments and sensor conﬁgurations. For underwater domains, one of the main challenges is the limit of choices for sensors. While some underwater LiDAR solutions are available [13], they are expensive (US$100,000 or more) and bulky – requiring a laser scanner and a camera. In addition, there is a lack of global positioning systems and the acoustic based positioning systems are aﬀected by noise, making mapping underwater challenging [1]. Our goal is to enable navigation for low-cost AUVs. Therefore, in the following, we discuss applications using sensors (i.e., SBES, cameras) that are typically conﬁgured on low-cost underwater robots. In practice, many underwater navigation systems depend on acoustic, inertial, and magnetic sensors [14–16]. For example, Calado et al. [17] proposed a method where the robot used a SBES to detect obstacles and construct a map of them. However, SBES can only provide a ﬁxed single distance measurement and has high uncertainty given the wide beam cone – around 30◦ . To infer more about the complex scene, the robot must frequently turn in multiple directions, which negatively aﬀects navigation eﬃciency. Alternatively, multi-beam and mechanical scanning sonars can cover a larger ﬁeld of view [18]. Hern´ andez et al. [19] used a multi-beam sonar to simultaneously build an occupancy map of the environment and generate collision-free paths to the goals. Grefstad et al. [20] proposed a navigation and collision avoidance method using a mechanically scanning sonar for obstacle detection. However, a scanning sonar takes a few seconds to scan a 360◦ view. The acoustic sensors’ accuracy depends on the environment structure and the type of reﬂections that arise. In addition, multi-beam and mechanical scanning sonars are signiﬁcantly more expensive than monocular cameras and SBES (in the order of >US$10k vs. US$10 - US$100). While cameras have shown to provide dense real-time information about the surroundings out of the water [21], there are fewer underwater obstacle avoidance methods that use cameras. The underwater domain indeed poses signiﬁcant challenges, including light attenuation and scattering. Most work considers reactive controls, i.e., no goal is speciﬁed. Rodr´ıguez-Teiles et al. [22] segmented RGB images to determine the direction for escape. Drews-Jr et al. [23] estimated a relative depth using the underwater dark channel prior and used that

88

P. Yang et al.

estimated information to determine the action. There has been recent eﬀorts in 3D trajectory optimization for underwater robots. Xanthidis et al. [24] proposed a navigation framework for AUV planning in cases when a map is known or when a point cloud provided by a visual-inertial SLAM system [5] is available. Our proposed method navigates the robot to 3D waypoints without explicit representation of the environment. Recently, deep learning (DL) methods have shown to work well with underwater robots. Manderson et al. [25] proposed a convolutional neural network that takes input RGB images and outputs unscaled, relative path changes for AUV driving. The network was trained with human-labeled data with each image associated with desired changes in yaw and/or pitch to avoid obstacles and explore interesting regions. Later it was extended with a conditional-learning based method for navigating to sparse waypoints, while covering informative trajectories and avoiding obstacles [26]. Our proposed method does not require human-labeled data. Amidst the progress in DRL, there is more research on robots operating out of water with monocular cameras. Some of these methods addressed the problem of safe endless 2D navigation without specifying any target location. Xie et al. [27] trained a Double Deep Q-network to avoid obstacles in simulated worlds and tested it on a wheeled robot. Kahn et al. [28] proposed a generalized computation graph for robot navigation that can be trained with fewer samples by subsuming value-based model-free and model-based learning. Other works provided the goal as a target image instead of a location [29–31]. Some methods, based on an end-to-end network, guided the robot to the goal using LiDAR or RGB-D cameras [10,32–34] and goal’s relative position for path planning. Recently, a DD-PPO based method was used to navigate a robot in an unknown indoor (simulated) environment, using a RGB-D camera, GPS, and compass [11]. Our method will be based on PPO, with the additional challenge of not having depth information directly from the camera. Nevertheless, due to the diﬃculties of applying DRL in real-world environments, most works performed training in simulation. However, policies learned in simulated environments may not transfer well to the real-world environment, due to the existence of reality (sim-to-real) gap [35]. To address this, several methods utilized domain randomization, where parameters of the simulated world were varied so that policies learned remained robust in real-world domain. For example, Sadeghi and Levine [36] proposed a DRL approach for indoor ﬂight collision avoidance trained only in CAD simulation that was able to generalize to the real world by highly randomizing the simulator’s rendering settings. Our approach draws from the advances in DRL: we design an end-to-end pipeline for low-cost underwater robot navigation to address the underwater challenges, combining multiple sensors and applying domain randomization.

3

Approach

The problem considered in this paper is as follows: an underwater robot deployed in an unknown environment needs to navigate to a goal location G ∈ R3 , minimizing the travel time, while avoiding collisions with obstacles.

Collision-Free Navigation

89

Fig. 2. Flowchart for the proposed end-to-end underwater 3D navigation system. The pipeline includes two stages: a depth prediction module (DPT) followed by a decision making module (PPO). During training, at each episode i, the robot is deployed in a of the raw RGB randomized simulated environment. Predicted depth map oimageDepth t , relative goal position ogoal , echo-sounder reading orange , and previous image oimageRGB t t t executed action at−1 are stacked with past k observations from the previous times steps to feed into the PPO network (solid lines). The robot performs the action sampled from the output policy distribution. New observations (dashed lines) are then obtained for computing the next action at time step t + 1. During real-world deployment, DPT’s computationally less expensive counterpart MiDaS was used as the depth prediction module for real-time inference.

To develop a mapless navigation solution for low-cost robots, we consider an underwater thruster-vectored robot that has an inexpensive sensor suite composed of: (1) a monocular camera, (2) a SBES placed below the camera and looking forward, (3) a compass, (4) pressure sensor for water depth, and (5) a (noisy) localization system. Selecting this sensor conﬁguration allows us to exploit the larger ﬁeld of view (FOV) covered by the camera while obtaining absolute front distance estimates with the ﬁxed SBES. For a general solution, robust to noise and changing visual conditions, we approach the real-time 3D navigation problem by devising an end-to-end system see Fig. 2 based on a neural network for dense depth prediction from monocular images and on a deep reinforcement learning method that takes as input the sensor suite data and outputs vertical and steering commands. We consider a window of prior measurements and executed actions given the absence of prior knowledge of the environment. In the remainder of this section, we describe in detail the RL approach, the depth prediction network, and how to address the sim-to-real gap. 3.1

Multi-modal Deep Reinforcement Learning Navigation

Given an unknown environment, the navigation problem can be formulated as a Partially Observable Markov Decision Process (POMDP), deﬁned with a 6tuple: state space S that cannot be directly observed by the robot, action space A modifying the current state of the robot, observation space Ω, a state-transition model T , the observation probability distribution O, and a reward function R which returns the reward after a state transition.

90

P. Yang et al.

Observation Space. The observation Ot at time step t consists of: (1) the predicted depth image oimageDepth ∈ R128×160 ; (2) an SBES range measuret range ment ot ∈ R; (3) the current relative goal position ogoal ∈ R3 – speciﬁcally, t h v h h v [Dt , Dt , θt ] , where Dt , Dt are robot’s current horizontal, vertical distances to the goal and θth represents the relative yaw heading diﬀerence; and (4) the past ∈ R2 . We stack observations considering a time winexecuted actions oaction t dow k to capture the robot’s progress towards the goal and to avoid obstacles that left the periphery view. In experiments, model using 5 time steps (decision period lasts 0.5 s for each step) showed good performance without adding too much computational expense. Action Space. The action space is at = [vt , ωt ] ∈ R2 , where vt is the vertical linear velocity and ωt is the yaw angular velocity. To generalize the applicability of the learned behavior to diﬀerent robots, we consider the actions to be in a range of [−1.0, 1.0] which will be linearly mapped to the range of velocities of a speciﬁc robot. Note that while we could include the horizontal forward linear velocity, we decided to keep it constant to facilitate surveying missions that require the same velocity to collect consistent high-quality measurements. The action is then given by the policy: at = π(Ot )

(1)

The goal is to ﬁnd the optimal policy π ∗ which maximizes the navigation policy’s expected return over a sequence τ of observations, actions, and rewards: (2) π ∗ = arg max Er∼p(τ |π) γ t rt π

where γ ∈ [0, 1.0] is the discount factor. The optimal policy would translate in a path that is safe and minimizes the time it takes to travel to the goal. Reward Function. Our reward function rt at time t encodes the objectives to stay not too close to any obstacle (rtobs ) and to reach the goal area as soon as possible (rtgoal ). When the robot is close to an obstacle, it will compute a negative reward: ⎧ dht < δh ∨ dvt < δv ∨ dsur < δv ⎨ −rcrash , t obs h δh ≤ dht < 2δh rt = −s0 (2δh − dt ), (3) ⎩ 0 otherwise where δh , δv represent the thresholds for the distances of the robot to the closest obstacle dht , dvt – horizontally or vertically, respectively. We also check the distance to the water surface dsur t , as there might be surface obstacles that cannot be detected given the sensor conﬁguration of the robot. The threshold values δh , δv should consider the robot’s size and turning radius. When any of the constraints are met – i.e., the robot is too close to an obstacle or the surface – the current episode terminates with a large negative constant reward −rcrash . In addition, to guarantee safety, a penalty for motions within a range [δh , 2δh ) of distance

Collision-Free Navigation

91

to nearby obstacles is given according to the current distance. Otherwise, if the robot is far from the obstacles, no negative reward is applied. To guide the robot towards the goal both horizontally and vertically, we split the goal-based reward into two parts. First, the horizontal goal-based reward: −s1 |θth |, Δh < Dth (4) rtgoalh = h rsuccess − s2 |θt |, otherwise If the robot’s horizontal distance to the goal Dht is greater than a threshold Δh , then the penalty is based on the robot’s orientation to the goal – i.e., a robot already facing the goal gets a smaller penalty, as the constant forward velocity will ensure shorter arrival time. Otherwise, if the robot is within the goal area, then there is a positive reward with a preference to the robot’s orientation towards the goal. Likewise, the vertical goal-based reward: ⎧ ⎨ s3 |D˙ tv |, D˙ tv ≤ 0 ∧ Δh < Dth goalv = −s3 |D˙ tv |, D˙ tv > 0 ∧ Δh < Dth rt (5) ⎩ −s4 |Dtv |, otherwise When the robot is not near the goal, the vertical goal-based reward is a positive value if the change in vertical distance over time D˙ tv is negative or 0 – i.e., the robot is getting closer to the target depth. On the contrary, it is a negative value if the change is positive – i.e., the robot is getting farther from the target depth. Otherwise, if the robot is within goal area, the negative reward is relative to the distance to the target depth. This split (horizontal and vertical) of the goal reward showed better stability in experiments than when a single combined goal reward was applied, potentially due to the separate focus of two mostly independent actions. The above obstacle- and goal-based rewards conﬂict with each other; they could lead to oscillations at local optima when an obstacle is nearby. Thus, we devised a priority-based strategy (when the robot is not in the goal area) that focuses on moving away from the obstacle by scaling rtgoalh : rtgoalh ∗ = s5 (dht − δh )/δh , Δh < Dth ∧ δh ≤ dht < 2δh

(6)

In all the reward equations, s0 , . . . , s5 are positive scaling factors. Intuitively, they are set so that rewards are in an appropriate scale for a balanced training performance. Finally, the collective reward at time t can be obtained as: rt = rtobs + rtgoalh + rtgoalv

(7)

Network Architecture. The network structure depicted in Fig. 3(left) illustrates how we integrate the information vectors from the sensors. First, the stacked predicted depth images are processed by three convolutional layers, then

92

P. Yang et al.

Fig. 3. (left) Network architecture. Predicted depth images are processed by three layers of convolutional layers (orange). Its output is ﬂattened and concatenated with feature vectors (green) representing the stacked relative goal positions, echo-sounder readings, and past actions. The ﬁnal fully-connected layer outputs a navigation policy and state value. (right) Top View of the training env. Our model was trained in the above simulated environment in area A (inside area with fewer obstacles and smaller space) and B (outside area with more obstacles and larger space).

the ﬂattened output ∈ R512 is concatenated with processed feature vectors consisting of the stacked relative goal positions ∈ R96 , SBES readings ∈ R32 , and past actions ∈ R64 . Speciﬁcally, the combined echo-sounder readings provide an implicit scale on the relative depth prediction without requiring calibration. The network will produce a navigation policy and state value. 3.2

Image Depth Prediction Network

Accurate image depth predictions is important for our navigation pipeline to work. Previous work used ground truth simulated depth images with Gaussian noise as input for training and applied depth estimation during deployment [27]. However, this broadens the sim-to-real gap as real-world noise in depth predictions is more complex than implemented simulated noise models [37]. Instead, we utilized one of the latest monocular depth prediction networks, Dense Prediction Transformer (DPT) [38], which has an encoder-decoder design and applies a transformer as the encoder’s main building block. We selected DPT over other deep neural networks for depth prediction for its state-of-the-art performance in single-view depth estimation and robustness across diverse environments. 3.3

Transferable Model

DRL often has the problem of generalization: models trained in one domain fail to transfer to other domains even if there are small diﬀerences between the domains [39]. Unlike in-air, images taken underwater will look drastically diﬀerent across various environments due to the more complex lighting and backscattering eﬀects [40]. Thus, training the model in a single ﬁxed environment would lead to over-ﬁtting to that environment’s visual conditions. One solution is to retrain the depth prediction network with an existing underwater image depth dataset, which, however, is not available. Another solution

Collision-Free Navigation

93

is to enhance the input underwater images to its approximate in-air counterpart [40,41]. Yet, most image enhancement techniques require diﬃcult-toretrieve information (e.g., water attenuation coeﬃcients, depth maps). Our approach is to integrate underwater features into the simulation used for training. We modiﬁed an existing underwater simulator framework for games to create the training and testing simulations for our proposed approach. The framework contains custom shaders that incorporates a light transmission model to simulate underwater optical eﬀects, thus providing a good amount of realism. Domain Randomization. We integrated domain randomization to generate underwater environments with diﬀerent visual conditions, thus enabling transferability. In particular, at the start of every training episode, we randomize the underwater visibility – the gradient and conditions in visibility over distance. Visibility was selected as it signiﬁcantly impacts the relative depth estimation, thus aﬀecting to a large extent how the robot perceives its surroundings. We decided not to apply domain adaptation [42] – i.e., the process of learning diﬀerent environment encoding and corresponding adapted policy during training, so that during testing the best environment encoding will be found with the corresponding adapted policy – because searching the best environment encoding is not very practical for underwater deployments. For instance, the search would require robot motions towards obstacles to identify the (potentially changing) visibility feature of the speciﬁc environment. Multi-scenario Training. We built the simulated training environment via Unity Engine1 . We generated two activity areas to represent two classes of environments that an AUV might encounter: A – a small area with fewer obstacles, and B – a big cluttered area with obstacles at various positions and heights (see Fig. 3(right)). In each training episode, the robot’s starting pose and goal location are randomly reset in the environment. This exposure to diﬀerent training scenarios ensures that the learned policy will be more likely to handle more complex environments [35].

4

Experimental Results

We trained and performed experiments in simulation, in real-world with a vectorthruster underwater robot, and with underwater datasets to validate our DRLbased multi-modal sensor navigation system. We performed comparisons and ablation studies with other methods. Our framework is publicly available2 . 4.1

Training Experimental Settings

Our model was ﬁrst trained and tested on a workstation with two 12 GB NVIDIA 2080Ti GPUs. It was implemented with PyTorch and Adam optimizer [43]. 1 2

http://www.unity.com/. https://github.com/dartmouthrobotics/deeprl-uw-robot-navigation.

94

P. Yang et al.

Fig. 4. Partial top view of runs in Cluttered Env. (left): Bug2 (second), Our Model w/o SBES (third), and Our Model w/ SBES (right). Legend: robot’s start pose (green dot); obstacles (black dots); waypoints to reach in order (circled numbers).

In simulation, the robot’s forward velocity, vertical velocity range, and yaw angular velocity range were set to 0.345 m/s, (–0.23, 0.23) m/s, (–π/6π/6) rad/s, respectively. While the training environment allows for higher velocities, we chose low velocities to avoid any “jerky” motion that could happen with the AUV at high speed. The camera’s horizontal and vertical FOVs were set to 80◦ and 64◦ . The simulated echo-sounder’s max detection range was set to 4 m, which are all consistent with the real-world sensor conﬁguration. The simulation environments’ visibility value was randomly chosen within the range of (339) m. We trained for 250 iterations – each with at least 2048 time steps – and observed the reward was stable after around 120 iterations (learning rate of 3e5). The detailed constant and threshold values for the reward function – i.e., rsuccess , rcrash , Δh , δh , and δv – were set to 10, 10, 0.6 m, 0.5 m and 0.3 m, while the scaling factors s0 , s1 , . . . , s5 were set to 2.0, 0.1, 1.0, 1.0, 8.0, 1.0. 4.2

Performance Comparison with Diﬀerent Sensor Conﬁgurations

We ﬁrst tested the eﬃciency of our proposed multi-modal low-cost navigation approach against a traditional metric-based goal-oriented navigation method that does not require any map, given that no map of the underwater environment is available. In particular, we selected Bug2 algorithm given its guarantees on the path length. To have Bug2 work eﬀectively, we employed a multi-beam sonar Table 1. Waypoint tests results. 10 runs for each of the three methods: Bug2 with multi-beam sonar, our model trained without ﬁxed single-beam echo-sounder, and our proposed model. The travel time average and standard deviation (in seconds) of successful runs for each waypoint were calculated, as well as the overall success ratio to reach all ﬁve waypoints. Method

Sensors

Traveling Time/s (less is better) wp1

Bug2

MBS

Ours w/o SBES Monocular Camera Ours w/ SBES

wp2

wp3

wp4

57.6 ± 0.3

66.95 ± 0.15 41.15 ± 0.45 69.8 ± 0.9

51.8 ± 5.94

56.5 ± 2.09

wp5

Success ratio (higher is better)

77.65 ± 0.45 100%

35.62 ± 8.07 47.0 ± 2.03 76.0 ± 2.21

40%

Monocular Camera & SBES 38.35 ± 0.45 49.8 ± 0.78 29.3 ± 0.78 44.3 ± 0.6 67.25 ± 0.6 100%

Collision-Free Navigation

95

Fig. 5. Example of trajectories in diﬀerent scenes with diﬀerent training. Legend: robot’s initial position and goal waypoint (green and red dots); robot collision (red “X”); obstacles (approximated with polygons in the plots for simplicity).

(MBS), a common but expensive sensor for underwater obstacle avoidance, which emits multiple beams in a plane with a typical horizontal FOV of 120◦ . We also considered our model trained without the echo-sounder as ablation study to observe the eﬀect of the SBES. We generated a test environment in simulation with multiple obstacles. The robot’s task was to navigate to ﬁve randomly set consecutive waypoints. We set all waypoints at the same depth, as typical navigation with an MBS involves the robot ﬁrst arriving to the target depth and then navigating along the 2D plane. Figure 4 shows the trajectories of the three navigation methods and Table 1 reports the quantitative results measured in terms of traveling time and success ratio. Our proposed system with inexpensive monocular camera and SBES achieved the highest navigation eﬃciency with comparable safety to Bug2 with MBS. While the Bug2 trajectory appeared not to be aﬀected by noise, it spent the longest navigation time especially when moving along the obstacles. Note the echo-sounder played a fundamental role in safe navigation. If the echo-sounder was excluded, the model relied solely on relative monocular image depth estimation to detect surrounding obstacles. As a result, at times the chosen action might be conservative, leading to sub-optimal paths in terms of distance, or too aggressive, increasing the likelihood of collision. 4.3

Ablation Study with Transferability Tests

To show the transferability of our proposed model to diﬀerent environments and visibilities, we performed an ablation study with the same hyper-parameters and protocols, but considering the following combinations of training settings in a simulated underwater environment: (1) Rand : proposed domain randomization, (2) No Rand (Water): ﬁxed underwater visibility (approximately 11 m), and (3) No Rand (Air): no underwater features. To ﬁrstly exhibit the models’ generalizability, another simulated environment3 was employed for testing. With 3

https://github.com/Scrawk/Ceto.

96

P. Yang et al.

Table 2. Quantitative results for transferability tests. 10 runs for the three models in three scenes with diﬀerent visual conditions. Note: N/A means the method failed to reach the goal during the runs and bold means the best result. Method

Scene1 Blurry

No Rand (Air)

Reward 5.74 ± 2.17 Success 0% Trav. time N/A

Medium

Clear

Scene2 Blurry

Medium

6.5 ± 5.95 10% 70.0

28.14 ± 2.85 100% 67.2 ± 0.84

0.43 ± 2.26 0% N/A

No Rand (Water) Reward 25.27 ± 8.42 18.35 ± 11.18 13.46 ± 14.51 2.19 ± 1.78 Success 90% 90% 40% 0% Trav. time 70.5 ± 4.93 88.17 ± 18.36 69.25 ± 1.35 N/A Rand

Scene3 Blurry

Medium

Clear

10.93 ± 11.31 12.05 ± 8.92 40% 50% 53.12 ± 0.65 55.2 ± 2.84

24.64 ± 10.19 70% 63.29 ± 0.88

20.58 ± 13.7 60% 66.5 ± 4.53

29.18 ± 8.01 90% 66.11 ± 1.07

−1.58 ± 5.94 10% 115.0

18.03 ± 11.32 60% 71.42 ± 6.9

30.14 ± 7.5 90% 73.39 ± 2.63

29.42 ± 3.27 100% 65.35 ± 0.78

Clear

15.04 ± 10.6 70% 59.79 ± 8.25

Reward 24.66 ± 9.3 28.39 ± 2.26 29.56 ± 2.58 21.68 ± 9.61 23.36 ± 7.49 24.86 ± 2.92 29.17 ± 11.34 30.26 ± 9.25 36.26 ± 0.83 Success 90% 100% 100% 80% 90% 100% 80% 90% 100% Trav. time 67.56 ± 0.44 68.45 ± 0.72 67.05 ± 1.27 52.0 ± 0.35 53.44 ± 1.23 50.75 ± 0.46 60.75 ± 0.56 62.56 ± 0.98 61.05 ± 0.57

diﬀerent materials, textures, lightings and custom shaders, it had a diﬀerent visual appearance compared to the training environment. In this environment, the models were tested in three diﬀerent scenes, constructed to resemble possible underwater obstacles present in the real-world, such as natural structures (Scene1), submerged wrecks (Scene2) and man-made structures (Scene3). We considered three visibility scenarios: blurry, medium, and relatively clear, with maximum visibility ranges of 8 m, 12 m, and 20 m, respectively. Figure 5 shows snapshots of each scene and the resulting trajectories in some sample runs. Comparison Metrics. The following metrics were used to compare the three methods’ performances (see Table 2): 1) Rewards (higher is better): cumulative reward average and standard deviation over 10 runs, 2) Success Ratio (higher is better): number of times the robot reached the goal with no collision over 10 runs, 3) Travel Time (less is better): average and standard deviation traveling time (s). Failed runs were not considered. From the results, training with underwater features has the highest gain. Adding domain randomization allows a further increase of the cumulative rewards, success rate, and travel time. Models trained without randomization did not previously encounter abundant visual conditions, thus explored a limited observation space. Accordingly, they would not be easily applicable to diﬀerent visibility conditions and are more vulnerable to noise especially in low-visibility environments when depth estimations are inaccurate. Scene3 in particular was challenging with blurry visibility, due to the narrow passage between the logs. 4.4

Performance Demonstration in Real-World Environment

We conducted real-world experiments with a BlueROV2 in a swimming pool. The robot was equipped with a Sony IMX322LQJ-C camera4 with a resolution 4

https://www.bluerobotics.com/store/sensors-sonars-cameras/cameras/cam-usblow-light-r1/.

Collision-Free Navigation

97

Fig. 6. Pool experiment. Navigation trajectories with localization noise smoothing (legend: Start and goal green and red dots; obstacles, cuboids) and images from the robot’s camera. Red arrows point to the approximate goal locations behind the boxes.

Fig. 7. Single image action and depth prediction. 1st row: images from Lake Sunapee and Caribbean Sea. 2nd row: their respective depth predictions. Direction and magnitude of the action predicted (red arrow); approximate goal location (yellow arrow).

of 5 MP, a horizontal and vertical FOV of 80◦ and 64◦ . The ﬁxed SBES has a 30◦ beam width and a maximum range set to 4 m. The (noisy) robot’s pose was provided by an on-board compass, a water-pressure sensor to recover water depth, and a short baseline acoustic positioning system (SBL)5 . A 2.8 GHz Intel i7 laptop with Nvidia Quadro M1200 was used for running the inference network through the Robot Operating System (ROS). For real-time inference, DPT was replaced with its computationally less expensive counterpart MiDaS [44] as our depth prediction network – about 0.08 s per inference. The swimming pool was about 20 m by 7 m in size with a shallow (1 m) and deep (3 m) end, and a slope in the middle. Two black boxes (approximate size: 0.8 × 0.5 × 0.3 m were placed in two diﬀerent conﬁgurations: side by side as a large obstacle and with a 1 m separation to create a channel. Resulting paths and reference images are shown in Fig. 6. Our proposed navigation approach successfully drove the BlueROV2 to diﬀerent 3D waypoints, 5

https://waterlinked.github.io/explorer-kit/introduction/.

98

P. Yang et al.

avoiding obstacles by going around, above, or through a channel (see Fig. 6). We observed that the SBL provided noisier position information compared to in simulation – at times the robot’s location jumped up to a meter. While the noise aﬀected the calculation of the relative position to the goal, our approach does not depend on the absolute robot location to infer obstacle distance, so the robot was able to avoid obstacles. 4.5

Action Prediction from Static Underwater Images

We also tested joint image and SBES reading data from past ﬁeld trials (in the Caribbean Sea and lake) as input to our model for action prediction. Figure 7 shows a sample of such images with corresponding depth predictions, locations of the goal, and predicted actions. As expected, with obstacles nearby the predicted action prioritized obstacle avoidance, steering the robot away, otherwise, the action’s direction pointed towards the goal. This qualitative test demonstrates our model’s generalizability to real-world applications.

5

Conclusion and Future Work

We presented the ﬁrst 3D map-less underwater navigation approach, based on Proximal Policy Optimization Network (PPO) and domain randomization, for low-cost underwater robots with a monocular camera and a ﬁxed single-beam echo-sounder. By choosing deep reinforcement learning over classic methods, we were able to address the intrinsic challenges of seamless underwater navigation (e.g., lack of low-cost eﬃciency sensor and diﬃculty in generating a map given noisy positioning and perception data). We validated our approach with several comparisons and ablation studies in diﬀerent simulated environments, as well as real-world validation in a swimming pool and with static underwater images. Results showed that the robot is able to navigate to arbitrary 3D goals while avoiding obstacles inferred from estimated depth images and sonar readings. In the future, we will investigate explicit sensor fusion of camera and SBES data to achieve better depth prediction with absolute scale, e.g. early fusion [45], as well as controller and SBL data. In addition, we will consider the generation of more complex environments, other real-world experiments, and the design of integrated models for diﬀerent sensor conﬁgurations (e.g., stereo cameras) and dynamic models to adapt our method to heterogeneous underwater robots. Acknowledgments. We thank Devin Balkcom for access to the pool for experiments, and Bo Zhu, Mary Flanagan, and Sukdith Punjasthitkul for GPU access. This work is supported in part by the Burke Research Initiation Award and NSF CNS-1919647, 2024541, 2144624, OIA-1923004.

References 1. Petillot, Y.R., Antonelli, G., Casalino, G., Ferreira, F.: Underwater robots: from remotely operated vehicles to intervention-autonomous underwater vehicles. IEEE Robot. Autom. Mag. 26, 94–101 (2019)

Collision-Free Navigation

99

2. Pfrunder, A., Borges, P.V., Romero, A.R., Catt, G., Elfes, A.: Real-time autonomous ground vehicle navigation in heterogeneous environments using a 3D LiDAR. In: Proceedings of the IROS (2017) 3. Engel, J., Koltun, V., Cremers, D.: Direct sparse odometry. IEEE Trans. Pattern Anal. Mach. Intell. 40, 611–625 (2017) 4. Campos, C., Elvira, R., Rodr´ıguez, J.J.G., Montiel, J.M., Tard´ os, J.D.: ORBSLAM3: an accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Trans. Robot. 37, 1874–1890 (2021) 5. Rahman, S., Quattrini Li, A., Rekleitis, I.: SVIn2: an underwater SLAM system using sonar, visual, inertial, and depth sensor. In: Proceedings of the IROS (2019) 6. Panagou, D.: Motion planning and collision avoidance using navigation vector ﬁelds. In: Proceedings of the ICRA (2014) 7. Fox, D., Burgard, W., Thrun, S.: The dynamic window approach to collision avoidance. IEEE Robot. Autom. Mag. 4, 23–33 (1997) 8. Kober, J., Bagnell, J.A., Peters, J.: Reinforcement learning in robotics: a survey. Int. J. Robot. Res. 32, 1238–1274 (2013) 9. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015) 10. Xie, L., Wang, S., Rosa, S., Markham, A., Trigoni, N.: Learning with training wheels: speeding up training with a simple controller for deep reinforcement learning. in: Proceedings of the ICRA (2018) 11. Wijmans, E., et al.: DD-PPO: learning near-perfect pointgoal navigators from 2.5 billion frames. In: Proceedings of the ICLR (2020) 12. Choset, H.M., et al.: Principles of Robot Motion: Theory, Algorithms, and Implementation. MIT Press, Cambridge (2005) 13. McLeod, D., Jacobson, J., Hardy, M., Embry, C.: Autonomous inspection using an underwater 3D LiDAR. In: Proceedings of the OCEANS (2013) 14. Kinsey, J.C., Eustice, R.M., Whitcomb, L.L.: A survey of underwater vehicle navigation: Recent advances and new challenges. In: IFAC Proceedings of the Conference of Manuevering and Control of Marine Craft (2006) 15. Williams, S.B., Newman, P., Rosenblatt, J., Durrant-Whyte, H.: Autonomous underwater navigation and control. Robotica 19, 481–496 (2001) 16. Paull, L., Saeedi, S., Seto, M., Li, H.: AUV navigation and localization: a review. IEEE J. Ocean. Eng. 39, 131–149 (2013) 17. Calado, P., et al.: Obstacle avoidance using echo sounder sonar. In: Proceedings of the OCEANS (2011) 18. Petillot, Y., Ruiz, I.T., Lane, D.M.: Underwater vehicle obstacle avoidance and path planning using a multi-beam forward looking sonar. IEEE J. Ocean. Eng. 26, 240–251 (2001) 19. Hern´ andez, J.D., Vidal, E., Vallicrosa, G., Galceran, E., Carreras, M.: Online path planning for autonomous underwater vehicles in unknown environments. In: Proceedings of the ICRA (2015) 20. Grefstad, Ø., Schjølberg, I.: Navigation and collision avoidance of underwater vehicles using sonar data. In: IEEE/OES Autonomous Underwater Vehicle Workshop (AUV), pp. 1–6 (2018) 21. Liu, F., Shen, C., Lin, G., Reid, I.: Learning depth from single monocular images using deep convolutional neural ﬁelds. IEEE Trans. Pattern Anal. Mach. Intell. 38(10), 2024–2039 (2015) 22. Rodr´ıguez-Teiles, F.G., et al.: Vision-based reactive autonomous navigation with obstacle avoidance: Towards a non-invasive and cautious exploration of marine habitat. In: Proceedings of the ICRA (2014)

100

P. Yang et al.

23. Drews-Jr, P., Hern´ andez, E., Nascimento, E.R., Campos, M.: Real-time monocular obstacle avoidance using underwater dark channel prior. In: Proceedings of the IROS (2016) 24. Xanthidis, M., et al.: Navigation in the presence of obstacles for an agile autonomous underwater vehicle. In: Proceedings of the ICRA (2020) 25. Manderson, T., Higuera, J.C.G., Cheng, R., Dudek, G.: Vision-based autonomous underwater swimming in dense coral for combined collision avoidance and target selection. In: Proceedings of the IROS (2018) 26. Manderson, T., et al.: Vision-based goal-conditioned policies for underwater navigation in the presence of obstacles. In: Proceedings of the RSS (2020) 27. Xie, L., Wang, S., Markham, A., Trigoni, N.: Towards monocular vision based obstacle avoidance through deep reinforcement learning. In: RSS workshop on New Frontiers for Deep Learning in Robotics (2017) 28. Kahn, G., Villaﬂor, A., Ding, B., Abbeel, P., Levine, S.: Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In: Proceedings of the ICRA (2018) 29. Zhu, Y., et al.: Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: Proceedings of the ICRA (2017) 30. Devo, A., Mezzetti, G., Costante, G., Fravolini, M.L., Valigi, P.: Towards generalization in target-driven visual navigation by using deep reinforcement learning. IEEE Trans. Robot. 36, 1546–1561 (2020) 31. Wu, Q., Gong, X., Xu, K., Manocha, D., Dong, J., Wang, J.: Towards target-driven visual navigation in indoor scenes via generative imitation learning. IEEE Robot. Autom. Lett. 6, 175–182 (2020) 32. Pfeiﬀer, M., Schaeuble, M., Nieto, J., Siegwart, R., Cadena, C.: From perception to decision: a data-driven approach to end-to-end motion planning for autonomous ground robots. In: Proceedings of the ICRA (2017) 33. Zhang, J., Springenberg, J.T., Boedecker, J., Burgard, W.: Deep reinforcement learning with successor features for navigation across similar environments. In: Proceedings of the IROS (2017) 34. Liang, J., Patel, U., Sathyamoorthy, A.J., Manocha, D.: Crowd-steer: realtime smooth and collision-free robot navigation in densely crowded scenarios trained using high-ﬁdelity simulation. In: Proceedings of the IJCAI (2021) 35. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., Abbeel, P.: Domain randomization for transferring deep neural networks from simulation to the real world. In: Proceedings of the IROS (2017) 36. Sadeghi, F., Levine, S.: CAD2RL: real single-image ﬂight without a single real image. In: Proceedings of the RSS (2017) 37. Sweeney, C., Izatt, G., Tedrake, R.: A supervised approach to predicting noise in depth images. In: Proceedings of the ICRA (2019) 38. Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction. In: Proceedings of the ICCV (2021) 39. Cobbe, K., Klimov, O., Hesse, C., Kim, T., Schulman, J.: Quantifying generalization in reinforcement learning. In: Proceedings of the ICML (2019) 40. Akkaynak, D., Treibitz, T.: A revised underwater image formation model. In: Proceedings of the CVPR (2018) 41. Roznere, M., Quattrini Li, A.: Real-time model-based image color correction for underwater robots. In: Proceedings of the IROS (2019) 42. Peng, X.B., Coumans, E., Zhang, T., Lee, T.-W., Tan, J., Levine, S.: Learning agile robotic locomotion skills by imitating animals. In: Proceedings of the RSS (2020)

Collision-Free Navigation

101

43. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: Proceedings of the ICLR (2014) 44. Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 44, 1623–1637 (2019) 45. Roznere, M., Quattrini Li, A.: Underwater monocular depth estimation using single-beam echo sounder. In: Proceedings of the IROS (2020)

Nonmyopic Distilled Data Association Belief Space Planning Under Budget Constraints Moshe Shienman(B) and Vadim Indelman Technion - Israel Institute of Technology, 32000 Haifa, Israel [email protected], [email protected] Abstract. Autonomous agents operating in perceptually aliased environments should ideally be able to solve the data association problem. Yet, planning for future actions while considering this problem is not trivial. State of the art approaches therefore use multi-modal hypotheses to represent the states of the agent and of the environment. However, explicitly considering all possible data associations, the number of hypotheses grows exponentially with the planning horizon. As such, the corresponding Belief Space Planning problem quickly becomes unsolvable. Moreover, under hard computational budget constraints, some non-negligible hypotheses must eventually be pruned in both planning and inference. Nevertheless, the two processes are generally treated separately and the eﬀect of budget constraints in one process over the other was barely studied. We present a computationally eﬃcient method to solve the nonmyopic Belief Space Planning problem while reasoning about data association. Moreover, we rigorously analyze the eﬀects of budget constraints in both inference and planning. Keywords: Planning under uncertainty

1

· Robust perception · SLAM

Introduction

Intelligent autonomous agents and robots are expected to operate reliably and eﬃciently under diﬀerent sources of uncertainty. There are various possible reasons for such uncertainty, including noisy measurements; imprecise actions; and dynamic environments in which some events are unpredictable. In these settings, autonomous agents are required to reason over high-dimensional probabilistic states known as beliefs. A truly autonomous agent should be able to perform both inference, i.e. maintain a belief over the high-dimensional state space given available information, and decision making under uncertainty. The latter is also known as the Belief Space Planning (BSP) problem, where the agent should autonomously determine its next best actions while reasoning about future belief evolution. However, both inference and BSP are computationally expensive and practically infeasible in real-world autonomous systems where the agent is required to operate in real time using inexpensive hardware. This research was partially supported by US NSF/US-Israel BSF, and by Israel PBCVATAT and the Technion Artiﬁcial Intelligence Hub (Tech.AI). c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 102–118, 2023. https://doi.org/10.1007/978-3-031-25555-7_8

Nonmyopic Distilled Data Association BSP Under Budget Constraints

103

In real-world scenarios, an autonomous agent should also be resilient to the problem of ambiguous measurements. These ambiguities occur when a certain observation has more than one possible interpretation. Some examples include the slip/grip behavior of odometry measurements; the loop closure problem in visual Simultaneous Localization and Mapping (SLAM); and unresolved data association. The latter is deﬁned as the process of associating uncertain measurements to known tracks, e.g. determine if an observation corresponds to a speciﬁc landmark within a given map. Most existing inference and BSP algorithms assume data association to be given and perfect, i.e. assume a single hypothesis represented by a uni-modal state and map estimates. Yet, in perceptually aliased environments, this assumption is not reasonable and could lead to catastrophic results. Therefore, it is crucial to reason about data association, in both inference and planning, while also considering other sources of uncertainty. Explicitly reasoning about data association, the number of hypotheses grows exponentially with time. As such, when considering real time operation using inexpensive hardware, hard computational constraints are often required, e.g. bounding the number of supported hypotheses. State of the art inference and planning approaches therefore use diﬀerent heuristics, e.g. pruning and merging, to relax the computational complexity. However, this loss of information incurs loss in solution quality and there are usually no performance guarantees. Moreover, inference and planning are commonly treated separately and it is unclear how budget constraints in one process aﬀect another. In this work we extend our presented approach in [17] to a nonmyopic setting. Speciﬁcally, we handle the exponential growth of hypotheses in BSP by solving a simplified problem while providing performance guarantees. To that end, we analyze for the ﬁrst time, the construction of a belief tree within planning given a mixture belief, e.g. Gaussian Mixture Models (GMM). We further show how to utilize the skeleton of such belief tree to reduce the computational complexity in BSP. Crucially, this paper thoroughly studies, for the ﬁrst time, the impacts of hard budget constraints in either planning and/or inference.

2

Related Work

Several approaches were recently proposed to ensure eﬃcient and reliable operation in ambiguous environments. Known as robust perception, these approaches typically maintain probabilistic data association and hypothesis tracking. A good inference mechanism should handle false data association made by front-end algorithms and be computationally eﬃcient. The authors of [16] recently suggested to re-use hypotheses’ weights from previous steps to reduce computational complexity and improve current-time hypotheses pruning. Convex relaxation approaches over graphs were proposed in [2,13] to capture perceptual aliasing and ﬁnd the maximal subset of internally coherent measurements, i.e. correct data association. The max-mixture model was presented in [14] to allow fast maximum-likelihood inference on factor graphs [12] that contain arbitrarily complex probability distributions such as the slip/grip multi modal problem. The authors of [7,8] used factor graphs with an expectation-maximization

104

M. Shienman and V. Indelman

approach to eﬃciently infer initial relative poses and solve a multi robot data association problem. In [21] the topological structure of a factor graph was modiﬁed during optimization to discard false positive loop closures. The Bayes tree algorithm [10] was extended in [5,9] to explicitly incorporate multi-modal measurements within the graph and generate multi-hypothesis outputs. These works, however, were only developed for the purpose of inference, i.e. without planning. Ambiguous data association was also considered in planning. In [1] a GMM was used to model prior beliefs representing diﬀerent data association hypotheses. However, the authors did not reason about ambiguous data association within future beliefs (owing to future observations), i.e. they assumed that it is solved and perfect in planning. In [15] the authors introduced DA-BSP where, for the ﬁrst time, reasoning about future data association hypotheses was incorporated within a BSP framework. The ARAS framework proposed in [6] leveraged the graphical model presented in [5] to reason about ambiguous data association in future beliefs. All of these approaches handled the exponential growth in the number of hypotheses by either pruning or merging. The ﬁrst work to also provide performance guarantees on the loss in solution quality was presented in [17]. Yet, the authors only considered a myopic setting. The notion of simpliﬁcation was introduced in [4], where, the authors formulated the loss in solution quality in BSP problems via bounds over the objective function. However, they only considered the Gaussian case and a maximum likelihood assumption. The authors of [23] used bounds as a function of simplified beliefs to reduce the computational complexity in nonmyopic BSP problems with general belief distributions. In [22] they incorporated this concept within a Monte Carlo Tree Search (MCTS) planning framework, i.e. without assuming that the belief tree is given, which is complimentary to our approach. Yet, they did not handle ambiguous data association nor budget constraints aspects.

3

Background and Notations

In this section we review some basic concepts from estimation theory and BSP which we will use in the following sections. 3.1

Inference

Consider an autonomous agent operating in a partially known or pre-mapped environment containing similar landmarks or scenes. The agent acquires observations and tries to infer random variables of interest that are application dependent while reasoning about data association. We denote the agent’s state at time instant k by xk . Let Zk {zk,1 , ..., zk,nk } denote the set of all nk measurements and let uk denote the agent’s action. Z1:k and u0:k−1 denote all observations and actions up to time k, respectively. The motion and observation models are given by (1) xk+1 = f (xk , uk , wk ) , zk = h xk , xl , vk ,

Nonmyopic Distilled Data Association BSP Under Budget Constraints

105

where xl is a landmark pose and wk and vk are noise terms, sampled from known motion and measurement distributions, respectively. Given nk observations, the data association realization vector is denoted by βk ∈ Nnk . Elements in βk are associated according to the given observation model and each element, e.g. landmark, is given a unique label. A speciﬁc data association hypothesis is thus given by a speciﬁc set j of associations up to and j . including time k and is denoted as β1:k At each time step the agent maintains a posterior belief over both continuous and discrete variables given by b [xk , β1:k ] P (xk , β1:k |z0:k , u0:k−1 ) = P (xk , β1:k |Hk ) ,

(2)

where Hk {Z1:k , u0:k−1 } represents history. Using the chain rule, the belief becomes a mixture and can be written as a linear combination of |Mk | hypotheses j j P xk |β1:k , Hk P β1:k |Hk , (3) bk =

j∈Mk bjk

j wk

where bjk is a conditional belief, with some general distribution, and wkj is the associated weight. Therefore, Mk is a set of maintained weighted conditional beliefs, representing diﬀerent data association hypotheses. In this work, we interchangeably refer to each bjk as both a hypothesis and a component. Each conditional belief hypothesis bjk in (3) can be eﬃciently calculated by maximum a posteriori inference, e.g. as presented in [10] for the Gaussian case. Nevertheless, our formulation and approach also applies to a non-parametric setting. Each component weight wkj is calculating by marginalizing over the state space and applying the Bayes rule (as developed in [15,17]). Reasoning about data association, without any computational constraints, the number of considered hypotheses grows exponentially with time. In general, such belief is a function of bk = ψk (bk−1 , uk−1 , Zk ). However, under hard computational constraints, the number of hypotheses is bounded by C ∈ N. Therefore, the belief in each time step is a function of C bψ k = ψk (bk−1 , uk−1 , Zk , C) ,

where ψkC contains some heuristic function hinf such that 3.2

(4) |Mkψ |

≤ C.

DA-BSP

Given a posterior belief (3) and a set of candidate action sequences U the goal of BSP is to ﬁnd the optimal action sequence that would minimize/maximize a certain objective function. We note that while in this paper we consider, for simplicity, action sequences, our approach is applicable also to policies. Reasoning about data association in planning, a user deﬁned objective function J can be written as N c (bk+n , uk+n−1 ) , (5) E J (bk , uk:k+N −1 ) = E β(k+1)+

Z(k+1)+ |β(k+1)+

n=1

106

M. Shienman and V. Indelman

Fig. 1. (a) A belief tree constructed during planning. Each node represents a posterior belief (3); The number of belief components grows exponentially along the highlighted path as presented in (b).

where β(k+1)+ βk+1:k+N , Z(k+1)+ Zk+1:k+N and c (·) denotes a cost function. The expectation is taken with respect to both future data association realizations and observations. The optimal action sequence u∗k:k+N −1 is deﬁned as u∗k:k+N −1 = argmin J (bk , uk:k+N −1 ) . U

(6)

To solve (6) we need to consider all possible future realizations of Zk+n for every n ∈ [k + 1, k + N ] while marginalizing over all possible locations and data association realizations (see Section 5.2 in [15]). However, solving these integrals analytically is typically not feasible. In practice, the solution should be approximated by sampling future observations from the relevant distributions. Using these samples, the agent constructs and traverses a belief tree (as shown in Fig. 1a) which branches according to future actions and observations. Nevertheless, the number of hypotheses grows exponentially with the planning horizon (see Fig. 1b). Speciﬁcally, given |Mk | hypotheses and D data association realizations, i.e. diﬀerent βk+i at each look-ahead step, the number of n belief components at the nth look-ahead step is |Mk+n | = |Mk | |D| . As such, considering every possible future hypothesis is not practical.

4

Methodology

In this section we ﬁrst describe how to construct a belief tree skeleton during planning. We then present a general framework to reduce the computational complexity when solving a sampling based approximation of (5). Finally, we analyze the implications of using our proposed framework under diﬀerent conditions. 4.1

Constructing the Belief Tree skeleton

Previous works addressed the exponential growth of the belief tree with the planning horizon without reasoning about data association. In this work we analyze and describe, for the ﬁrst time, the structure of a belief tree given a mixture belief such as (3). In this setting there is an additional exponential growth in the number of belief components for every considered future observation realization

Nonmyopic Distilled Data Association BSP Under Budget Constraints

107

(see Fig. 1). These realizations are functions of future beliefs (3), data association realizations and actions P(Zk+1:k+n |bk , uk:k+n−1 , βk+1:k+n ).

(7)

To construct the belief tree in practice, we sample states from beliefs, sample data association given states and ﬁnally sample observations from (7). Our key observation is that in order to construct a belief tree skeleton, i.e. without explicitly calculating or holding posterior beliefs at each node, we can sample future observations in two diﬀerent ways. We describe these two options for a planning horizon of n = 2. Speciﬁcally, we can either rewrite (7) as P(Zk+2 |bk+1|βk+1 , uk+1 , βk+2 )P(Zk+1 |bk , uk , βk+1 ),

(8)

where bk+1|βk+1 is a posterior belief and each term is evaluated by integrating over xk+1:k+2 , or, by ﬁrst integrating and then applying the chain rule as

P(Zk+2 |xk+2 , βk+2 ) xk+2

P(xk+2 |xk+1 , uk+1 )P(Zk+1 |xk+1 , βk+1 )P(xk+1 |bk , uk ). (9)

xk+1

While these two expressions are analytically identical, they represent two diﬀerent processes of sampling. In the former observations are sampled from posterior beliefs, while in the latter observations are sampled using the motion and observation models, similar to the MCTS particle trajectories techniques in [20,24].

Algorithm 1: Construct belief tree skeleton 1 2 3 4 5 6 7 8

Input: prior belief bk , action sequence uk:k+n−1 Output: sampled future observations Zk+1:k+n Z=∅ xk ∼ bk for i ∈ [1, n] do xk+i ∼ P(xk+i |xk+i−1 , uk+i−1 ) determine βk+i based on xk+i Zk+i ∼ P(Zk+i |xk+i , βk+i ) Z = Z ∪ Zk+i return Z

To avoid the explicit representation of the exponential number of belief components, in this work we sample future observations using (9) and bypass the inference stage. We formulate this sampling method in Algorithm 1. Yet, this is of little help if the posterior belief is required for calculating the cost function itself. We next describe our approach to avoid these calculations. 4.2

Nonmyopic Distilled Data Association BSP

Our goal is to reduce the computational complexity of nonmyopic BSP problems where ambiguous data association is explicitly considered, i.e. solving (6) eﬃciently. We start by writing (5) in a recursive form

108

M. Shienman and V. Indelman

J (bk , uk:k+N −1 ) = c (bk , uk ) + E

βk+1

E

Zk+1 |βk+1

[J (bk+1 , uk+1:k+N −1 )] .

(10)

As in practice we approximate the solution via samples, we rewrite (10) as

ˆ ˆ Jˆ (bk+1 , uk+1:k+N −1 ) . (11) Jˆ (bk , uk:k+N −1 ) = c (bk , uk ) + E E βk+1

Zk+1 |βk+1

Using Bellman’s principle of optimality, the optimal solution for (11) is

ˆ ˆ Jˆ bk+1 , u∗k+1:k+N −1 Jˆ bk , u ˆ∗k:k+N −1 = min{c (bk , uk ) + E E }, uk

βk+1

Zk+1 |βk+1

(12) where

u ˆ∗k:k+N −1

= argmin Jˆ (bk , uk:k+N −1 ). To reduce the computational comU

plexity in (12), we propose utilizing the belief tree skeleton, without having access to posterior beliefs, to solve an easier to compute version of the considered cost function.

Fig. 2. BSP using bounds over the objective function. In (a) choosing action #1 is guaranteed to be optimal as the corresponding upper bound is lower than all other lower bounds; In (b) choosing action #2 is not guaranteed to be optimal. The loss in solution quality, however, is upper bounded.

In general, the cost function over the original beliefs can be bounded using a simplified belief bsk as c (bsk , uk ) ≤ c (bk , uk ) ≤ c¯ (bsk , uk ) .

(13)

We note that this formulation also supports replacing the cost function itself with a computationally simpler function, as in [11,19]. Using the belief tree skeleton and some method to calculate the simplified beliefs, to be deﬁned, we now traverse the belief tree from the leafs upwards. At each belief tree node the bounds over the objective function (5) are calculated recursively using the Bellman equation (12) and (13) for every n ∈ [0, N − 1]

s ˆ ˆ J bk+n+1 , u(k+n)+ J bk+n , u(k+n)+ = c bk+n , uk+n + E E , βk+1 Zk+n+1 |βk+1

ˆ J¯ bk+n , u(k+n)+ = c¯ bsk+n , uk+n + E

ˆ E

βk+1 Zk+n+1 |βk+1

¯ J bk+n+1 , u(k+n)+ , (14)

Nonmyopic Distilled Data Association BSP Under Budget Constraints

109

where u(k+n)+ uk+n:k+N −1 . If these bounds do not overlap (see Fig. 2a), one can guarantee to select the optimal action sequence as in (12). Our general Nonmyopic Distilled Data Association BSP (ND2A-BSP) approach is presented in Algorithm 2. The algorithm receives a belief tree skeleton; a heuristic function h used to select the subsets of hypotheses in each belief tree node, i.e. deﬁnes bsk+n ; and a decision rule R which decides whether the considered subsets are enough, e.g. when no overlap between bounds is required or when calculations exceed a user deﬁned time threshold, providing anytime performance guarantees. The algorithm returns the best action sequence, given the computational constraints, and an upper bound on the loss in solution quality. It is worth mentioning that our approach can be adapted to a setting where the belief tree construction is coupled with Q function estimates, e.g. using MCTS and Upper Conﬁdence Bound (UCB) techniques [20], following a similar approach to the one presented in [22]. However, we emphasize that as the belief tree skeleton approximates (10) via samples, our method provides performance guarantees with respect to that speciﬁc skeleton, i.e. with respect to (12). Not to be confused with the asymptotic guarantees of MCTS approaches, with respect to the theoretical problem (10), which is an entirely diﬀerent aspect not related to the approach presented in this paper. Algorithm 2: Generic Nonmyopic Distilled Data Association BSP 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18

Input: belief tree skeleton T , simplification heuristic h, decision rule R Output: action sequence u∗ , loss Function ND2A-BSP(T, h, R): LB ∗ , U B ∗ , loss = PLAN (T.root, h, R) u∗ ← corresponding to LB ∗ , U B ∗ return u∗ , loss Function PLAN(N ode, h, R): N ode.bsk+n ← h (N ode) if Node is a leaf then // loss = 0 at leaf return c N ode.bsk+n , c¯ N ode.bsk+n , 0 N ode.bounds = ∅ foreach child C of N ode do lb, ub, loss ← ND2A-BSP(C, h, R) // objective lower bound (14) LB ← c N ode.bsk+n + lb s // objective upper bound (14) U B ← c¯ N ode.bk+n + ub N ode.bounds = N ode.bounds ∪ (LB, U B) while R (N ode.bounds) is not satisified do ND2A-BSP(N ode, h, R) // further simplification is needed LB ∗ , U B ∗ , loss ← N ode.bounds ∗ ∗ return LB , U B , loss

We now analyze diﬀerent settings, within inference and planning, where the agent either has or does not have hard budget constraints. To the best of our knowledge, this is the ﬁrst time that these aspects are addressed in works that attempt to reduce the computational complexity of the planning problem. The diﬀerences between the considered settings are summarized in Table 1. 4.3

No Budget Constraints in Inference

In this section we assume that there are no constraints in inference, i.e. each belief tree node can theoretically hold every possible hypothesis within the planning

110

M. Shienman and V. Indelman

Table 1. A summary of the considered scenarios, with respect to budget constraints on the number of supported hypotheses in each algorithm, for each considered case. Cases 1&2 are presented in Sect. 4.3 while cases 3&4 are presented in Sect. 4.4 Budget constraints in inference Budget constraints in planning Case 1

Case 2

Case 3

Case 4

horizon. The objective of inference however is diﬀerent than the main goal of BSP. In inference the agent tries to represent the considered state as accurately as possible while in planning the goal is to retrieve the optimal action sequence or policy. As such, in this setting, the problems are decoupled (see Fig. 3a). We now further separate between two cases, when the planning algorithm either has budget constraints or not. In both cases, each belief tree node still has an exponential number of components, which we avoid calculating explicitly. Case 1. With no budget constraints in planning we propose bounding the cost function as c bk , uk+ , Z(k+1)+ , bsk+n ≤ c (bk+n , uk+n ) ≤ c¯ bk , uk+ , Z(k+1)+ , bsk+n . (15) where uk+ uk:k+n−1 and Z(k+1)+ Zk+1:k+n . A key diﬀerence from the approach presented in [23] is that these bounds are not functions of bk+n . As the number of belief components grows exponentially we avoid calculating c (bk+n ). Instead, we calculate a simplified belief bsk+n , using Bayesian updates via uk:k+n−1 and Zk+1:k+n , only for speciﬁc components from the prior belief bk . This extends our proposed approach in [17] to the nonmyopic case. Each s ⊆ Mk+n components, as simplified belief is formally deﬁned, using Mk+n bsk+n

s r∈Mk+n

s,r r wk+n bk+n ,

s,r wk+n

r wk+n m,s , wk+n

(16)

m,s m wk+n is used to re-normalize each corresponding where wk+n s m∈Mk+n weight. Most importantly, a simplified belief bsk+n is calculated using only a subset of hypotheses, i.e. without calculating the posterior belief bk+n . Using Algorithm 2 given a decision rule R, with no overlap between bounds (14), and a heuristic h, e.g. which chooses hypotheses greedily based on prior weights, we guarantee the selection of the optimal actions sequence, with respect to the speciﬁc belief tree, while reducing the computational complexity. Case 2. Under budget constraints in planning, the algorithm can use up to C components, in each simplified belief bsk+n , to calculate the bounds in (15). Yet, each subset of components is chosen independently w.r.t. bk+n which develops

Nonmyopic Distilled Data Association BSP Under Budget Constraints

111

Fig. 3. No budget constraints in inference. Colors denote components generated from previous time steps. (a) Planning without budget constraints, the algorithm can choose any subset of components, highlighted in yellow, in each node to evaluate the bounds; (b) With budget constraints in planning, each subset selection is bounded in size by C = 2.

exponentially, i.e. hypotheses chosen in time steps k + n and k + n + i are not necessarily related (see Fig. 3b). In this setting, the number of possible distilled subsets for each bsk+n is |Mk+n | which can be very high. Moreover, there are no guarantees that the C bounds between candidate actions would not overlap. However, using the bounds in (15), our proposed approach can yield the worst-case loss in solution quality, i.e. provide performance guarantees (see Fig. 2b). 4.4

Hard Budget Constraints in Inference

In the previous section we only considered that the belief at the root of the tree is provided from inference. As the posterior beliefs within the constructed belief tree were with an exponentially increasing number of components, i.e. without budget constraints, the key idea was to avoid making explicit inferences. Instead, we calculated bounds that utilized, under budget constraints in planning, a ﬁxed number of components. In practice, however, real world autonomous systems do not work that way. Instead, they are often required to operate in real time using inexpensive hardware with hard computational budget constraints in both inference and planning. Under hard budget constraints on the number of considered hypotheses in inference, the posterior belief in each belief tree node is determined by (4), i.e. ψ | ≤ C under some heuristic hinf . Moreover, once a hypothesis is discarded |Mk+n in time step k it is no longer considered in future time steps. Yet, the decision regarding which components to choose, while calculating the bounds in planning, depends on either if the heuristic in (4) is given or determined within planning. To the best of our knowledge, the latter is a novel concept never considered. Case 3. In this setting we consider the heuristic in (4) to be given within planning, i.e. posterior belief tree nodes exactly represent how the belief would evolve in inference under (4). In contrast to Sect. 4.3, as the number of components does not grow exponentially, we sample future observations according to (8) and construct the belief tree explicitly, i.e. perform inference in each node. Therefore, the planning algorithm can no longer choose any subset of components for each

112

M. Shienman and V. Indelman

Fig. 4. Hard budget constraints in inference. Colors denote components generated from previous time steps. (a) Planning given the heuristic in inference, the algorithm can only evaluate the bounds using components that represent how the belief would evolve in inference; (b) The planning algorithm is free to choose components under any valid heuristics in inference given the budget C. Each selected component in time step k+n+1 must originate from a selected component in time step k + n.

bsk+n , i.e. hypotheses discarded in time step k + n cannot be considered in time step k + n + 1 (see Fig. 4). The bounds over the considered cost are now a function of the belief in the previous time step under (4). Speciﬁcally, we rewrite them as ψ ψ s s ≤ c b ≤ c ¯ b , u , Z , b , u , u , Z , b c bψ k+n−1 k+1 k+n k+n−1 k+1 k+n k+n . k+n−1 k+n k+n−1 (17) These bounds represent a recursive setting in contrast to the bounds in (15). Using our approach iteratively in each time step, reduces the computational complexity of the considered cost function in planning while providing performance guarantees. As each posterior belief is determined by inference (Fig. 4a), performance guarantees are with respect to the given heuristic in inference (4). Case 4. We now relax the assumption that the planning algorithm is conﬁned to the speciﬁc heuristic in (4). Unlike in Case 2, where each subset of components can be used in each node to calculate the bounds, this setting has an additional constraint. We formulate this by representing the bounds from (15) in two consecutive time steps c bk , uk+ , Z(k+1)+ , bsk+n ≤ c (bk+n ) ≤ c¯ bk , uk+ , Z(k+1)+ , bsk+n , c bk , uk+ , Z(k+1)+ , bsk+n+1 ≤ c (bk+n+1 ) ≤ c¯ bk , uk+ , Z(k+1)+ , bsk+n+1 , (18) s s , Mk+n+1 ≤ C and ∀bs,ij ∈ bsk+n+1 ⇒ bs,j ∈ bsk+n , s.t. Mk+n k+n+1

k+n

s,ij s where bs,j k+n denotes the jth hypothesis in the simplified subset bk+n and bk+n+1 denotes the ith hypothesis in the simplified subset bsk+n+1 , originated from bs,j k+n , i.e. as in Fig. 1b. The components chosen in the sequence of bounds (18) which minimizes the loss, w.r.t. the original problem, deﬁne a heuristic hp (see Fig. 7c), which is valid in inference. The heuristic hp can be used with any BSP approach to solve (12) and to reduce computational complexity, using our approach, as described in

Nonmyopic Distilled Data Association BSP Under Budget Constraints

113

Case 3. To the best of our knowledge, leveraging hp is a novel concept. We note that while hp minimizes the loss in planning, it is generally diﬀerent than hinf . As such, the implications of utilizing such heuristic in inference are not straightforward. The study of such mechanism is left for future research. 4.5

The Cost Function

While the formulation thus far was for a general cost function, in this section we focus on active disambiguation of hypotheses. Speciﬁcally, we utilize the Shannon entropy, deﬁned over posterior belief components weights. The cost thus given by Hk+n c (bk+n ) = for a belief bk+n with k+n components is M r r wk+n wk+n r − r∈Mk+n ηk+n log ηk+n , where ηk+n r∈Mk+n wk+n . Similarly, for a sim s s s pliﬁed belief bk+n with Mk+n ⊆ Mk+n the cost is given by Hk+n c bsk+n = s,r s,r − r∈M s wk+n . log wk+n k+n To allow ﬂuid reading, proofs for all theorems and corollaries are given in the supplementary material [18]. Theorem 1. For each belief tree node representing a belief bk+n with Mk+n s components and a subset Mk+n ⊆ Mk+n the cost can be expressed by Hk+n =

m,s wk+n ηk+n s + log Hk+n − m,s ηk+n wk+n

s r∈¬Mk+n

r wk+n log ηk+n

r wk+n ηk+n

,

(19)

s s Mk+n \ Mk+n . where ¬Mk+n

Using Theorem 1, we derive bounds for Hk+n which are computationally more eﬃcient to calculate as we only consider a subset of hypotheses. However, as evaluating ηk+n requires by deﬁnition evaluating all posterior components weights, which we do not have access to, we need to bound this term as well (denoted below as LB [ηk+n ] and UB [ηk+n ]). s Theorem 2. Given a subset of components Mk+n ⊆ Mk+n , the cost term in each belief tree node is bounded by

m,s wk+n LB [ηk+n ] s , (20) LB [Hk+n ] = Hk+n + log m,s UB [ηk+n ] wk+n

m,s wk+n UB [ηk+n ] γ¯ s UB [Hk+n ] = Hk+n + log , − γ¯ log m,s LB [ηk+n ] wk+n |¬Mk+n |

where γ¯ = 1 −

r wk+n s r∈Mk+n U B[ηk+n ]

(21)

s > 2. and ¬Mk+n

Furthermore, considering diﬀerent levels of simpliﬁcations, i.e. adding belief coms ponents to Mk+n , these bounds converge.

114

M. Shienman and V. Indelman

s Corollary 1. The bounds in Theorem 2 converge to Hk+n when Mk+n → Mk+n

lim

s Mk+n →Mk+n

LB [Hk+n ] = Hk+n = UB [Hk+n ] .

(22)

A recursive update rule is given in Section C of the supplementary material [18]. s ⊆ Mk+n , the term ηk+n , in Theorem 3. Given a subset of components Mk+n each belief tree node, is bounded by ⎞ ⎛ n m,s m,s ⎝ |Mk+n | − LB [ηk+n ] = wk+n ≤ ηk+n ≤ wk+n + wkr ⎠ σ i = UB [ηk+n ] , |Mk | s i=1 r∈Mk+n

(23) where σ i max (P (Zk+i |xk+i )) and wkr is the prior weight at time k for every s at time k + n. component in Mk+n As in Theorem 2, since we only consider a subset of hypotheses, these bounds are also computationally more eﬃcient to calculate and converge. We also note that speciﬁcally for Case 3, the bounds in Theorem 2 and Theorem 3 are calculated iteratively in each time step k + n given the belief bψ k+n−1 as presented in (17). s → Mk+n Corollary 2. The bounds in Theorem 3 converge to ηk+n when Mk+n

lim

s Mk+n →Mk+n

LB [ηk+n ] = ηk+n = UB [ηk+n ] .

(24)

A recursive update rule is given in Section C of the supplementary material [18].

5

Results

We evaluate the performance of our approach for the diﬀerent cases presented in Sect. 4. Our prototype implementation uses the GTSAM library [3]. Our considered scenarios represent highly ambiguous environments containing perceptually identical landmarks in diﬀerent locations. In our ﬁrst scenario, floors, the agent is initially located in one of F ﬂoors such that each ﬂoor contains a unique landmark, speciﬁc to that ﬂoor (Fig. 5a). In our second scenario, 2d random, the agent is initially placed in a random environment in front of a blue square (Fig. 5b). Both scenarios can be considered as versions of the kidnapped robot problem. With no other prior information, the initial belief, in both cases, is multi-modal containing |M0 | hypotheses. The agent captures the environment using range measurements containing a class identiﬁer, e.g. red triangle or green square. When the agent receives a measurement to some landmark which is ambiguous, i.e. it can theoretically be generated from more than one landmark, the number of hypotheses grows (see Fig. 5c). The number of identical landmarks can be adjusted to represent higher ambiguity, increasing the number of considered hypotheses. The agent’s goal is to disambiguate between hypotheses by solving the corresponding BSP problem (6) at each planning session using entropy over posterior belief components weights as a cost function.

Nonmyopic Distilled Data Association BSP Under Budget Constraints

115

Fig. 5. (a) The floors environment where F identical ﬂoors represent diﬀerent prior hypotheses. Each ﬂoor contains a unique landmark. The true location of the agent is highlighted in yellow; (b) The 2d random environment with many identical landmarks. The agent is initially placed in front of a blue square with no other prior information; (c) A planning session where ambiguous data association results in two hypotheses denoted by the yellow and blue ellipses.

In our ﬁrst experiment we consider Case 1. We compare our approach with evaluating the cost function over the original belief, i.e. considering every possible future hypothesis. The heuristic in planning chooses the subset of components for each belief tree node greedily based on prior weights at time k. The decision rule R was set as no overlap, i.e. no loss with guaranteed optimal solution. The computational merits of our approach are presented in Fig. 6. Moreover, in Fig. 6c, f we can see that with a longer planning horizon the subset of hypotheses used for disambiguation becomes smaller. As more observations are utilized along the horizon, it is easier to discard wrong hypotheses in our considered cases. In our second experiment we consider Case 2. In Fig. 7a we present the loss as a function of the budget size. As expected, with higher budget constraints the loss in solution quality becomes smaller. Moreover, as can be seen in Fig. 7b the loss is higher closer to the root of the belief tree, as bounds are accumulated in the non-myopic setting, increasing the overlap. Considering Case 3, our experiments did not show any computational improvements between calculating the original cost function and using our approach. We indicate that this is because there is no exponential growth in the number of hypotheses within the horizon and our considered cost function is linear w.r.t. the number of components. However, as seen in [23], using a diﬀerent cost, which is beyond the scope of this work, our approach can reduce the computational complexity while providing guarantees in Case 3 as well. Finally, we consider Case 4. We ﬁrst report that under this setting the computational complexity is high as every possible heuristic under the given budget is considered. In Fig. 7c preliminary results indicate that this process can improve the bounds over the loss in solution quality vs a given heuristic hinf .

116

M. Shienman and V. Indelman

Fig. 6. Case 1 study for floors and 2d random environments. All scenarios presented carry zero loss. (a), (d) Planning time as a function of the planning horizon. In both environments, all settings the considered 4 prior hypotheses; (b), (e) Planning time as a function of the number of prior hypotheses. In both environments, all settings considered a planning horizon of 3; (c), (f) % components used to calculate bounds in each level of the belief tree. Circles scales are normalized as the number of nodes grows exponentially going down the tree.

Fig. 7. (a) Normalized loss as a function of the size of the budget in Case 2. In all settings the number of ﬂoors, i.e. prior hypotheses was set to 12; (b) Normalized loss along the depth of a belief tree in Case 2 with C = 3 components and a planning horizon of n = 2; (c) Normalized loss as a function of the size of the budget in Case 4, i.e. considering every valid heuristic in inference hp candidate . When C ≤ 6, the heuristic hp induces a smaller loss than hinf . When C > 6, both hp and hinf induce zero loss, i.e. are optimal in this setting.

6

Conclusions

In this work we introduced ND2A-BSP, an approach to reduce the computational complexity in data association aware BSP with performance guarantees for the nonmyopic case. We rigorously analyzed our approach considering diﬀerent settings under budget constraints in inference and/or planning.

Nonmyopic Distilled Data Association BSP Under Budget Constraints

117

Furthermore, future research will consider how to utilize information from planning in inference when the latter is subject to hard computational budget constraints, as in most real-world autonomous systems.

References 1. Agarwal, S., Tamjidi, A., Chakravorty, S.: Motion planning for active data association and localization in non-gaussian belief spaces. In: Algorithmic Foundations of Robotics XII. SPAR, vol. 13, pp. 288–303. Springer, Cham (2020). https://doi. org/10.1007/978-3-030-43089-4 19 2. Carlone, L., Censi, A., Dellaert, F.: Selecting good measurements via L1 relaxation: a convex approach for robust estimation over graphs. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2667–2674. IEEE (2014) 3. Dellaert, F.: Factor graphs and GTSAM: a hands-on introduction. Technical report, GT-RIM-CP&R-2012-002, Georgia Institute of Technology (2012) 4. Elimelech, K., Indelman, V.: Simpliﬁed decision making in the belief space using belief sparsiﬁcation. Int. J. Robot. Res. (2021). arXiv: https://arxiv.org/abs/1909. 00885 5. Hsiao, M., Kaess, M.: MH-iSAM2: multi-hypothesis iSAM using Bayes tree and hypo-tree. In: IEEE International Conference on Robotics and Automation (ICRA) (2019) 6. Hsiao, M., Mangelson, J.G., Suresh, S., Debrunner, C., Kaess, M.: Aras: ambiguityaware robust active slam based on multi-hypothesis state and map estimations. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5037–5044. IEEE (2020) 7. Indelman, V., Nelson, E., Dong, J., Michael, N., Dellaert, F.: Incremental distributed inference from arbitrary poses and unknown data association: using collaborating robots to establish a common reference. IEEE Control Syst. Mag. (CSM) 36(2), 41–74 (2016) 8. Indelman, V., Nelson, E., Michael, N., Dellaert, F.: Multi-robot pose graph localization and data association from unknown initial relative poses via expectation maximization. In: IEEE International Conference on Robotics and Automation (ICRA) (2014) 9. Jiang, F., Agrawal, V., Buchanan, R., Fallon, M., Dellaert, F.: iMHS: an incremental multi-hypothesis smoother. arXiv preprint arXiv:2103.13178 (2021) 10. Kaess, M., Johannsson, H., Roberts, R., Ila, V., Leonard, J., Dellaert, F.: iSAM2: incremental smoothing and mapping using the Bayes tree. Int. J. Robot. Res. 31(2), 217–236 (2012) 11. Kitanov, A., Indelman, V.: Topological multi-robot belief space planning in unknown environments. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 5726–5732 (2018) 12. Kschischang, F., Frey, B., Loeliger, H.A.: Factor graphs and the sum-product algorithm. IEEE Trans. Inform. Theory 47(2), 498–519 (2001) 13. Lajoie, P.Y., Hu, S., Beltrame, G., Carlone, L.: Modeling perceptual aliasing in slam via discrete-continuous graphical models. IEEE Robot. Autom. Lett. (RA-L) 4(2), 1232–1239 (2019) 14. Olson, E., Agarwal, P.: Inference on networks of mixtures for robust robot mapping. Int. J. Robot. Res. 32(7), 826–840 (2013)

118

M. Shienman and V. Indelman

15. Pathak, S., Thomas, A., Indelman, V.: A uniﬁed framework for data association aware robust belief space planning and perception. Int. J. Robot. Res. 32(2–3), 287–315 (2018) 16. Shelly, O., Indelman, V.: Hypotheses disambiguation in retrospective. IEEE Robot. Autom. Lett. (RA-L) 7(2), 2321–2328 (2022) 17. Shienman, M., Indelman, V.: D2A-BSP: distilled data association belief space planning with performance guarantees under budget constraints. In: IEEE International Conference on Robotics and Automation (ICRA) (2022) 18. Shienman, M., Indelman, V.: Nonmyopic distilled data association belief space planning under budget constraints. Technical report, Technion - Israel Institute of Technology (2022). https://tinyurl.com/4h3b9mz8 19. Shienman, M., Kitanov, A., Indelman, V.: FT-BSP: focused topological belief space planning. IEEE Robot. Autom. Lett. (RA-L) 6(3), 4744–4751 (2021) 20. Silver, D., Veness, J.: Monte-Carlo planning in large POMDPs. In: Advances in Neural Information Processing Systems (NIPS), pp. 2164–2172 (2010) 21. Sunderhauf, N., Protzel, P.: Towards a robust back-end for pose graph slam. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 1254– 1261. IEEE (2012) 22. Sztyglic, O., Zhitnikov, A., Indelman, V.: Simpliﬁed belief-dependent reward mcts planning with guaranteed tree consistency. arXiv preprint arXiv:2105.14239 (2021) 23. Sztyglic, O., Indelman, V.: Online pomdp planning via simpliﬁcation. arXiv preprint arXiv:2105.05296 (2021) 24. Ye, N., Somani, A., Hsu, D., Lee, W.S.: DESPOT: online POMDP planning with regularization. JAIR 58, 231–266 (2017)

SCIM: Simultaneous Clustering, Inference, and Mapping for Open-World Semantic Scene Understanding Hermann Blum1(B) , Marcus G. M¨ uller1,2 , Abel Gawel3 , Roland Siegwart1 , and Cesar Cadena1 1

Autonomous Systems Lab, ETH Z¨ urich, Z¨ urich, Switzerland [email protected] 2 German Aerospace Center (DLR), Munich, Germany 3 Huawei Research, Z¨ urich, Switzerland

Abstract. In order to operate in human environments, a robot’s semantic perception has to overcome open-world challenges such as novel objects and domain gaps. Autonomous deployment to such environments therefore requires robots to update their knowledge and learn without supervision. We investigate how a robot can autonomously discover novel semantic classes and improve accuracy on known classes when exploring an unknown environment. To this end, we develop a general framework for mapping and clustering that we then use to generate a self-supervised learning signal to update a semantic segmentation model. In particular, we show how clustering parameters can be optimized during deployment and that fusion of multiple observation modalities improves novel object discovery compared to prior work. Models, data, and implementations can be found at github.com/hermannsblum/scim. Keywords: Self-supervised learning · Semantic segmentation Self-improving perception · Semantic scene understanding

1

·

Introduction

Robots that automate tasks such as household work, hospital logistics, assistive care, or construction work have to operate in environments that are primarily designed for humans. Moreover, all these tasks have a high level of complexity that requires semantic scene understanding [1]. Semantics in human environments are open-world. They have domain gaps, contain novel objects, and change over time. For robots to operate autonomously in such environments, they need to be able to deal with such changes and novelties. This requires a methodological shift from deploying models trained on ﬁxed datasets to enabling This paper was ﬁnancially supported by the HILTI Group.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25555-7 9. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 119–135, 2023. https://doi.org/10.1007/978-3-031-25555-7_9

120

H. Blum et al.

Fig. 1. Example predictions from a segmentation network trained with our method (third column), compared to the predictions of a pretrained model (second column) and the ground-truth label (ﬁrst column). The base model did not see the outlier class during training. These novel objects are discovered autonomously, and are therefore not labelled by a word but by their cluster id (e.g., c1, c2, ...).

robots to learn by themselves, building up on advancements in self-supervised learning and continual learning. It is the robotic version of ideas like ‘learning on the job’ [2], open-world object detection [3] or segmentation [4], and is related to the idea of developmental robotics [5]. This work investigates scenarios where robots should perform semantic scene understanding in unknown environments that contain novel object categories. We show how robots can improve their semantic segmentation in these new environments on both known and unknown categories by leveraging semantic mapping, uncertainty estimation, self-supervision, and clustering. We call the investigated problem ‘Simultaneous Clustering, Inference, and Mapping’ (SCIM). The approaches we investigate work fully autonomously without human supervision or intervention. While fusion of predictions and discovery of novel objects has also been investigated in the context of semantic mapping [6,7], maps are always bound to one point in time and one speciﬁc environment. Instead, segmentation networks can carry knowledge into diﬀerent environments. Prior work has shown that combining self-supervised pseudo-labels with continual learning can integrate knowledge gathered over multiple environments in closed-world [8], which would not be possible with mapping alone. Therefore, we investigate promising self-supervision signals for open-world class-incremental learning.

Simultaneous Clustering, Inference, and Mapping

121

Based on the motivation to deploy robots to human environments, we focus this work on indoor scenes. Given a trajectory in the unknown environment, the robot collects observations with a RGB-D sensor in diﬀerent modalities. These encompass (1) predictions of the base segmentation model and their uncertainties, (2) deep features of the segmentation network and other image-based networks, (3) a volumetric map of the environment, and (4) geometric features extracted from that map. Subsequently, we categorize these observations through clustering and integrate everything back into the semantic segmentation network by training it with pseudo-labels. We obtain an updated network that can identify novel semantic categories and has overall higher prediction accuracy in the environment, as shown in Fig. 1. In summary, our contributions are: – A novel framework to map, discover, represent, and integrate novel objects into a semantic segmentation network in a self-supervised manner. – An algorithm that leverages prior knowledge to optimise the clustering parameters for ﬁnding representations of novel objects. – We provide the ﬁrst open-source available method implementations and, while not at the scale of a benchmark, develop metrics and evaluation scenarios for open-world semantic scene understanding.

2

Related Work

Novel object discovery describes an algorithm’s ability to account for unknown object classes in perception data. Grinvald et al. [7] showed a semantic mapping framework that was able to segment parts of the scene as ‘unknown object’, but without the ability to categorize these. Nakajima et al. [9] were one of the ﬁrst to demonstrate semantic scene understanding that can identify novel objects. They rely on superpixel segmentation, mapping, and clustering to identify object categories. Hamilton et al. [10] showed fully unsupervised video segmentation based on a similar framework like the one we describe in Sect. 3.1. Both [9] and [10] cluster a whole scene into semantic parts without the ability to relate a subset of clusters to known labels. Uhlemeyer et al. [11] recently demonstrated that based on advancement in out-of-distribution detection, a segmentation can be split into inliers and outliers. They cluster only the outliers into novel categories, but based on features that were already supervised on some of their outlier classes. In contrast to our method, theirs is therefore not fully unsupervised and further lacks the capability to integrate multiple observation modalities. Clustering for classiﬁcation can be understood as a two part problem. First, (high dimensional) descriptors for the items in question have to be found. Then, a clustering algorithm groups similar descriptors together. The established approach in representation learning is to learn a single good descriptor that can be clustered with kNN or k-means [12]. K-means can be used with mini-batches, is diﬀerentiable, fast, and easy to implement. However, we argue that there are two big disadvantages: it requires a priori knowledge of the number of clusters k and only works in the space of a single descriptor. For a scenario in which a robot is deployed to unknown environments, it cannot know the number of novel

122

H. Blum et al.

object categories. It is also questionable whether a single descriptor will be able to well describe all parts of the unknown environment. Graph clustering algorithms like DBSCAN [13] or HDBSCAN [14] can cluster arbitrary graphs and do not need to know the number of clusters beforehand. Their hyperparameters are instead related to the values of the connectivity matrix. These graph edges are independent of any speciﬁc descriptor space and can e.g. also be an average over multiple descriptor distances. This makes graph clustering a better candidate when working with more than one descriptor (in clustering literature, this is called ‘multi-view’ clustering [15], which however is a very ambiguous term in scene understanding). Unfortunately, there exist no diﬀerentiable graph clustering algorithms. The closest in the literature is [16], which however links nodes to a single descriptor. [17] proposes a diﬀerentiable multi-descriptor clustering, which however just uses representation learning to link from multiple to one descriptor. Our work therefore investigates how graph clustering can be used to cluster scenes based on multiple descriptors. We further investigate how parameters of graph clustering can be tuned without gradient based optimisation. Continual learning describes the problem of training a single model over a stream of data that e.g. contains increasing amount of classes, shifts in data distribution, etc. New knowledge should be integrated into the model without forgetting old knowledge. Prior work [8,11] has already shown the eﬀectiveness of pseudo-labels in continual learning and other works [18,19] showed techniques for supervised class-incremental semantic segmentation that mitigate forgetting. However, supervision is not available in autonomous open-world operation. We therefore focus this work on the self-supervision part of class-incremental learning, while referring to [18,19] for ways to address or evaluate forgetting.

3

Method

In the following we deﬁne the problem of open-world scene understanding. We ﬁrst take a step back and set up a general formal description of the problem, which we show relates to similar works on segmentation, and enables us to identify the core diﬀerences between the evaluated methods. We then describe how we identify novel object categories, and how we use this identiﬁcation to train the robot’s segmentation network in a self-supervised manner. 3.1

Preliminaries: Scene Understanding as a Clustering Problem

A robot explores an environment and collects over its trajectory camera images and consequently through simultaneous localisation and mapping (SLAM) corresponding poses in relation to the built map. Let G = (V, E) describe a graph where every vertex vi ∈ V is an observed pixel, as illustrated in Fig. 2. For each of these observation vertices vi , the following information is available: – image plane coordinates in the corresponding camera image – 3D coordinates Tmap→vi in the robot map

Simultaneous Clustering, Inference, and Mapping

123

Fig. 2. Illustration of the graph setup. Every node (star) is a pixel in a camera frame. For every node, we also know the projection into 3D and potential correspondences from other frames. We can further relate nodes within and between frames through deep features from networks run on the frames.

– time of observation t(vi ) – semantic prediction pred(vi ) and associated (un)certainty cert(vi ) – any local visual/learned/geometric feature f (vi ) that can be inferred from the corresponding camera image, map location, or additional sensing modalities available on the robot Edges can then in general be found from a function e : V × V → R≥0 : e(vi , vj ) = e(Tvi →vj , f (vi ), f (vj ), t(vi ), t(vj ), pred(vi ), pred(vj )) that, based on distance functions ·, ·, distills the multimodal relations of vi and vj into an edge weight. The semantic interpretation of the scene can then be expressed as the graph clustering problem on G that splits into disjoint clusters Vk . Vk ∀i, j, i = j : Vi ∩ Vj = Ø V = K clusters

The formulation above is very related to conditional random ﬁelds (CRFs) and the unsupervised segmentation loss from Hamilton et al. [10]. They show that this graph structure represents a Potts problem [20]. Brieﬂy summarised: If φ : V → C (softly) assigns vertices to clusters and μ : C × C → R sets a cost for the comparison between two assignments, e.g. the cross-entropy, the graph clustering problem introduced above minimizes the following energy term: e(vi , vj )μ(φ(vi ), φ(vj )) E(φ) = vi ,vj ∈V

In the most simple case of perfect (i.e. ground-truth) predictions, const. if pred(vi ) = pred(vj ) e(vi , vj ) = 0 otherwise is a disjoint graph with each cluster matching one label. Often however, predictions are not perfect and techniques like volumetric semantic mapping can be used to ﬁlter noisy predictions by enforcing that a voxel

124

H. Blum et al.

should have the same class regardless of the viewpoint. This roughly corresponds to the problem of clustering G with e.g. ⎧ ⎨ 1 if vi , vj in same voxel e(vi , vj ) = < 1 if pred(vi ) = pred(vj ) ⎩ 0 otherwise where some approaches also take uncertainty or geometry into account. 3.2

Identifying Novel Categories

To identify novel categories in a scene, observation vertices vi need to be clustered based on commonalities that go beyond position in the map and prediction of a pretrained classiﬁer, because by deﬁnition the pretrained classiﬁer will not be able to identify novel categories. As the predominant current approach in category prediction is to identify categories visually, we also follow this approach to cluster observations into novel categories based on a range of visual descriptors. In the above introduced graph clustering framework, this means: wd fd (vi ), fd (vj ) (1) eours (vi , vj ) = d∈descriptors

with wd the weight for each descriptor and wd = 1. Note that Eq. (1) is a generalisation of diﬀerent related work. For example, Uhlemeyer et al. [11] links observations only by the feature of an Imagenet pretrained ResNet and Nakajima et al. [9] weight features from the pretrained segmentation network and geometric features based on the entropy of the classiﬁcation prediction h(vi ). euhlemeyer (vi , vj ) = tSNE(PCA(fimgnet (vi ))) − tSNE(PCA(fimgnet (vj ))) 2 enakajima (vi , vj ) = (1 − h(vi ))fsegm (vi ) − (1 − h(vj ))fsegm (vj ) 2 + h(vi )fgeo (vi ) − h(vj )fgeo (vj ) 2 where tSNE(PCA(·)) is a dimensionality reduction as described in [11]. Optimisation of Clustering Parameters To solve the clustering problem of G(V, E), diﬀerent hyper parameters Θ have to be found. This includes parameters of the clustering algorithm and the weights wd of the diﬀerent descriptors. The choice of these parameters governs the ‘grade of similarity’ that is expected within a cluster, i.e. how ﬁne-grained categories should be. In general, choosing these parameters is very hard because the choice has to hold for unknown objects and scenes. In this work, we therefore propose to solve the parameter choice by optimisation. In particular, we propose to use the subset V˜ ∈ V of observations where the prediction pred(v) out of one of the known classes 1, ..., K has high certainty ∀v ∈ V˜ : cert(v) > δ and ﬁnd the hyperparameters Θ as follows: Θ = argmax mIoU ({V1 , ..., VK }, {pred(v) = 1, ..., pred(v) = K}) ˜ v∈V

(2)

Simultaneous Clustering, Inference, and Mapping

125

Fig. 3. Overview of the steps for self-supervised, class-incremental learning. The method is described in Sect. 3.2 and implementation details in Sect. 4.1.

where mIoU is the mean intersection over union (IoU) of the best matching between the clustering {V1 , ..., VK } and the predictions of the pretrained classiﬁer. As shown in Sect. 2, there exists no graph clustering algorithm that is differentiable either to its input or its parameters. Equation (2) is therefore not optimisable based on gradients. Hence, we employ black-box optimisation that models the clustering algorithm as a gaussian process Θ → mIoU. Based on the ‘skopt’ library [21], samples of Θ are choosen to cover the optimiation space but favoring areas where good mIoU is expected based on previous measurements. After 200 iterations, we select the point with the best mIoU. Subsampling of the Clustering Graph Usually G(V, E) is too large to eﬃciently compute the clustering problem. With an already low image resolution of 640 × 480 and a frame rate of e.g. 20 Hz, 2 min of camera trajectory correspond to |V | > 7e9. We therefore rely on random subsampling, taking 100 random points on every 5th frame, to create a smaller problem that is still representing the whole scene. Such subsampling removes redundancies in observations of neighboring pixels and subsequent frames, but also exaggerates noise that would otherwise average out over more data points. We hence choose parameters to the maximum possible with the available memory. Subsampling however comes with the challenge that there is no direct clustering solution for all v ∈ V . We therefore combine the subsampled graph clustering with nearest neighbor search. Given a clustering for a subsampled part of V , we assign every v either to its cluster, if it was part of the graph clustering, or to the cluster of the nearest neighbor according to eours (vi , vj ). Note that prior work reduces computationally complexity by ﬁrst segmenting the scene into superpixels or segments and then clustering these. This approach comes with other challenges, notable how to attribute features to segments and how to ensure segments are not merging independent objects. With our subsampling, we test an alternative approach. Self-supervised Class-Incremental Training To adapt the robot’s perception to a novel scene, on both known and unknown categories, we leverage the above described method as part of a larger system that is outlined in Fig. 3. Its goal is to produce a useful learning signal without

126

H. Blum et al.

any supervision, i.e. purely based on the observations the robot makes itself 1 . With such a learning signal, we ﬁnetune the base segmentation model 2 with the goal of improving the predictions in the given scene on the next trajectory. As a volumetric mapping framework 3 , we use the implementation from [22] without the panoptic part. This framework performs volumetric mapping and integration of semantic predictions per voxel. We extend this framework to also integrate uncertainties associated with the predictions as average per voxel. For pose-estimation, we rely on the poses provided with the data, which are obtained through visual-inertial mapping and bundle adjustment [23]. Because the semantic map integrates many predictions from diﬀerent viewpoints, it cancels out some noise from single frame predictions and has a higher overall accuracy. This was used in [24] as a learning signal for scene adaptation, but assuming that all classes are known. Similar to [24], we also render 4 the semantic map back into each camera pose, obtaining an improved semantic prediction for every frame. We additionally render the averaged uncertainty, which indicates which parts of the scene are reliable predictions on inlier classes (low average uncertainty) and which parts are either novel categories or unfamiliar known categories (both high uncertainty), as shown in Fig. 5. Given a clustering solution 5 , we produce pseudo-labels for training by merging renderings from the semantic map with clustering-based predictions 6 . We ﬁrst identify clusters with large overlap to predicted categories by measuring the contingency table between the semantic classes in the map and the clustering. We merge clusters that have an IoU > .5 with the corresponding predicted class. All other clusters are considered novel categories. We then assign for each pixel in each camera frame, i.e. each vi , either (i) the rendered semantic class from the map if the average map uncertainty is below a threshold δ or (ii) the assigned cluster if the uncertainty of the prediction is higher than δ. Essentially, our pseudolabels therefore identify unknown parts of a scene and produce a learning signal to perform domain adaptation for the known parts of the scene and novel object discovery for the unknown parts of the scene. To train the model on the pseudo labels 7 , we need to extend its last layer to accomodate the newly identiﬁed categories. We do this by increasing the kernel and bias of the last layer from RF ×C to RF ×(C+C ) where F is the feature size, C is the number of known classes, and C is the number of novel detected clusters. We then initialise these matrixes with standard random initialisation for the new rows and with the base model’s parameters for the already existing rows.

4

Experimental Evaluation

We investigate the setting where a robot is put into a new environment that contains objects it has never seen. We choose the ScanNet dataset [23] of RGB-D trajectories in real indoor environments. Despite multiple errors in the semantic annotations, ScanNet had the highest data quality when we searched for datasets containing RGB-D trajectories, poses, and semantic annotations. The dataset is split into scenes (usually one room), where each scene may include multiple

Simultaneous Clustering, Inference, and Mapping

127

trajectories. We select television, books, and towel as outlier classes, which the segmentation models must not have seen during training. After data quality ﬁltering to e.g. account for incorrect annotations, we end up with 3 scenes with TVs, 1 scene with books and 2 scenes with towels (out of the ﬁrst 100 validation scenes). Compared to prior work that tests on a total of 360 images [9] and 951 images [11], we therefore test on signiﬁcantly more data with total of 15508 frames and at least 1000 frames per trajectory. 4.1

Method Implementation

We run two variants of our method: One is only taking self-supervised information as input, i.e. features of the segmentation network itself, self-supervised visual features from DINO [12], and geometric features. The second variant (following [11]) is also fusing visual information of a ResNet101 trained on Imagenet as input. Imagenet features are obtained by supervised training on a wide range of classes, including the ones we consider as outliers in our experimental setting, so they cannot be considered self-supervised. We obtain geometric features by running the provided model of [25], which trains a descriptors to register 3D point clouds, on the surface point cloud of the voxel map. We extract features of our segmentation network at the ‘classiﬁer.2’ layer, which is one layer before the logits. As Imagenet features, we take the output of the last ResNet block before ﬂattening, such that the features still have spatial information (‘layer4’ in pytorch). From DINO, we take the last token layer that still has spatial relations. We normalise these descriptors to a l2 norm of 1 and calculate pairwise euclidean distances. We then harmonize the scale of diﬀerent feature distances by ﬁnding scalar factors α such that p(α ∗ |vi − vj | < 1) = .9 for vi , vj that have low uncertainty and the same predicted class. For clustering, we use HDBScan [14]. This is an improved version of DBScan that is designed to deal better with changing densities of the data and we found it in general to perform slightly more reliable. As segmentation network, we use a DeepLabv3+ trained on COCO and then on ScanNet, but not on the scenes or objects categories we test on. We employ standard image augmentation and random crops. For uncertainty estimation, we take the method with the best simplicity-performance trade oﬀ from the Fishyscapes benchmark [26]: We use the max-logit value of the softmax, including the post-procesing introduced in [27], but without their standardisation step. 4.2

Adaptation of Baselines

In addition to the full optimisation based approach above, we want to test concepts from related work. Unfortunately, neither [9] nor [11] made their implementation available, which is why we implement both methods and adapt them to our evaluation setting. It is important to note that these are not direct replications or reproductions, but we take as many ideas from these papers as possible and combine them with the system above to get the best result.

128

H. Blum et al.

For Nakajima et al. [9], we implement their clustering procedure, but take the same segmentation network, mapping framework, and geometric features as for our method. Since we could not ﬁnd all clustering parameters in their paper, we run our proposed parameter optimisation on the inﬂation and η parameter of the MCL clustering for every scene. Because MCL takes longer than HDBScan, we needed to use stronger subsampling. Given that our geometric features are very diﬀerent than the ones used in [9], we also evaluate a variant that only uses the features of the segmentation network. In [9], this had slightly worse performance. For Uhlemeyer et al. [11], their underlying meta-segmentation is released as open-source. However, their uncertainty estimation is only available for an urban driving network, so we replace it with the uncertainty metric that we also use for our method. Also this paper does not report its clustering parameters. We however cannot run our proposed parameter optimisation for this method, because it only clusters the outliers. We therefore, following advice of the authors, hand-tune the parameters until we get a good result for scene 0354 and use these settings ( = 3.5, min samples = 10) for all scenes. 4.3

Evaluation Metrics

The evaluation protocol in prior work is not well documented [9] or limited to one novel cluster [11]. We therefore describe our protocol in more detail. Classincremental learning is usually supervised and therefore evaluated with a standard confusion matrix [28]. In unsupervised representation learning literature, the number of clusters is usually set to the number of labels, such that the Hungarian Algorithm can ﬁnd the optimal matching of clusters and labels [10,12]. In our problem setting, these assumptions do not hold. Parts of the predicted classes are trained in a supervised manner and only the novel categories are found in an unsupervised way. Because parts of the scenes contain outlier objects without annotation, we can also not punish a method for ﬁnding more categories than there are labels. To match clusters to existing labels, we ﬁrst count how often which cluster is predicted for which label in the contingency matrix of size Nlabels × Nclusters on the labelled part of the scene. We then use a variant of the Hungarian algorithm [29] that ﬁrst pads the matrix with zeros to a square matrix and further disallows to assign predictions with a supervised label to another label (they may however be disregarded entirely if an unlabelled cluster is a better match). If no prediction has a supervised label, this is equivalent to the established Hungarian matching from representation learning. If all predictions have supervised labels, this is equivalent to the standard confusion matrix. We also report the v-score [30], which is independent of label-cluster matching. It is the harmonic mean over two objectives: All points in a cluster belong to the same label, and all predictions of a label come from the same cluster. 4.4

Results

We present the main results of this study in Table 1. Qualitative examples can be found in the video attachment. In general, we observe that all methods are

Simultaneous Clustering, Inference, and Mapping

129

Fig. 4. Predictions on unlabelled parts of the scene reveal that the adapted network detects more novel classes than the labels allow to measure.

Fig. 5. Uncertainty in the volumetric maps, measured as the max-logit (lower is more uncertain). As expected, the outlier objects towel (a), books (b), and TV (c) have high uncertainty. Note that some other objects like the ladder in (a) that is also shown in Fig. 4 have high uncertainty.

able to detect the novel object categories to a certain degree, except for variants of nakajima in scene 0598. We also observe that the full nakajima variant with segmentation and geometric features performs poorly, but note that the method was originally designed for diﬀerent geometric features and the segmentationonly variant performs much better. Those methods that adapt the segmentation model through training (ours and uhlemeyer) in all cases improve performance over the base model. Where a second trajectory of the environment is available, we can also verify that this is a true improvement of the segmentation and not overﬁtting on the frames. Notably, our fully unsupervised variant1 is better than all supervised methods in 2 scenes and competitive in the other scenes. This indicates that the use of supervised ImageNet pretraining is limited and very useful features can instead be learned unsupervisedly from any environment in open-world deployment. As discussed in Sect. 2, graph clustering does not require a priori knowledge of the number of classes. As such, Fig. 4 shows examples of objects that are not measureable as outliers, yet got discovered by the algorithms and predicted as a new category by the trained network. In scenes like 0568 or 0164, no investigated approach is able to discover the outlier class. Especially small objects and cluttered scenes pose big challenges. We conclude that our approach that is fusing multiple sources of information, especially the unsupervised variant, shows the most consistent performance over the listed scenes, but there is no approach that is the best in every scene. 1

‘unsupervised’ refers to novel classes. All tested methods are supervised on the known classes.

130

H. Blum et al.

Table 1. Model predictions in [% mIoU] for diﬀerent scenes and diﬀerent outlier classes. We mark the best unsupervised method and the best overall. The available unsupervised information are the base model’s features (segm.), geometric features (geom.), and DINO [12] features (dino). Some methods however use features from supervised ImageNet training (imgn.). Note that ‘nakajima’ is a pure clustering method, so we measure the clustering instead of model predictions. For those scenes where a second trajectory is available, we evaluate trained models on the second trajectory. Outlier Tv

Tv

Tv

Books

Towel

Towel

Scene 0354

0575

0599

0598

0458

0574

Method

Training traj.

New traj.

Out IoU

mIoU

Out IoU

mIoU

Base model

0

47

–

–

Adapted nakajima: segm. + geom

7

4

–

–

Adapted nakajima: segm. only

9

20

–

–

SCIM fusing segm. + geom. + dino

73

62

–

–

Adapted uhlemeyer: imgn

65

60

–

–

Adapted uhlemeyer: imgn. + map

21

56

–

– –

SCIM fusing segm. + geom. + imgn

41

54

–

Base model

0

53

0

53

Adapted nakajima: segm. + geom

10

11

–

–

Adapted nakajima: segm. only

24

17

–

–

SCIM fusing segm. + geom. + dino

33

64

39

63

Adapted uhlemeyer: imgn

12

55

23

61

Adapted uhlemeyer: imgn. + map

6

62

4

58 63

SCIM fusing segm. + geom. + imgn

50

69

28

Base model

0

60

0

63

Adapted nakajima: segm. + geom

1

0

–

–

Adapted nakajima: segm. only

36

26

–

–

SCIM fusing segm. + geom. + dino

37

67

36

63

Adapted uhlemeyer: imgn

42

71

55

74

Adapted uhlemeyer: imgn. + map

11

70

10

68 62

SCIM fusing segm. + geom. + imgn

32

65

31

Base model

0

59

–

–

Adapted nakajima: segm. + geom

1

1

–

–

Adapted nakajima: segm. only

0

0

–

–

SCIM fusing segm. + geom. + dino

43

77

–

–

Adapted uhlemeyer: imgn

29

63

–

–

Adapted uhlemeyer: imgn. + map

29

70

–

– –

SCIM fusing segm. + geom. + imgn

33

74

–

Base model

0

48

0

38

Adapted nakajima: segm. + geom

6

7

–

–

Adapted nakajima: segm. only

63

44

–

–

SCIM fusing segm. + geom. + dino

33

57

37

47

Adapted uhlemeyer: imgn

38

61

47

49

Adapted uhlemeyer: imgn. + map

38

59

46

48 55

SCIM fusing segm. + geom. + imgn

60

63

79

Base model

0

45

–

–

Adapted nakajima: segm. + geom

22

7

–

–

Adapted nakajima: segm. only

4

27

–

–

SCIM fusing segm. + geom. + dino

28

52

–

–

Adapted uhlemeyer: imgn

23

50

–

–

Adapted uhlemeyer: imgn. + map

f

f

–

–

SCIM fusing segm. + geom. + imgn

39

53

–

–

Simultaneous Clustering, Inference, and Mapping

131

Table 2. Ablation of diﬀerent information sources on scene 0354 00. Listed in ‘information’ are the weights of diﬀerent feature distances. Variant

Information

Single class IoU

mIoU v score

3D Segm Imgn Geom Wall Floor Chair Table Door Window Tv Whiteboard Base model

Clustering

–

–

–

81

72

67

84

30

12

0

30

47

62

Semantic map –

–

–

82

78

73

89

33

8

0

51

52

68

Seg only

–

1

–

–

51

5

11

3

0

0

28 0

12

27

Seg + imgn

–

.69

.31

–

32

32

47

27

10

1

27 7

23

40

1

–

–

83

78

73

91

37

21

42 20

56

68

.69

.31

–

84

78

73

91

37

19

43 21

56

69

Pseudolabel Seg only Seg + imgn

–

Table 3. Ablation of diﬀerent information sources on scene 0574 00. Listed in ‘information’ are the weights of diﬀerent feature distances. Variant

Information

Single class IoU

mIoU v score

3D Segm Imgn Geom Wall Floor Cabinet Door Mirror Ceiling Towel Sink

Clustering

Base model

–

–

–

–

53

76

76

18

10

61

0

62

45

56

Semantic map

–

–

–

50

83

77

11

14

60

0

57

44

59

Seg only

–

1

–

–

37

33

0

0

0

0

0

0

9

25

Seg + geo

–

.94

–

.06

23

31

71

26

22

19

12

19

28

34

Seg + imgn + geo –

.64

.15

.21

31

22

0

11

0

0

0

0

8

23

1

–

–

43

80

50

11

0

58

4

84

41

48

.94

–

.06

48

80

72

11

23

58

16

84

49

52

.15

.21

48

80

50

11

5

58

32

84

46

53

Pseudolabel Seg only Seg + geo

Seg + imgn + geo .64

4.5

Design Choice Verification

We now evaluate diﬀerent design choices. In Fig. 5, we show the volumetric maps of diﬀerent environments and the mapped uncertainty estimation. These maps show that the uncertainty in combination with mapping is very eﬀective to identify uncertain parts of the scene. Next to the ‘target’ objects, also other uncertain objects are identiﬁed. It is expected that a model can also be uncertain about known classes, or that more than 1 novel object are present is a scene. We further show this in Fig. 4. As described in Sect. 3.2, we choose clustering parameters based on an optimisation objective. Figure 6 shows an analysis of whether this objective, which only approximates the true clustering performance, correlates with the true performance of the clustering. For this analysis, we subsample every 20th frame and cluster segmentation features only, to then evaluate the full clustering performance at every 5th optimisation step. We observe that the correlation increases in noise for higher mIoU, but in general correlates well enough to disregard bad parameters. Tables 2 and 3 investigate how beneﬁcial multiple sources of information are in the clustering problem. Firstly, the tables show that clustering with the segmentation features does not result in the same performance as the prediction of the same network, indicating that signiﬁcant knowledge about classes is stored in the ﬁnal layer (we extract features before the ﬁnal layer). Over both scenes, we see that multiple sources of information result in better pseudolabels than just the segmentation features. We also see that which information is useful diﬀers

0.4

0.4

0.3

0.3

0.2

0.2

optimizer actual clustering

0.1

0.1

0 0

50 100 150 optimisation steps

200

0 0 0.1 0.2 0.3 0.4 mIoU of clustering

mIoU of optimizer

H. Blum et al.

mIoU

132

Fig. 6. Analysis of the correlation between the optimisation objective (clustering subset measured against high-conﬁdence predictions) and the actual performance of the clustering (full images measured against ground-truth labels).

from scene to scene, motivating our decision to select the weights of the sources in each scene through optimisation. While the combination of all features therefore creates a robust clustering for diﬀerent scenes, Table 3 shows that this can result in a small tradeoﬀ in single scene performance. We assume that this is caused by the increase of the optimisation space for every new feature.

5

Discussion and Outlook

We investigate the problem of semantic scene understanding in unknown environments containing novel objects. We develop a framework of clustering, inference, and mapping that can be used to autonomously discover novel categories and improve semantic knowledge. It generalises over existing work and helps us to create a new method based on black-box optimisation and information fusion. To discover novel categories, our experiments show that unsupervised features are as useful as supervised ones, especially when multiple features are used. We also show that prior knowledge is very helpful, e.g. to optimize parameters. Better (gradient based) optimisation and a less noisy objective function may even improve this mechanism. In general, from the components in Fig. 3, most are able to propagate gradients, opening opportunities for future research to replace more heuristics with deep learning. While we showed that information fusion can be advantageous, it requires graph clustering, which is neither mini-batch compatible nor diﬀerentiable. We do not report runtimes, because all steps after mapping are not required to be online and our implementations are not optimised for this. We can however report that the meta segmentation of [11] and our optimisation usually required multiple hours. Both points show the potential of more eﬃcient clustering. This work did not touch upon the questions of continual learning or active exploration. The ﬁrst asks how to organise class-incremental learning such that more and more semantic categories can be discovered as a robot moves from scene to scene. Prior work as touched on this topic [8,11,19], but none has evaluated self-supervised approaches with multiple classes over multiple consecutive environments. Similarly, the inﬂuence of actively planning trajectories to

Simultaneous Clustering, Inference, and Mapping

133

aid discovery remains to be investigated, with promising results from domain adaptation on known classes indicating that planning is helpful [31,32]. Our experiments had to overcome a lack of available implementations and quality problems in the data. By releasing our segmentation models, implementations of related work, and implementation of our SCIM approach, we aim to accelerate future research on this topic.

References 1. Garg, S., S¨ underhauf, N., Dayoub, F., et al.: Semantics for robotic mapping, R Robot. 8(1– perception and interaction: a survey. Engl. Found. Trendsin 2) (2020). https://doi.org/10.1561/2300000059, https://www.nowpublishers.com/ article/Details/ROB-059 2. Liu, B.: Learning on the job: online lifelong and continual learning. In: AAAI, vol. 34, no. 09 (2020). https://doi.org/10.1609/aaai.v34i09.7079, https://ojs.aaai.org/ index.php/AAAI/article/view/7079 3. Joseph, K.J., Khan, S., Khan, F.S., Balasubramanian, V.N.: Towards open world object detection (2021) 4. Cen, J., Yun, P., Cai, J., Wang, M.Y., Liu, M.: Deep metric learning for open world semantic segmentation (2021) 5. Lungarella, M., Metta, G., Pfeifer, R., Sandini, G.: Developmental robotics: a survey. Connect. Sci. 15(4), 151–190 (2003). https://doi.org/10.1080/09540090310 001655110 6. McCormac, J., Handa, A., Davison, A., Leutenegger, S.: SemanticFusion: dense 3D semantic mapping with convolutional neural networks. Bayesian Forecast. Dyn. Models 22(2) (2016). https://doi.org/10.1007/b94608 7. Grinvald, M., Furrer, F., Novkovic, T., et al.: Volumetric instance-aware semantic mapping and 3D object discovery. IEEE Robot. Autom. Lett. 4(3) (2019). https://doi.org/10.1109/LRA.2019.2923960, https://ieeexplore.ieee.org/ document/8741085/ 8. Blum, H., Milano, F., Zurbr¨ ugg, R., Siegwart, R., Cadena, C., Gawel, A.: Selfimproving semantic perception for indoor localisation. In: Proceedings of the 5th Conference on Robot Learning (2021). https://proceedings.mlr.press/v164/ blum22a.html 9. Nakajima, Y., Kang, B., Saito, H., Kitani, K.: Incremental class discovery for semantic segmentation with RGBD sensing (2019). http://openaccess.thecvf.com/ content ICCV 2019/html/Nakajima Incremental Class Discovery for Semantic Segmentation With RGBD Sensing ICCV 2019 paper.html 10. Hamilton, M., Zhang, Z., Hariharan, B., Snavely, N., Freeman, W.T.: Unsupervised semantic segmentation by distilling feature correspondences. In: ICLR (2022) 11. Uhlemeyer, S., Rottmann, M., Gottschalk, H.: Towards unsupervised open world semantic segmentation (2022) 12. Caron, M., Touvron, H., Misra, I., et al.: Emerging properties in self-supervised vision transformers. arXiv:2104.14294 [cs] (2021) 13. Ester, M., Kriegel, H.-P., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise (1996) 14. Campello, R.J.G.B., Moulavi, D., Sander, J.: Density-based clustering based on hierarchical density estimates. In: Pei, J., Tseng, V.S., Cao, L., Motoda, H., Xu, G. (eds.) PAKDD 2013. LNCS (LNAI), vol. 7819, pp. 160–172. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-37456-2 14

134

H. Blum et al.

15. Fu, L., Lin, P., Vasilakos, A.V., Wang, S.: An overview of recent multi-view clustering. Neurocomputing 402, 148-161 (2020). https://doi.org/10.1016/j.neucom.2020. 02.104, https://www.sciencedirect.com/science/article/pii/S0925231220303222 16. Shah, S.A., Koltun, V.: Deep continuous clustering (2018) 17. Du, S., Liu, Z., Chen, Z., Yang, W., Wang, S.: Diﬀerentiable bi-sparse multi-view co-clustering. IEEE Trans. Signal Process. 69, 4623–4636 (2021). https://doi.org/ 10.1109/TSP.2021.3101979 18. Yu, L., Liu, X., van de Weijer, J.: Self-training for class-incremental semantic segmentation (2020). http://arxiv.org/abs/2012.03362 19. Michieli, U., Zanuttigh, P.: Incremental learning techniques for semantic segmentation (2019). http://openaccess.thecvf.com/content ICCVW 2019/html/TASKCV/Michieli Incremental Learning Techniques for Semantic Segmentation ICCVW 2019 paper.html 20. Potts, R.B.: Some generalized order-disorder transformations. In: Mathematical Proceedings of the Cambridge Philosophical Society (1952). https://doi.org/10. 1017/S0305004100027419, http://www.cambridge.org/core/journals/mathematicalproceedings-of-the-cambridge-philosophical-society/article/some-generalizedorderdisorder-transformations/5FD50240095F40BD123171E5F76CDBE0 21. Head, T., Kumar, M., Nahrstaedt, H., Louppe, G., Shcherbatyi, I.: Scikitoptimize/scikit-optimize (2021). https://zenodo.org/record/1157319 22. Schmid, L., Delmerico, J., Sch¨ onberger, J., et al.: Panoptic multi-TSDFs: a ﬂexible representation for online multi-resolution volumetric mapping and long-term dynamic scene consistency. In: ICRA (2022). http://arxiv.org/abs/2109.10165 23. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: CVPR (2017). http://openaccess.thecvf.com/content cvpr 2017/html/Dai ScanNet RichlyAnnotated 3D CVPR 2017 paper.html 24. Frey, J., Blum, H., Milano, F., Siegwart, R., Cadena, C.: Continual learning of semantic segmentation using complementary 2D-3D data representations. arXiv:2111.02156 [cs] (2021) 25. Gojcic, Z., Zhou, C., Wegner, J.D., Wieser, A.: The perfect match: 3D point cloud matching with smoothed densities. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019). https://doi.org/10.1109/CVPR. 2019.00569, https://ieeexplore.ieee.org/document/8954296/ 26. Blum, H., Sarlin, P.-E., Nieto, J., Siegwart, R., Cadena, C.: The ﬁshyscapes benchmark: measuring blind spots in semantic segmentation. Int. J. Comput. Vision 129(11), 3119–3135 (2021). https://doi.org/10.1007/s11263-021-01511-6 27. Jung, S., Lee, J., Gwak, D., Choi, S., Choo, J.: Standardized max logits: a simple yet eﬀective approach for identifying unexpected road obstacles in urban-scene segmentation (2021) 28. Douillard, A., Chen, Y., Dapogny, A., Cord, M.: PLOP: learning without forgetting for continual semantic segmentation (2020) 29. Munkres - Munkres implementation for Python. http://software.clapper.org/ munkres/#license 30. Rosenberg, A., Hirschberg, J.: V-measure: a conditional entropy-based external cluster evaluation measure. In: Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLPCoNLL) (2007). https://aclanthology.org/D07-1043

Simultaneous Clustering, Inference, and Mapping

135

31. Zurbr¨ ugg, R., Blum, H., Cadena, C., Siegwart, R., Schmid, L.: Embodied active domain adaptation for semantic segmentation via informative path planning. arXiv, Technical report arXiv:2203.00549 (2022) 32. Chaplot, D.S., Dalal, M., Gupta, S., Malik, J., Salakhutdinov, R.R.: SEAL: self-supervised embodied active learning using exploration and 3D consistency. In: Advances in Neural Information Processing Systems, vol. 34 (2021). https:// proceedings.neurips.cc/paper/2021/hash/6d0c932802f6953f70eb20931645fa40Abstract.html

6N-DoF Pose Tracking for Tensegrity Robots Shiyang Lu1 , William R. Johnson III2 , Kun Wang1 , Xiaonan Huang2 , Joran Booth2 , Rebecca Kramer-Bottiglio2 , and Kostas Bekris1(B) 1 Rutgers University, New Brunswick, NJ, USA {shiyang.lu,kun.wang2012,kostas.bekris}@rutgers.edu 2 Yale University, New Haven, CT, USA {will.johnson,xiaonan.huang,joran.booth,rebecca.kramer}@yale.edu

Abstract. Tensegrity robots, which are composed of compressive elements (rods) and flexible tensile elements (e.g., cables), have a variety of advantages, including flexibility, low weight, and resistance to mechanical impact. Nevertheless, the hybrid soft-rigid nature of these robots also complicates the ability to localize and track their state. This work aims to address what has been recognized as a grand challenge in this domain, i.e., the state estimation of tensegrity robots through a marker-less, vision-based method, as well as novel, on-board sensors that can measure the length of the robot’s cables. In particular, an iterative optimization process is proposed to track the 6-DoF pose of each rigid element of a tensegrity robot from an RGB-D video as well as endcap distance measurements from the cable sensors. To ensure that the pose estimates of rigid elements are physically feasible, i.e., they are not resulting in collisions between rods or with the environment, physical constraints are introduced during the optimization. Real-world experiments are performed with a 3-bar tensegrity robot, which performs locomotion gaits. Given ground truth data from a motion capture system, the proposed method achieves less than 1 cm translation error and 3◦ rotation error, which significantly outperforms alternatives. At the same time, the approach can provide accurate pose estimation throughout the robot’s motion, while motion capture often fails due to occlusions. Keywords: Robot perception · Soft robotics

1 Introduction Tensegrity robots are lightweight, deformable platforms that are composed of compressive elements (rods) and flexible tensile elements (e.g., cables [29], springs [14], elastic membranes [1], etc.). This enables tensegrity robots to distribute external forces through deformation and thus prevent damage. Furthermore, by controlling its tensile elements, a tensegrity robot can reshape itself and move in complex environments [29]. An illustration of a 3-bar tensegrity robot, used for the experiments presented in this work, is shown in Fig. 1. Tensegrity robots’ adaptability and compliance have motivated their use in many applications, such as manipulation [16], locomotion [22], morphing airfoil [8] and spacecraft landers [5]. While their versatility is exciting, tensegrity robots are difficult to track, model and control because of their complex dynamics [24]. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 136–152, 2023. https://doi.org/10.1007/978-3-031-25555-7_10

Tensegrity Robot Perception

137

Fig. 1. (a) The 3-bar tensegrity robot used in the accompanying experiments. Each bar has a length of 36 cm. The robot can reshape itself and perform locomotion by changing the length of its actuation cables. Distances between endcaps are measured by on-board elastic stretch sensors. Motion capture markers are used to generate ground truth poses only for the evaluation purpose. (b) Noisy observation from a commodity RGB-D camera poses challenge on vision-based pose estimation. (c) The tensegrity robot rolling on a carpet. The shape of this robot keeps changing during the motion and generates significant self-occlusions. (d) 6-DoF pose estimates for each bar at the corresponding time steps.

In particular, state estimation and tracking are critical for control algorithms that require the current robot state. For tasks, such as bringing the tensegrity to a desired goal configuration, it is necessary to identify the 6D pose of its rigid elements. Reconstructing the trajectory of a tensegrity robot can also be used for sim2real, where the parameters of a simulated model are identified to match real-world trajectories. Such simulated models [19, 30–32] can then be used to generate locomotion policies via RL [25, 26] without the need for many real world experiments. Accurately estimating the state of a tensegrity robot, i.e., the 6-DoF pose of each rod, remains a grand challenge in this domain [24]. Vision-based pose estimation is especially challenging due to noisy observations, self-occlusions, and a large number of degrees of freedom as shown in Fig. 1. This drives the need for improved sensor fusion, physics-aware state estimation algorithms and more stable, on-board sensors [24]. This work proposes a robust approach for markerless, vision-based pose tracking of an N -bar tensegrity robot given as input: 1) RGB-D images from a single, external commodity camera, and 2) measured endcap distances from recently developed, onboard, elastic stretch sensors. The proposed approach has the following key features:

138

S. Lu et al.

• An ICP-like tracking method is used to fuse the multi-modal inputs in an iterative optimization process. An objective function is designed to maximize both unary and binary potentials. Unary potentials reflect how well the mesh models of the rods’ endcaps fit with the observed point cloud. Binary potentials reflect the fitness of predicted endcap distances and the measured ones from on-board sensors. Adaptive weights are designed to balance these two potentials. • A variety of physical constraints between rigid elements of tensegrity robots are introduced in the optimization. These constraints ensure that the estimated poses are physically feasible without relying on a sophisticated physics engine, and further improve pose estimation accuracy. The proposed method is evaluated qualitatively and quantitatively on real-world experiments discussed in Sect. 5, using a 3-bar tensegrity performing locomotion gaits. The method achieves low translation and rotation error when compared against alternatives as well as ground truth data from motion capture. More critically, the approach can provide pose estimates throughout the robot’s motion, while motion capture often fails due to occlusions. A link1 points to a website that highlights the available experiments and shares an open-source dataset with the corresponding trajectories. The hope is that the methodology will prove useful for additional platforms beyond tensegrities that can make use of onboard distance sensors for soft elements and an external RGB-D camera.

2 Related Work The majority of state estimation work on tensegrity robots is based on multiple onboard sensors, such as sensors for contact sensing [7], cable length measurement [28], strut extension [10], and inertial measurement units (IMUs) [4, 7]. However, with on board sensors only, it is hard to reconstruct the state of tensegrity robots in the global frame. To estimate states in the global frame, Caluwaerts et.al [6] proposed a variant of the Unscented Kalman Filter (UKF) algorithm by fusing the measurements from IMU sensors and ranging sensors that are placed at the endcaps of rods and fixed anchors (landmarks) at the testing area. Similarly, a motion capture system could be setup and track the markers placed on endcaps of rods [3] to get their global coordinates. Despite being accurate when all the markers are visible, tracking often fails when markers are occluded. Moreover, since a motion capture system is fixed once calibrated, it constrains the movement of a robot to a small region. More recently, Moldagalieva et al. [17] proposed the first vision-based method to estimate the pose of a tensegrity robot using fiducial markers. In particular, a hemispherical camera module was attached to the base of their robot, so as to track the fiducial markers printed on a triangle-shaped plate fixed at the upper part of the robot. The pose estimates could be further fine-tuned by fusing the cable length measurement using a neural network [15]. However, the requirement for fiducial markers to always be visible to the camera prevents the robot from doing actions like rolling. It also requires a large amount of data for training a deep model. In contrast, the method proposed in this paper detects endcaps of rods based on color images instead of relying on fiducial markers. 1

https://sites.google.com/view/tensegrity-robot-perception.

Tensegrity Robot Perception

139

ICP algorithms [2, 9] have been widely used in the industry for point cloud registration and model matching. A large number of variants [20, 21, 23] have been proposed since the advent of this algorithm for better accuracy and faster convergence. Some extensions of the ICP algorithm can even be applied to track articulated objects, such as human bodies [18] and hands [27]. This proposed work is in line with the direction of those methods but with an objective function and physical constraints that are specifically designed for tensegrity robots.

3 Problem Formulation This work aims to track the pose of an N -bar tensegrity robot, i.e. 6-DoF pose of each rigid element (rod) of that robot, which are connected with flexible tensile elements (cables). An illustration of the robot used for the experimental tests of this paper is shown in Fig. 1. Two types of sensors are considers: 1. An overhead RGB-D camera records the motion of the tensegrity robot; 2. on-board strain sensors are publishing the cable length measurements during the motion. The images and cable length measurements are synchronized. The mesh model M of the rods and endcaps is assumed available and the structure of the robot (i.e., the connectivity between endcaps with cables) is also provided. The endcaps of different rods are painted with different colors and the ranges of their HSV values are known. Then, the objective is to compute the set of poses Pt = {p1t , ..., pN t } ∈ SE(3) of each rod at the current time step t, given: • known parameters for the setup, i.e., the camera’s intrinsic matrix C, the tensegrity’s rod length lrod and diameter drod , • previous rod pose estimates at step t − 1, i.e., Pt−1 = {p1t−1 , ..., pN t−1 } ∈ SE(3), • current measurements at time step t: – the input of an RGB-D image It observing the tensegrity, – cable length measurements Lt = {ltij |(i, j) ∈ E} where i, j are endcap indices and (i, j) ∈ E, if endcap i and endcap j are connected by a cable. To initialize the process, a 2D Region of Interest (RoI), in the form of a bounding box, is manually defined for each endcap at the initial frame I0 of the video sequence.

4 Method and Implementation The following discussion describes step by step how the poses Pt are estimated at each time step. In terms of the visual data and given that the depth measurements of rods can often be very noisy, as shown in Fig. 1b, only the observations of endcaps are used in the proposed state estimation process. 4.1 Pose Initialization The proposed tracking algorithm first establishes an initial pose estimate Pinit given the manually defined image RoIs. This initial estimate does not have to be very accurate, since it will be optimized by consecutive steps of the iterative process. The foreground

140

S. Lu et al.

pixels in each RoI are segmented based on the range of HSV values for that endcap, and then projected to 3D points given the camera intrinsic matrix C and the depth information. The centroid of the observed point cloud of each endcap is computed and used as a rough estimate of the initial endcap position. The initial position of each rod is then estimated by simply using the center of its two endcaps. The initial rotation is estimated using axis-angle representation where the axis is aligned with the the direction from one endcap to the other and the angle is arbitrarily set due to rotational symmetry. 4.2

Iterative Pose Tracking

The proposed method addresses the pose tracking problem through an iterative optimization process where each iteration is composed of a transition step and a correction step. Those two steps are performed alternately until convergence or a preset maximum number of iterations is reached (an empirical number is 6). In the transition step, only the RGB-D observation is used to estimate a local transformation from the previous time step. While in the correction step, a joint constrained optimization is performed given the input of estimated endcap positions and cable length measurements. Those two steps are detailed below. 1) Transition Step The transition model, also known as the action model, aims to provide a 6-DoF local transformation δPti for rods i = 1, 2, .., N , such that the new pose estimate for each i . It is, however, not easy to compute this transformarod to be Pˆti = δPti · Pt−1 tion directly from control gaits, due to the complex dynamics of tensegrity robots. Instead, the local transformation is estimated via scan matching, i.e., the registration of endcap model points with the observed points in the RGB-D data. To prevent unobserved model points (e.g., those on the backside of an endcap) to be registered with the point cloud, this work adopts a hidden point removal method [13] so that only visible points on the model will be used for registration and thus improve accuracy. Correspondence between endcap model points and observed points is established based on nearest neighbors as long as they are within a maximum distance threshold in the 3D space and also within an HSV range in the color space. The maximum distance threshold dmax is designed to be a dynamic value that decreases with the number of iterations. The intuition behind this is that in the first few iterations the region for correspondence search should be large in case of significant movement, while in later iterations it should be small for fine-grained estimates. With known correspondences, a local transformation for each rod can be estimated with the Kabsch algorithm [12], which minimizes the weighted root mean square error (RMSE) between point pairs. The weight of each point pair is computed using the following equation, where d is the distance between a pair of points. This weight is monotonically decreasing when the distance increases and clipped to zero if the distance goes beyond dmax . ⎧ ⎨ 1 − ( d )2 , d ≤ d max dmax (1) w(d) = ⎩ 0, d > dmax

Tensegrity Robot Perception

141

It is often the case, however, that the corresponding points of one endcap are too few for registration, mainly due to occlusions. To handle this problem, a certain number of dummy points (an empirical number is 50) are randomly sampled from observed and model points at the previous time step and pasted to the points in the current time step. This step is effective in preventing arbitrary poses being generated and results in more stable state estimates. It is also arguably better than alternative motion models, such as assuming the rod is not moving at the current time step since the observation of the other endcap can provide meaningful information. The weights of those dummy points are half the value of observed points. 2) Correction Step In the correction step, a joint constrained optimization is performed. The intuition behind this optimization is that the estimated endcap positions are expected to be close to the output of the transition model, and the distance between two connected endcaps should be close to the cable length measurement. Meanwhile, endcaps and rods should satisfy a variety of physical constraints to ensure that the estimated p1t , ..., pˆN 6-DoF poses are valid. Given the current pose estimates Pˆt = {ˆ t } from the transition step and cable length measurements Lt , the objective function shown below is aiming to find a set of optimal endcap positions Q = {qt1 , ..., qt2N }: L(Q) =

2N i

wi ||qti − qˆti ||22 +

wij (||qti − qtj || − ltij )2

(i,j)∈E

(2)

Qoptim = argmin L(Q) Q

ˆ = {ˆ where Q qt1 , ..., qˆt2N } are estimated endcap positions from the transition model and wi , wij are adaptive weights for the unary and binary loss terms respectively. The constraints that are accompanying this objective function and computation of those dynamic weights are elaborated upon in Subsects. 4.3 and 4.4 respectively. Once the positions of endcaps are optimized, the current pose estimates are updated for the next iteration such that the position of the rod is set to be the center of its endcaps and the rotation axis is set to be aligned with the direction from one endcap the other. Different from the initialization step in Subsect. 4.1, the rotation angle is no longer set to be arbitrary but rather to a value that minimizes the geodesic distance in SO(3) from the previously estimated rotation. Despite rotational symmetry, an arbitrary rotation angle can affect downstream tasks that need to compute angular velocity, such as feedback control and system identification. This constraint optimization is performed with sequential least squares programming (SLSQP). 4.3 Physical Constraints 1) Constraints between endcaps Rods are rigid elements of tensegrity robots. The distance between the estimated endcap positions of a rod should be a constant, i.e., the rod length lrod , regardless of time. Assuming that i, j are the two endcaps of a rod, an equality constraint is applied: ||qi − qj || = lrod

142

S. Lu et al.

2) Constraints between rods The estimated poses of rods should not make them collide or even penetrate each other between consecutive frames. Thanks to the simple geometry of the rods, the physical constraints between them can be added to the optimization problem without using a sophisticated simulation engine. Each rod is approximated as a cylinder and its endcaps as two hemispheres. With this approximation, a rod can be parameterized with only two variables: its length lrod and its diameter drod . A collision between rod i and rod j can be easily detected by comparing the rod diameter drod with the rod distance d(cit , cjt ) where (cit , cjt ) is the closest pair of points on the main axis of rod i and rod j at time step t. Rod i and rod j are colliding with each other if d(cit , cjt ) < drod . The closest pair of points can be computed given the rod pose estimates at timestep t. An inequality constraint can be added, i.e. −−→ |cit cjt | ≥ drod

(a)

Meanwhile, the relative pose between two rods are not expected to change dramatically, and it is impossible for one rod to suddenly cross through another between two consecutive frames. An illustration of penetration between two rods is shown −−−−−→ −−→ in Fig. 7. Such penetration can be detected if the angle between cit cjt and cit−1 cjt−1 −−→ −−−−−→ is greater than 90◦ , which can be written as cit cjt · cit−1 cjt−1 < 0. Thus, another inequality constraint can be added, i.e., −−→j −−−−− → cit ct · cit−1 cjt−1 ≥ 0

(b)

Fig. 2. Illustration of the physical constraints between rods. (a) Rod i and rod j at time step t − 1, where the (cit−1 , cjt−1 ) is the closest pair of points. (b) Rod i and rod j with two pose −−→ hypotheses at time step t. Their distance should satisfy the constraint |cit cjt | ≥ drod . (c) Despite that both hypotheses for rod j satisfy the distance constraint, the one in red is physically infeasible −−−−−→ −−→ as the angle between cit cjt and cit−1 cjt−1 is greater than 90◦ , which means rod j crosses through rod i between two consecutive frames.

Nevertheless, it is difficulty to directly apply constraints (a) and (b) despite their simple mathematical formulas. This is because the closest pair of points (cit , cjt ) depends on the relative pose of two rods, and their positions are non-linear with respect to the endcap positions. This problem is addressed by introducing a linear

Tensegrity Robot Perception

143

approximation of the constraints based on the assumption that the change in the closest points on the two rods is negligible in the rod’s local frame between two consec−−−−−→ −−→ utive frames, and the angle θ between cit cjt and cit−1 cjt−1 is small, i.e. cos(θ) ≈ 1. With this assumption, constraints (a) and (b) can be approximated with a single constraint linear to the current endcap positions (c), i.e., −−−−−→ −−→j cit−1 cjt−1 cˆit cˆt · −−−−−→ ≥ drod |cit−1 cjt−1 |

(c)

where cˆit and cˆjt are the same points as cit−1 and cjt−1 in the local frame of rod i and rod j. This linear approximation makes it easier to optimize, and it is shown to be effective in experiments. If (cit−1 , cjt−1 ) are the endpoints of rod i and rod j, then this inequality constraint is not applied (Fig. 2). 3) Constraints between rods and the supporting plane If given the prior knowledge that the tensegrity robot is not moving in a complex environment but instead on the ground, an additional physical constraint can be applied between the rods and the ground to prevent invalid poses that make rods penetrate the ground. A extrinsic transformation from the camera frame to the world ¯ and a translation t¯, can be computed using frame, which is composed of a rotation R plane detection by RANSAC, where the ground is represented as z = 0 in the world frame. The inequality constraint for endcaps at position qti is: ¯ i + t¯) ≥ drod /2, z(Rq t

i = 1, 2, .., 2N

4.4 Adaptive Weights Adaptive weights are applied to balance the unary loss and binary loss in the objective function in the correction step. Let the ratio ri for the endcap i be the number of corresponded points over the total number of rendered points for endcap i assuming no occlusion. The unary weight wi and binary weights wij are computed as follows:

wij =

⎧ ⎪ ⎨ ⎪ ⎩

wi = max(ri ,

0.1)

0, 0.25,

ri > 0.5 and rj > 0.5 ri < 0.2 or rj < 0.2

0.25 ∗ (1 − 0.5(ri + rj )),

otherwise

The design logic is that endcap distance measurements by on-board elastic stretch sensors are often not accurate enough and may enlarge error when both the end caps of a rod are visible, despite that these measurements are rather helpful when endcaps are occluded. When the number of corresponding points for both endcaps of a rod is large, i.e. ri > 0.5 and rj > 0.5, point registration is likely to give an accurate estimate, and the loss introduced by the distance measurement is not considered. However, when either of the endcaps on a rod is heavily occluded, i.e. ri < 0.2 or rj < 0.2, then the distance measurement must be taken into account and is set to be a constant value (0.25). The binary weights are otherwise linear to the sum of the ratio ri . These coefficients and thresholds are empirical, but they work sufficiently well in the experiments.

144

S. Lu et al.

4.5

Noise Filtering

Sensor noise in modern Lidar-based RGB-D cameras is typically small. Nevertheless, it is observed in real experiments that misalignment of the color and depth images does occur, and the depth quality is bad at the object edges. An example is shown in Fig. 1(b). This noise is usually not a big issue for target objects with a large observable region, but it turns out to be rather problematic for tensegrity robots as the endcaps are quite small (radius = 1.75 cm in the experiment). Although this noise can be partially removed by careful tuning of the camera setup, it is sensitive to many uncontrollable factors, such as lighting conditions in real environments. To minimize the sensor noise, observed points for each endcap are filtered given the prior knowledge of the endcap size. To be more specific, observed points are first filtered by the HSV range in color space and max correspondence distance in 3D space. Noisy points are further removed if their depth are greater than the minimum depth of the remaining points plus the endcap radius.

5 Experiments 5.1

Hardware Setup

The real-world data are collected on a three-bar tensegrity robot with nine elastic stretch sensors connecting the nodes. The length of of each bar is 36 cm. Six of the sensors run parallel to the six cables that actuate the robot while the other three longer sensors are passive. The sensors are made from silicone elastomers (Eco-Flex 30 or DragonSkin 20; Smooth-On, Inc.) encapsulating two layers of a liquid metal paste made from eutectic gallium indium. The two layers of liquid metal compose a parallel plate capacitor whose capacitance varies linearly as the sensor is stretched [11]. The capacitance of each sensor is measured on-board with a commercial capacitive sensing breakout board (MPR121; Adafruit). The sensors are calibrated individually to obtain a linear model mapping the sensor’s capacitance to its length. Then, the sensors are used for feedback control during actuation trajectories, and the length of each sensor is published at each time step. The color and depth image sequences are collected by an RGB-D camera (Intel Realsense L515) at 720p resolution. In order to get the ground truth poses for quantitative evaluation, a commercial motion capture system (Phasespace, Inc.) is used. Markers are placed on the endcaps of rods to measure their positions. Note that measurements from the motion capture system also contain some small error, but this error is difficult to quantify. Figure 3 shows the hardware setup and the environment where the robot experiments are performed. 5.2

Data Collection

The proposed algorithm is evaluated on a set of self-collected robot trajectories. In total, 16 unique trajectories are collected where the tensegrity robot is placed at different starting positions and a different set of controls are performed. For each trajectory, a sequence of RGB-D images and their corresponding cable length measurements are aligned and recorded at a frame rate 10 Hz. The shortest sequence contains 41 frames while the longest contains 130 frames. In total, this dataset contains 1524 frames, and

Tensegrity Robot Perception

145

Fig. 3. Hardware setup of the experiments. An RGB-D camera (RealSense L515) is placed overhead to capture the whole region where the three-bar tensegrity robot performs locomotion. A motion capture system (Phasespace, Inc.) with 6 cameras is placed around this region to capture endcaps markers to generate ground truth for the evaluation purpose.

about 95 frames on average for each trajectory. Note that markers placed on endcaps are not always visible to the motion capture system. The visibility of a marker from the camera and from the motion capture system is, however, different. An endcap occluded from the camera may have its marker visible to the motion capture system and vice versa. In total, the number of ground truth poses for all rods in all videos is 2786. Detailed statistics of the dataset are shown in Fig. 4. 5.3 Evaluation Metric For quantitative evaluation of pose estimation, the mean and standard deviation of both translation and rotation errors are reported. Due to rotational symmetry along the main axis of rods, the rotation error is evaluated using the angle between the estimated and ground truth main axis, i.e. the pose errors of rods are evaluated on five degrees of freedom. Pose estimation results are also evaluated using the “2 cm-5 deg” metric, which reports the percentage of pose estimates that are within both 2 cm translation error and 5◦ rotation error. The translation error of the center of mass is also evaluated, where the center of mass of the robot is computed by averaging the center positions of all rods. Apart from the pose estimation result in the global frame, the local shape reconstruction result is also evaluated. This result is correlated to the pose estimation result, but it is independent of coordinate frames. It is evaluated by comparing predicted distances between endcaps with measured distances from the motion capture system. Smaller distance error indicates a better shape reconstruction in the local frame. 5.4 Baselines and Ablation Studies The first baseline method is naive ICP performed on each rod. Dummy points are added to improve stability because this baseline can easily lose track without them. However,

146

S. Lu et al.

Fig. 4. Statistics of the dataset. The dataset contains 16 videos that are captured 10 Hz. This plot shows the number of frames in each video, and the number of available ground truth poses for rods with R/G/B endcaps respectively. The ground truth pose of a rod is only available when both of its endcap markers are visible to the motion capture system, which is not often the case in real experiments. This, however, motivates the proposed work.

cable length measurements are not used, and no joint optimization is performed. The second baseline first estimates the shape of the tensegrity robot using the endcap distance measurements from on-board sensors. A constrained optimization is done which minimizes the mean squared error (MSE) between the predicted distances between endcaps and the measured ones from on-board sensors, while it constrains the distance between the two endcaps of each rod as the rod length lrod . The reconstructed robot is then considered as a rigid body and ICP is used for registration. The third baseline considers all the constraints in Sect. 4.3, and performs a correction step similar to the proposed method. The correction step, however, is not integrated in the iterative process. Instead, it first performs ICP for a few iterations and then does one correction step at the end. In addition, a set of ablation studies are done to verify the effectiveness of the proposed physical constraints and adaptive weights. The first ablation study removes the ground constraint and physical constraints between rods. The second ablation study only removes the physical constraints. The third ablation study keeps all constraints, but uses a static weight (0.25) instead of dynamic weights in the correction step. In addition to those ablation studies, the error of distance between endcaps, which is measured by the elastic stretch sensors, is reported. Note that all the hyperparameters, such as max correspondence distance thresholds, remain the same for all baselines and all testing trajectories for a fair comparison. 5.5

Quantitative Results

Quantitative results of pose estimation and local shape reconstruction are shown in Table 1. The first baseline, naive ICP, has the worst performance in terms of local

Tensegrity Robot Perception

147

Table 1. Quantitative results of pose tracking and local shape reconstruction. Method

Trans. Err (cm) Rot. Err (deg) 2 cm-5 deg (%) CoM Err (cm) Shape Err (cm)

Baseline 1 1.56 / 1.89

4.55 / 7.96

73.1

1.07 / 0.76

1.30 / 1.81

Baseline 2 1.83 / 1.64

5.59 / 6.62

59.2

1.59 / 1.17

1.18 / 1.67

Baseline 3 1.10 / 0.75

3.23 / 2.91

79.2

0.81 / 0.44

0.92 / 0.91

Ablation 1 1.22 / 0.96

3.81 / 9.28

79.2

1.01 / 0.60

0.89 / 1.00

Ablation 2 1.04 / 0.69

3.46 / 9.10

82.3

0.79 / 0.43

0.85 / 0.80

Ablation 3 1.02 / 0.60

2.88 / 2.16

85.3

0.79 / 0.42

0.84 / 0.68

Measured

N/A

N/A

N/A

N/A

1.09 / 1.19

Proposed

0.99 / 0.58

2.84 / 2.12

85.5

0.77 / 0.44

0.80 / 0.64

shape reconstruction, and the second largest tracking error in the global frame. Even though dummy points are added, it still loses tracking in multiple trials and is not able to recover. The second baseline has an even larger tracking error than the naive ICP, which may seem a little surprising. This is because the on-board sensors are often noisy, and the robot being treated as a rigid body after erroneous shape reconstruction further deteriorates the tracking performance. The third baseline performs much better by considering all the physical constraints and performing a correction step. However, ICP and the correction step are performed separately rather than in an interleaving way. Thus, it has an inferior performance than the proposed approach. The first two ablation studies shows the benefits of the ground constraints and constraints between rods. Without both constraints, the result of the first ablation is not satisfactory despite being better than the naive ICP. With ground constraints, the translation error in the second ablation is much reduced but the rotation error remains large. The third ablation introduces constraints between rods, and significantly decreases the rotation error. The proposed method performs even better than the third ablation using dynamic weights in the joint optimization. Overall, the proposed method has a mean translation error within 1 cm, mean rotation error within 3◦ , 85.5% of the rod pose estimates within 2 cm-5 deg error range, and mean center of mass error of 0.77 cm for the robot as a whole. It also has the smallest shape reconstruction error in the local frame, which is reflected by the mean distance error of 0.8 cm between endcaps. 5.6 Qualitative Results An exemplary trial is shown in Fig. 5, where a robot is moving on a lawn, and the motion capture system is not available. Despite noisy background and self-occlusions, the proposed method is able to give highly accurate pose estimates, while the estimates from the naive ICP baseline have notable errors. In Fig. 6, distances between endcaps at each frame is plotted for another exemplary trial. This is to show how the local shape reconstruction matches with the ground truth. Those predicted distances are compared with the measured distances from the elastic stretch sensors and the motion capture system (ground truth). The predicted results have smaller mean distance error (0.8 cm) than those directly measured by the elastic stretch sensors (1.2 cm). Some regions are highlighted in purple boxes to show the differences. The regions where the elastic stretch

148

S. Lu et al.

Fig. 5. Qualitative results. In this exemplary trial, the tensegrity robot moves from the bottom right to the top left. The left most column shows the raw RGB observation. The second and third columns show the rendered robot at the estimated poses by the naive ICP baseline and the proposed method respectively. The last column show the colored point cloud and robot estimated by the proposed method from some other viewpoints. The proposed method has superior performance over the naive ICP baseline especially when notable occlusion occurs.

sensors have large errors are typically due to the sensors going slack. The proposed method, however, is robust to those noisy measurements. Figure 7 shows the effectiveness of physical constraints via another example. Without physical constraint, rods can freely cross through each other. In this example, the red rod is above the green rod and blue rod in the first frame, but it’s estimated to be underneath them in the next frame, which is physically infeasible. The blue and green rods are also colliding with each other in this estimate. The proposed physical constraint can effectively prevent such unrealistic estimates from being generated, further improving the accuracy. 5.7

Running Time

The running time is computed on a desktop with an AMD Ryzen 5900 CPU. For each iteration, the transition step takes ∼ 2 ms, and the correction step takes ∼ 4 ms. The method can achieve real-time performance (30 Hz) with 6 iterations at each time step. Note that the accompanying implementation is written in Python, and a faster speed is expected with a more sophisticated implementation in C++.

Tensegrity Robot Perception

149

Fig. 6. Plots of distances between endcaps. These plots reflect the quality of local shape reconstruction in this exemplary trial. The predicted distances between endcaps by the proposed method (blue lines) are compared with those measured by on-board sensors (yellow lines) and by the motion capture system (green lines, used as ground truth). The predicted distances have even smaller mean error (0.8 cm) than that from on-board sensors (1.2 cm), which reflects a better shape reconstruction in the local frame. Some notable differences are highlighted in purpose boxes.

Fig. 7. Effectiveness of constraints between rods. This figure compares the pose estimates with and without rod constraints at certain frames. Large error occurs at frame 6 without using rod constraints. The rod with red endcaps crosses through the other two from the previous frame, which is physically infeasible. The blue and green ones are also colliding with each other. As a comparison, the estimates with rod constraints are physical valid and are more accurate.

150

S. Lu et al.

6 Limitations One limitation of this work is that manual annotation of the 2D region of interest is required for the first frame in order to compute an initial pose. Besides, tracking failure does happen at some time step despite being robust for most of the time. A failure case is shown in Fig. 8 where the movement of the robot is significant between two consecutive frames. Although the tensegrity robot is not able to move at a high speed, data transmission via Bluetooth can be unstable sometimes. After investigation, it turns out that not enough iterations were performed before the max correspondence distance dmax decreases to a value that excludes the observed points. This could be resolved by running more iterations. It is, however, tricky to set such hyperparameters, and there is a trade-off between the number of iterations and the running speed. This work is not data-driven and some manual tuning of these hyperparameters is still required.

Fig. 8. Failure case. A large tracking error occurs at the frame 16 of this video. The robot has a significant movement between two consecutive frames and not enough iterations were performed to make the estimates converge. Tracking is fully recovered later at frame 18.

7 Conclusion This work proposes a novel iterative algorithm for 6N-DoF pose tracking of N-bar tensegrity robots. Compared to the standard ICP algorithm for rigid objects, the proposed method fuses the cable length measurement as a soft constraint in the joint optimization process. Meanwhile, a set of physical constraints are introduced to prevent physically invalid pose estimates from being generated. A number of techniques are further applied to improve the robustness of pose estimation against sensor noise. Real-world experiments are performed with an exemplary three-bar tensegrity robot to showcase the robustness and effectiveness of the proposed tracking method, which has a rather small pose estimation error of 1 cm and 3◦ for each rod on average. One possible future work is global registration of tensegrity robots, so that the requirement of manual annotation at the initial frame can be removed, and tracking failures can be easily recovered.

Tensegrity Robot Perception

151

References 1. Baines, R.L., Booth, J.W., Kramer-Bottiglio, R.: Rolling soft membrane-driven tensegrity robots. IEEE Robot. Autom. Lett. 5(4), 6567–6574 (2020) 2. Besl, P.J., McKay, N.D.: Method for registration of 3-D shapes. In: Sensor fusion IV: Control Paradigms and Data Structures, vol. 1611, pp. 586–606. Spie (1992) 3. Booth, J.W., Cyr-Choiniere, O., Case, J.C., Shah, D., Yuen, M.C., Kramer-Bottiglio, R.: Surface actuation and sensing of a tensegrity structure using robotic skins. Soft Rob. 8(5), 531–541 (2021) 4. Bruce, J., Caluwaerts, K., Iscen, A., Sabelhaus, A.P., SunSpiral, V.: Design and evolution of a modular tensegrity robot platform. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 3483–3489. IEEE (2014) 5. Bruce, J., et al.: SUPERball: exploring tensegrities for planetary probes. In: 12th International Symposium on Artificial Intelligence, Robotics, and Automation in Space (i-SAIRAS) (2014) 6. Caluwaerts, K., Bruce, J., Friesen, J. M., SunSpiral, V.: State estimation for tensegrity robots. In: 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 1860– 1865. IEEE (2016) 7. Caluwaerts, K., et al.: Design and control of compliant tensegrity robots through simulation and hardware validation. J. R. Soc. Interface 11(98), 20140520 (2014) 8. Chen, M., Liu, J., Skelton, R.E.: Design and control of tensegrity morphing airfoils. Mech. Res. Commun. 103, 103480 (2020) 9. Chen, Y., Medioni, G.: Object modelling by registration of multiple range images. Image Vis. Comput. 10(3), 145–155 (1992) 10. Friesen, J., Pogue, A., Bewley, T., de Oliveira, M., Skelton, R., Sunspiral, V.: DuCTT: A tensegrity robot for exploring duct systems. In: 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 4222–4228. IEEE (2014) 11. Johnson, W.R., Booth, J., Kramer-Bottiglio, R.: Integrated sensing in robotic skin modules. In: 2021 IEEE Sensors, pp. 1–4. IEEE (2021) 12. Kabsch, W.: A solution for the best rotation to relate two sets of vectors. Acta Crystallogr. A: Cryst. Phy. Diffr. Theor. Gen. Crystallogr. 32(5), 922–923 (1976) 13. Katz, S., Tal, A., Basri, R.: Direct visibility of point sets. In: ACM SIGGRAPH 2007 papers, pages 24-es (2007) 14. Kim, K., Moon, D., Bin, J.Y., Agogino, A.M.: Design of a spherical tensegrity robot for dynamic locomotion. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 450–455. IEEE (2017) 15. Kuzdeuov, A., Rubagotti, M., Varol, H.A.: Neural network augmented sensor fusion for pose estimation of tensegrity manipulators. IEEE Sens. J. 20(7), 3655–3666 (2020) 16. Lessard, S., et al.: A bio-inspired tensegrity manipulator with multi-DOF, structurally compliant joints. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5515–5520. IEEE (2016) 17. Moldagalieva, A., Fadeyev, D., Kuzdeuov, A., Khan, V., Alimzhanov, B., Varol, H.A.: Computer vision-based pose estimation of tensegrity robots using fiducial markers. In: 2019 IEEE/SICE International Symposium on System Integration (SII), pp. 478–483 (2019) 18. Mundermann, L., Corazza, S., Andriacchi, T.P.: Accurately measuring human movement using articulated ICP with soft-joint constraints and a repository of articulated models. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–6. IEEE (2007) 19. NASA. NASA Tensegrity Robotics Toolkit, Accessed (2020). https://github.com/NASATensegrity-Robotics-Toolkit/NTRTsim

152

S. Lu et al.

20. Park, J., Zhou, Q.Y., Koltun, V.: Colored point cloud registration revisited. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 143–152 (2017) 21. Rusinkiewicz, S. Levoy, M.: Efficient variants of the ICP algorithm. In: Proceedings of the Third International Conference on 3-D Digital Imaging and Modeling, pp. 145–152. IEEE (2001) 22. Sabelhaus, A.P., et al.: Design, simulation, and testing of a flexible actuated spine for quadruped robots. arXiv preprint arXiv:1804.06527 (2018) 23. Segal, A., Haehnel, D., Thrun, S.: Generalized-ICP. In: Robotics: Science and Systems, vol. 2, p. 435. Seattle, WA (2009) 24. Shah, D.S., et al.: Tensegrity robotics. Soft Robot. 9, 639–656 (2021) 25. Surovik, D., Bruce, J., Wang, K., Vespignani, M., Bekris, K.E. Any-axis tensegrity rolling via bootstrapped learning and symmetry reduction. In: International Symposium on Experimental Robotics (ISER), Buenos Aires, Argentina (2018) 26. Surovik, D., Wang, K., Vespignani, M., Bruce, J., Bekris, K.E.: Adaptive tensegrity locomotion: controlling a compliant icosahedron with symmetry-reduced reinforcement learning. Int. J. Robot. Res. (IJRR) 40, 375–396 (2019) 27. Tagliasacchi, A., Schr¨oder, M., Tkach, A., Bouaziz, S., Botsch, M., Pauly, M.: Robust articulated-ICP for real-time hand tracking. In: Computer Graphics Forum, vol. 34, p. 101– 114. Wiley Online Library (2015) 28. Tietz, B.R., Carnahan, R.W., Bachmann, R.J., Quinn, R.D., SunSpiral, V.: Tetraspine: Robust terrain handling on a tensegrity robot using central pattern generators. In: 2013 IEEE/ASME International Conference on Advanced Intelligent Mechatronics, pp. 261–267. IEEE (2013) 29. Vespignani, M., Friesen, J.M., SunSpiral, V., Bruce, J.: Design of superball v2, a compliant tensegrity robot for absorbing large impacts. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2865–2871. IEEE (2018) 30. Wang, K., Aanjaneya, M., Bekris, K.: A first principles approach for data-efficient system identification of spring-rod systems via differentiable physics engines. In: Bayen, A.M., et al. (eds.) Proceedings of the 2nd Conference on Learning for Dynamics and Control, volume 120 of Proceedings of Machine Learning Research, pp. 651–665. PMLR, 10–11 June 2020 31. Wang, K., Aanjaneya, M., Bekris, K.: Sim2sim evaluation of a novel data-efficient differentiable physics engine for tensegrity robots. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1694–1701. IEEE (2021) 32. Wang, K., Aanjaneya, M., Bekris, K.: A recurrent differentiable engine for modeling tensegrity robots trainable with low-frequency data. In: 2022 IEEE International Conference on Robotics and Automation (ICRA) (2022)

Scale-Invariant Fast Functional Registration Muchen Sun1(B) , Allison Pinosky1 , Ian Abraham2 , and Todd Murphey1 1

Northwestern University, Evanston, IL 60208, USA [email protected] 2 Yale University, New Haven, CT 06511, USA

Abstract. Functional registration algorithms represent point clouds as functions (e.g. spacial occupancy ﬁeld) avoiding unreliable correspondence estimation in conventional least-squares registration algorithms. However, existing functional registration algorithms are computationally expensive. Furthermore, the capability of registration with unknown scale is necessary in tasks such as CAD model-based object localization, yet no such support exists in functional registration. In this work, we propose a scale-invariant, linear time complexity functional registration algorithm. We achieve linear time complexity through an eﬃcient approximation of L2 -distance between functions using orthonormal basis functions. The use of orthonormal basis functions leads to a formulation that is compatible with least-squares registration. Beneﬁted from the least-square formulation, we use the theory of translation-rotationinvariant measurement to decouple scale estimation and therefore achieve scale-invariant registration. We evaluate the proposed algorithm, named FLS (functional least-squares), on standard 3D registration benchmarks, showing FLS is an order of magnitude faster than state-of-the-art functional registration algorithm without compromising accuracy and robustness. FLS also outperforms state-of-the-art correspondence-based leastsquares registration algorithm on accuracy and robustness, with known and unknown scale. Finally, we demonstrate applying FLS to register point clouds with varying densities and partial overlaps, point clouds from diﬀerent objects within the same category, and point clouds from real world objects with noisy RGB-D measurements.

1

Introduction

Point cloud registration is the problem of transforming one set of points, through rotation, translation, and potentially scaling, in order to align with another set of points. This problem appears as a fundamental component across a variety of tasks in robotics and computer vision, such as object localization [1], medical image processing [2], 3D reconstruction [3], and sensor pose estimation [4], etc. In its essence, the problem of registration is posed as the minimizing the distance between two point clouds. However, point clouds are by nature unordered, and the distance metric between two unordered sets is not well deﬁned. The most commonly used framework computes distance by introducing correspondences between the two point clouds; points that are sampled from the same c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 153–169, 2023. https://doi.org/10.1007/978-3-031-25555-7_11

154

M. Sun et al.

Fig. 1. Functional registration does not rely on local geometry-based features but instead capturing the global structure of the point cloud. Our functional registration algorithm can register diﬀerent objects’ CAD models from the same category with unknown scale.

location or region in the physical world. With correspondences, the unordered point sets can be ordered enabling least-squares optimization problems to be formulated that minimizes the Euclidean distances between corresponding points. The least-squares problem can be solved eﬃciently by methods like the GaussNewton method or the Levenberg-Marquardt method [5], and the computation process is parallelization-friendly. However the accuracy of correspondence estimation is sensitive to noise, and the least squares formulation is known to be vulnerable against incorrect correspondences (outliers) [5]. Therefore, in practice it is hard to obtain correspondences that are suﬃcient for least-squares registration, especially when the point clouds are corrupted by large noise, have varying densities, or are only partially overlapped [6,7]. Another way to compute distance between unordered sets is to abstract the point set into a representation function (e.g., a probability distribution or an occupancy ﬁeld) and minimize the distance between the representation functions—distance metric in function space is well-deﬁned. This approach, generally known as functional registration, avoids unreliable local geometry-based feature extraction and correspondence estimation process by capturing the global geometry of the point clouds (e.g., the general shape of the point cloud). However, functional registration algorithms are often computationally expensive, since the computing the distance between two functions is in general computationally intractable. In one of the early works of functional registration [8], the L2 -distance between Gaussian-mixtures is used, which has a closed-form solution. But computing the closed-form distance has a quadratic complexity with respect to number of points, and thus is only practical for small scale problems. In this work, we combine the computational eﬃciency and simplicity of leastsquares registration with the robustness of functional registration. Inspired by

Scale-Invariant Fast Functional Registration

155

the study of ﬂuid mixing [9], we propose a mapping that maps an unordered vector set to an ordered orthonormal basis function coeﬃcient set. The mapped function space provides us a permutation-invariant, easy to compute distance metric between two unordered sets and formulates a least-squares registration algorithm without correspondences. This set-to-function mapping is surprisingly simple– all one needs is to average the orthonormal basis functions over the point sets. The proposed registration algorithm, named as FLS (functional least-squares) has the following contributions: 1. FLS has competitive accuracy and robustness with state-of-the-art functional registration algorithm, while being an order of magnitude faster; 2. FLS has a linear time-complexity w.r.t. number of points, it can register up to 10,000 points in 1 s using only a single CPU1 ; 3. FLS supports decoupled scale estimation, and thus can register rotation and translation with unknown scale. We validate FLS on standard 3D registration benchmarks, showing that FLS is an order of magnitude faster than state-of-the-art functional registration algorithm without compromising registration accuracy and robustness (Fig. 1). We also show that FLS outperforms state-of-the-art correspondence-based registration algorithm while maintaining a sub-second registration speed with up to 10,000 points.

2

Related Works

Correspondence-Based Registration: The seminal work of iterative closest point (ICP) algorithm [10] provides a simple yet eﬀective way to establish correspondences. This approach has been extended and widely applied for point cloud-based odometry [11,12]. Another family of methods for estimating correspondences uses feature descriptors to ﬁnd geometrically distinctive points and match them across point clouds, such as the fast point feature histogram (FPFH) [13] and the normal aligned radial feature (NARF) [14]. However, with the presence of noise, the local geometry of the point clouds can be blurred and feature descriptors may return wrong correspondences (outliers). Besides better correspondence estimation, performance of correspondencebased registration can also be improved through a more robust least-squares solver. Random sampling consensus (RANSAC) [15] is one of the most commonly used framework for robust least-squares registration, and its result is often used as the initial estimation for ICP algorithms. However, RANSAC has an exponential runtime w.r.t. the outlier ratio and may still fail with large outlier ratio [6]. Other robust registration algorithms include the Guaranteed Outlier REmovel (GORE) algorithm [16] and the practical maximum clique (PMC) algorithm [17]. Yang et al. propose the truncated least-squares (TLS) formulation and provide ﬁrst certiﬁable robust registration algorithm in [6], which could handle registration problems with outlier ratio higher than 90%. The authors also introduce the 1

This is achieved through parallelization using 12 CPU threads.

156

M. Sun et al.

theory of translation-rotation-invariant measurements (TRIMs) and translationinvariant-measurements (TIMs) for decoupled scale estimation (with unknown rotation and translation) and decoupled rotation estimation (with known scale but unknown translation), given correspondences between the original point sets. Another approach to estimate correspondences relies on deep learning pipelines. Deep learning methods are ﬁrst applied to design data-driven feature descriptors, such as in [18,19]. Learning-based methods are also utilized for combined feature extraction and correspondence matching, such as in [20,21]. [22] propose a probabilistic registration method that extracts pose-invariant correspondences through probabilistic point-to-distribution matching inside a learned latent space. However, deep learning methods are inherently limited to the training set and may not generalize to new objects. Functional Registration: As its name indicates, functional registration algorithms represent point clouds as functions, such as spatial density distribution or Gaussian-mixture distribution, and perform registration through minimization of the “diﬀerence” between the functions. Commonly used metrics include the L2 -distance in function space or the Kullback-Leibler divergence. In the seminal work [8,23], a Gaussian-mixture model is constructed by attaching a Gaussian distribution to each point in a point cloud, and the authors compute the closed form solution for the L2 -distance between such Gaussian-mixture distributions. However, this method has a quadratic time-complexity with respect to the number of points and is thus limited to small-scale problems. More recently, Bernreiter et al. [24] introduce a correspondence free and global registration algorithm by correlating the point clouds in the frequency domain through Fourier transform. However, the proposed method is not in a continuous domain, thus the accuracy depends on the resolution of Fourier transform. It also requires extra information such as intensity or range. The work of nonparametric continuous sensor registration [25] is the closest to our work. The authors show that by constructing representation functions from the point clouds in a reproducing kernel Hilbert space, minimizing the L2 -distance between the functions is equivalent to maximizing the inner product of the functions. However, computing the inner product, though it can be accelerated through the sparsity of the kernel, still has a worst-case quadratic time complexity. This approach is also speciﬁed for visual odometry by integrating extra information such as color, intensity, or semantic information in [26,27]. However, none of the aforementioned algorithm supports registration with unknown scale.

3

Notation

A In the point cloud registration problem, we are given two point sets A = {ai }N i=1 NB d and B = {bj }j=1 , with ai , bj ∈ R , d ∈ {2, 3}. We assume there are no redundant elements in the point sets. In this paper, we look for three transformations in order to register point set A to B: scale, rotation, and translation, which are represented by a positive real number s ∈ R+ , a rotation matrix R ∈ SO(d), and a d-dimensional vector t ∈ Rd , respectively. We use the notation

Scale-Invariant Fast Functional Registration

157

A A(R, t, s) = {sRai + t}N i=1 to represent the new set after applying the above transformation to every point in set A. We call point set A the source point set/cloud, and point set B the target point set/cloud.

4 4.1

Functional Point Cloud Registration The Delta-Distance Between Unordered Sets

For any subset of A ⊆ Rd , we can attach a Dirac delta function to each of its element, and thus transform a set to a function. We name this function the delta-mixture function of a set: Deﬁnition 1 (Delta-mixture function). Given a finite set A={a1 . . . aNA }, ai ∈ Rd , the delta-mixture function δA : Rd → {0, +∞} is defined as: δA (x) =

NA 1 δai (x) NA i

(1)

where δai (x) is the Dirac delta function with infinite impulse at ai . The delta-mixture function (1) can be considered as an occupancy ﬁeld, as it fully represents the occupancy of the point cloud in the space. This motivates us to use the L2 -distance between two delta-mixture functions as the distance between two unordered point sets. Furthermore, delta-mixture functions allow closed-form orthonormal decomposition which can simplify the computation of L2 -distance (see Sect. 4.2), unlike Gaussian-mixtures. Deﬁnition 2 (Delta-distance between sets). The delta-distance between two sets is defined as the L2 -distance between the two delta-mixture functions that are generated from the sets: 2 2 dδ (A, B) = δA (x)−δB (x), δA (x)−δB (x) = δA (x)−δB (x) dx (2) = X

1 NA

NA i

δai (x) −

1 NB

NB

X

2 δbi (x)

dx

(3)

i

Lemma 1. The delta-distance between two sets A and B is permutationinvariant (if sets A and B are ordered). Proof. The permutation-invariance of delta-distance can be proved from the permutation-invariance of the summation operator in Eq. (3). The L2 -distance is symmetric and positive-deﬁnite, which makes it an ideal metric to measure geometric similarity. However, this formulation in Eq. (3) cannot be directly used in practice since the delta-distance is zero only when two point sets are exactly the same. Below we propose a continuous relaxation of the delta-distance through orthonormal decomposition of delta-mixture function. This continuous relaxation takes an advantage of the Dirac delta function—the inner product between a Dirac delta function and a continuous function has a closed form solution. This relaxation also allows points to be not exactly at the same location.

158

4.2

M. Sun et al.

Continuous Relaxation of the Delta-Distance

Deﬁnition 3 (Orthonormal basis function). An infinite set of functions {f1 , f2 , . . . } forms the orthonormal bases of the function space with the same domain and co-domain, if the following two properties are satisfied: fi , fj = fi (x)fj (x)dx = 0, ∀i = j (Orthogonality) (4) X fi (x)2 dx = 1, ∀i (Normality) (5) fi , fi = X

where ·, · denotes the inner product in function space. Corollary 1 (Orthonormal decomposition). Given a set of orthonormal function bases {f1 , f2 , . . . }, a function g(x) with the same domain and co-domain can be decomposed as [28]: g(x) =

∞

fk , gfk (x) =

k=1

∞

fk (x)g(x)dx · fk (x).

(6)

k=1

Corollary 2. The inner product between a Dirac delta function δa (x) and a continuous function g(x) has a closed-form solution: δa , g = g(a).

(7)

Proof. This can be proved using the deﬁnition of delta function:

δa (x)dx = 1.

Lemma 2. Based on Corollary 1 and 2, given a set of orthonormal function N bases {f1 , f2 , . . . }, a delta-mixture function δA (x) = N1A i A δai (x) can be decomposed as: δA (x) =

∞ k=1

cA k · fk (x),

cA k =

NA 1 fk (ai ). NA i

(8)

Remark 1. The delta-mixture function (1) can be considered as an occupancy ﬁeld extracted from the point cloud, and the decomposed delta-mixture function (8) can be considered as an implicit representation of the ﬁeld, similar to other implicit ﬁeld representations such as the Neural Radiance Field (NeRF) [29]. In practice we use normalized Fourier basis functions as the orthonormal bases, and NeRF shares a similar strategy by using Fourier feature mapping (also called positional encoding). The diﬀerence is NeRF does not use Fourier features as orthonormal bases, but instead use them to transform the Neural Tangent Kernel (NTK) of the multi-layer perceptron (MLP) to a stationary kernel [29].

Scale-Invariant Fast Functional Registration

159

Theorem 1. The delta-distance between two sets (3) can be computed as: dδ (A, B) = 2

∞ k=1

2 NA NB ∞ A 2 1 1 ck − cB fk (ai )− fk (bi ) = . k NA i=1 NB i=1

(9)

k=1

where {f1 , f2 , . . . } is the set of orthonormal function bases. Proof. See appendix A in [30]. Remark 2. The structure of Eq. (9) exhibits a similar pattern as the PointNet [31], in the sense that they both use summation to achieve permutationinvariance. The diﬀerence is that PointNet has one sum pooling function, whereas our formulation has multiple summations and we show that when the number of summations approaches inﬁnity, Eq. (9) approaches the exact L2 -distance between two delta-mixture functions. With Theorem 1, we can approximate a continuous delta-distance between two sets by choosing a finite set of continuous orthonormal function bases to compute (9). Below we specify the details for using this approximation for point cloud registration. 4.3

Functional Point Cloud Registration

In this paper we choose to use normalized Fourier basis functions for orthonormal decomposition. Deﬁnition 4 (Normalized Fourier basis functions). Define a function f : X → R, where X = [Ll1 , Lu1 ] × · · · × [Lld , Lud ] ⊂ Rd is a d−dimensional rectangular space, Lli and Lui are the lower and upper bound for the d-th dimension, respectively. For any function defined as above, we can define the following normalized Fourier basis functions as the bases for orthonormal decomposition [9]: fk (x) =

d 1

cos k¯i (xi − Lli ) hk i=1

(10)

where x = (x1 , x2 , . . . , xN ) ∈ Rd , k = [k1 , · · · , kd ] ∈ [0, 1, 2, · · · , K]d ⊂ Nd 12 d u l

π L − L k i i i , hk = k¯i = u 2 Li − Lli i=1 Now, we ﬁnally combine (9) and (10) to deﬁne the least-squares formulation for functional registration, named as functional least-squares (FLS). We ﬁrst introduce FLS for rotation and translation estimation with given scale.

160

M. Sun et al.

Deﬁnition 5 (FLS registration with known scale). When the scale s is given, FLS for rotation and translation estimation is defined as: rk2 (R, t) (11) arg min R,t

where

k∈[0,...,K]d

λk · δA(R,t,s) , fk − δB , fk NA NB 1 1 = λk · fk (s · Rai + ti ) − fk (bi ) NA i=1 NB i=1 − d+1 2 λk = 1 + k 2

rk (R, t) =

(12) (13) (14)

where {λk } is a convergent series to bound the L2 -distance. This speciﬁc form of λk above is inspired by the Sobolev norm formulation in [9], which assigns diminishing weights to basis functions with increasing frequencies. It encourages the algorithm to capture the global structure of the point cloud, and thus more robust against outlier points. Note that the complexity of evaluating the FLS residual function (13) has a linear complexity of number of points. Further, even though in (11) the number of residuals is Kd , in practice we found K = 5 to be suﬃcient, thus for a 3D registration task, only 125 residual functions are needed. Furthermore, since (11) is a multi-scale measure [9], the number of residuals does not depend on the size, scale or density of the point cloud, and thus can be assumed ﬁxed. Decoupled Scale Estimation. The concept of translation and rotation invariant measurements (TRIMs) is proposed in [6] in order to decouple scale estimation from rotation and translation estimation. Deﬁnition 6 (Translation rotation invariant measurements). The translation rotation invariant measurements (TRIMs) of a point set A = {a1 , . . . , aNA } is a set of scalar measurements: TRIMA = ai − aj 1 ≤ i, j ≤ NA , i = j (15) The intuition behind TRIMs is that, the distances between one point cloud’s own points are only aﬀected by the scale of the point cloud, and two point clouds with same number of points and same scale should have same TRIMs. Based on this concept, the decoupled scale estimation of FLS is formulated as minimizing the delta-distance between the TRIMs of two point cloud. Deﬁnition 7 (FLS for scale estimation). Given two point sets with unknown relative rotation and translation, FLS for scale estimation is defined as the following one-dimensional registration problem: rk2 (s) (16) arg min s

k∈[0,...,K]

Scale-Invariant Fast Functional Registration

161

where 1 (17) λk · dδ (TRIMA(s) , TRIMB ) 2 NA NA NB NB i=1 j=i+1 fk (s ai − aj ) i=1 j=i+1 fk ( bi − bj ) = 2 λk · − (NA − 1)NA (NB − 1)NB

rk (s) =

2 −1

λk = 1 + k

5

(18) (19)

Experiments and Application

5.1

Implementation Details

We implement FLS with the Ceres nonlinear least-squares solver [32]. We use its implementation of the Levenberg-Marquardt method to solve the least-squares in (11). When estimating the optimal rotation and translation, we optimize directly in the space of SE(3). Through all the following experiments, we use 5 orthonormal basis functions per dimension (thus 125 residuals for 3D point cloud registration, and 5 residuals for scale estimation). Given the fast registration speed of FLS, we can use its registration result to initialize ICP to further reﬁne the result with almost no speed loss. We name this method as FLS−ICP. We release the source code of FLS and benchmark tools at https://sites.google.com/view/ﬂs-isrr2022/. 5.2

Rationale for Test Algorithms

We compare FLS and FLS−ICP with six other algorithms: iterative closest point algorithm (ICP) [10], iterative closest point algorithm with initial estimation from the random sample consensus algorithm (RANSAC) and FPFH features (RANSAC−ICP), global iterative closest point algorithm (Go−ICP) [33], truncated least squares estimation and semideﬁnite relaxation (TEASER) [6], continuous sensor registration (CSR) [25], and deep Gaussian mixture registration (DeepGMR) [22]. We choose ICP and RANSAC−ICP as the baselines, and Go−ICP provides global optimization for classical ICP. We choose TEASER as the state-of-the-art robust correspondence-based registration algorithm, we use the FPFH features [13] to estimate correspondences for TEASER through out the experiments2 . CSR is the state-of-the-art functional registration. DeepGMR is a deep learning-based pipeline that provides globally optimal registration without iterative optimization. We accelerate FLS, CSR, and TEASER all with 12 CPU threads using OpenMP.

2

It’s unclear how FPFH features can be applied across diﬀerent scales, and thus for scale estimation, we use the default order the point clouds as correspondences.

162

5.3

M. Sun et al.

Rationale for Experiment Design

We test algorithms exclusively on the ModelNet40 dataset [34], which contains CAD models of 40 categories of objects. These CAD models allow generation of varying types of synthetic point clouds for testing with a given ground truth. This dataset has been widely used as a benchmark for point cloud registration algorithms. The beneﬁt of synthetic dataset is that it provides a systematic and controllable way to examine diﬀerent aspects of an algorithm. We examine the tested algorithms from the aspects of: (1) robustness against noise; (2) robustness against partial overlap and varying densities; (3) time eﬃciency; (4) sensitivity to initialization. We scale all the test point clouds into a unit cube before testing to better illustrate the value of evaluation metrics. Furthermore, for every algorithm tested, we align the geometrical centers of the source and target point clouds before registration. This does not eliminate the necessity of translation estimation, but will improve the performance of local optimization-based algorithms, such as FLS, CSR, and ICP. For our algorithm, we deﬁne a cube with each dimension being [−1, 1] as the rectangular space in (10) to bound both the source and target point clouds. Some categories in the ModelNet40 dataset contain objects that have more than one correct registration results, for example, any rotation around the center axis of a water bottle can be considered as correct. Thus, we choose ten categories of objects with no obvious symmetric geometry as the test data: airplane, bed, chair, desk, guitar, mantel, monitor, piano, sofa, stairs. Furthermore, since DeepGMR is trained on the ModelNet40 dataset, we select the ﬁrst ﬁve objects from the test set of the ten selected categories, which create a library of 50 objects for our experiments. In addition, we demonstrate using our algorithm for real world object localization from noisy RGB-D point cloud. We use ﬁve metrics to evaluate the performance of each algorithm tested: (1) rotation error; (2) translation error; (3) running time; (4) failure rate; (5) exact recovery rate. A registration result is considered failed if the rotation error is larger than 45◦ or translation error is larger than 0.5 m, such as when the algorithm gets stuck at a poor local minimum or has completely wrong correspondences. We exclude these results when computing the mean and standard deviation of rotation and translation error. A registration result is considered an exact recovery if the rotation error is smaller than 5◦ and the translation error is smaller than 0.03 m. 5.4

Robustness Against Noise

For each object, we generate a point cloud with 1024 points from its CAD model3 , and then scale the point cloud to a unit cube. For ground-truth rotation, we ﬁrst uniformly sample a vector u = [u1 , u2 , u3 ] ∈ R30+ , then we uniformly sample an angle θ from [− π2 , π2 ] to generate a random rotation matrix R = exp([uθ]× ). We uniformly sample the ground-truth translations from [1, 2] for each axis4 . 3 4

We use the pcl obj2ply tool from the PCL library [35]. Due to the alignment of geometrical centers of source and target point clouds in preprocessing, all the tested algorithms estimate the translation within a unit cube.

Scale-Invariant Fast Functional Registration

163

Fig. 2. Left: Our algorithms reaches a similar level of accuracy and robustness as CSR, but is an order of magnitude faster and achieves real-time registration speed (< 0.1 s per registration). Middle: FLS outperforms FPFH feature-based registration algorithms with superior robustness against noise. TEASER−Scale has 100% failure rate due to lack of suﬃcient correspondences. Right: FLS outperforms other correspondence-based registration algorithms on both robustness against noise and computation eﬃciency. The runtime curves of some algorithms overlap due to the high runtime of Go−ICP. (Note: failed registration results are omitted from all rotation and translation plots.)

Lastly, we randomly shuﬄe the source cloud before each test. We add noise sampled from zero-mean normal distributions with standard deviation varying from 0.01 to 0.05. For all algorithms we initialize the estimation as an identity matrix (no rotation and translation). To test the scale-invariance of our algorithm, we randomly sample scale factor between 2 to 5 and apply our algorithm and TEASER (labeled as FLS−Scale, FLS−ICP−Scale, and TEASER−Scale, respectively) to recover rotation and translation with unknown scale. For algorithms tested we initialize the scale

164

M. Sun et al.

estimation as 1. The experimental results are shown in Fig. 2. Note that, even though FLS−ICP−Scale and TEASER−Scale are tested with unknown scale, we present the results alongside tests with known scale for comparison purposes. 5.5

Robustness Against Partial Overlap and Varying Densities

Fig. 3. Example of a synthetic point cloud measurement (red) and the dense noiseless point cloud (green) generated from the CAD model.

To evaluate the performance of the tested algorithms on point clouds with partial overlap and varying densities, we design the task of registering an object from CAD model to noisy real world measurement. For every object tested in the last experiment on robustness against noise, we ﬁrst generate a uniformly covered, dense, noiseless point cloud with 4096 points from the CAD model, then we simulate a “real world” point cloud measurement of the same object by synthesising a sparse point cloud from only three views. This synthesized point cloud measurement only contains 512 points, and we add normally distributed noise with zero mean and standard deviation of 0.02. We choose this noise level to match the commonly observed real world noise from commercial RGB-D cameras. Furthermore, since the synthesized point cloud measurement is generated from only three views, it will not cover the whole surface of the object and the covered areas will have varying densities. An example of the synthetic point cloud measurement is shown in Fig. 3. Similar to the last experiment on robustness against noise, we also generate random scale from [2, 5] to test the scale-invariance of our algorithm. The experiment results are shown in Table 1. 5.6

Time Eﬃciency

To test the time eﬃciency of the algorithms, we generate point clouds with numbers of points varying from 1000 to 10000 and record the registration time. The results are reported in Fig. 4. We did not test time eﬃciency with Go−ICP due to the high computation time of the Branch-and-Bound scheme, which is already reﬂected in the last two experiments. 5.7

Sensitivity to Initialization

The FLS objectives (11) and (16) are not convex, in practice we found the scale estimation can always converge to a good minimum, but it is not the case for rotation and translation registration. To better understand how the inﬂuence

Scale-Invariant Fast Functional Registration

165

Table 1. When registering point clouds with partial overlap and varying densities, FLS−ICP shares the highest recovery rate with the globally optimized Go−ICP, while being two orders of magnitudes faster. Our algorithm’s rotation and translation error is also close to the top. The running time of FLS increases with unknown scale because of the high computation complexity of generating TRIMs. (Note: failed registration results are omitted from all rotation and translation results. We do not report TEAER−Scale due to its 100% failure rate.) Rotation error (deg.)

Translation Runtime error (meter) (second)

Exact Failure recovery rate rate

FLS

9.22 ± 9.33

0.09 ± 0.07

0.16 ± 0.05

8%

10%

FLS−ICP

2.57 ± 3.03

0.04 ± 0.06

0.17 ± 0.05

56%

8%

FLS−Scale

9.56 ± 9.04

0.10 ± 0.08

2.88 ± 0.68

4%

10%

FLS−ICP−Scale 2.35 ± 1.42

0.06 ± 0.07

2.89 ± 0.68

32%

8%

38%

CSR

2.08 ± 1.14 0.04 ± 0.07

8.80 ± 16.82

DeepGMR

22.02 ± 12.76 0.16 ± 0.12

0.007 ± 0.002 2%

62%

TEASER

43.13 ± 1.69 0.26 ± 0.03

0.25 ± 0.08

0%

96%

ICP

3.32 ± 3.75

0.012 ± 0.002

50%

6%

0.06 ± 0.09

10%

Go−ICP

2.20 ± 1.80

0.04 ± 0.07

64.44 ± 28.43

58%

4%

RANSAC−ICP

2.12 ± 0.88

0.03 ± 0.03

0.09 ± 0.03

28%

54%

Runt im e (Seconds)

Runt im e vs. Num ber of point s 1.0

FLS ( Our s)

0.5 0.0 60

TEASER

40

20

FLS-Sca le ( Ou r s)

400

10

200

0

0

15

TEASER-Scale

0.04

CSR

0.012 0.008 0.006

ICP

10

20

0 1K 2K 3K 4K 5K 6K 7K 8K 9K 10K

1K 2K 3K 4K 5K 6K 7K 8K 9K 10K

0.00

0.20 0.15

RANSAC-ICP

0.10

0.02

5

0

DeepGMR

0.010

0.05 0.00 1K 2K 3K 4K 5K 6K 7K 8K 9K 10K

1K 2K 3K 4K 5K 6K 7K 8K 9K 10K

Num ber of point s

Fig. 4. Experiment results on time eﬃciency, we ﬁt dashed straight line to every curve. FLS (with known scale) exhibits a linear time complexity. FLS with unknown scale exhibits a superlinear time complexity due to the high computation cost of generating TRIMs, but it is still faster than CSR and TEASER (with known and unknown scale).

from the non-convexity of the objective (11), we generate ground-truth rotation from rotation angles varying from π6 to π and initialize all algorithms with an identity matrix (no rotation and translation). We compare our algorithm with other local registration algorithms: CSR and ICP. We report each algorithm’s exact recovery rate, failure rate and average rotation error in Fig. 5. 5.8

Real World Application

We use a RealSense D435 camera and the ROS package rtabmap ros to generate real-world test point clouds. We generate two point clouds of the same test

166

M. Sun et al.

Fig. 5. Experiment results on sensitivity to initialization. All three local registration algorithms tested share similar sensitivities to initialization.

Fig. 6. FLS is able to register all objects (hat, controller, headphones, phone, glove) in the test environment, while TEASER exhibits large rotation error. This experiment further demonstrates the benefits of the correspondence-free nature of our algorithm.

environment, as shown in Fig. 6(a). First, we translate and rotate the second cloud away from the ﬁrst cloud. Then, we crop the point clouds to each object using the RGB data. This process generates ﬁve objects for registration from each test environment point cloud. In Fig. 6 (b) (c), the objects from the ﬁrst point cloud are shown in red and those from the second point cloud are shown in blue. We register each object individually and show all of the objects aligned on the original (not color-ﬁltered) point cloud. Note that the point clouds are generated from diﬀerent motions of the camera, thus there are no exact pointto-point correspondences. The results are shown in Fig. 6.

6

Conclusions

In this work, we propose a new scale-invariant functional registration algorithm with a new distance metric between unordered point sets (delta-distance). We show that our algorithm has linear complexity with respect to number of points and can be an order of magnitude faster than state-of-the-art functional registration algorithm, without compromising accuracy and robustness. Our algorithm also outperforms state-of-the-art correspondence-based algorithms on accuracy and robustness even with unknown scale and partial overlap. In the future, we will look into improving the robustness of the delta-distance and explore using the delta-distance as the loss function in learning-based pipelines.

Scale-Invariant Fast Functional Registration

167

Acknowledgements. This material is supported by NSF Grant CNS-1837515, ONR Grant N00014-21-1-2706, and HRI Grant HRI-001479. Any opinions, ﬁndings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reﬂect the views of the aforementioned institutions.

References 1. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: Posecnn: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes (2018) 2. Balakrishnan, G., Zhao, A., Sabuncu, M.R., Guttag, J., Dalca, A.V.: VoxelMorph: a learning framework for deformable medical image registration. IEEE Trans. Med. Imaging 38(8), 1788–1800 (2019) 3. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: ScanNet: richly-annotated 3D reconstructions of indoor scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2432–2443 (2017) 4. Yokozuka, M., Koide, K., Oishi, S., Banno, A.: LiTAMIN2: ultra light LiDARbased SLAM using geometric approximation applied with KL-divergence. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 11619–11625 (2021) 5. Nocedal, J., Wright, S.: Numerical Optimization. Springer, New York (2006). https://doi.org/10.1007/978-0-387-40065-5 6. Yang, H., Shi, J., Carlone, L.: TEASER: fast and certiﬁable point cloud registration. IEEE Trans. Robot. 37(2), 314–333 (2021) 7. Urbach, D., Ben-Shabat, Y., Lindenbaum, M.: DPDist: comparing point clouds using deep point cloud distance. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12356, pp. 545–560. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58621-8 32 8. Jian, B., Vemuri, B.C.: Robust point set registration using gaussian mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 33(8), 1633–1645 (2011) 9. Mathew, G., Mezi´c, I., Petzold, L.: A multiscale measure for mixing. Physica D 211(1–2), 23–46 (2005) 10. Besl, P., McKay, N.D.: A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14(2), 239–256 (1992) 11. Biber, P., Straßer, W.: The normal distributions transform: a new approach to laser scan matching. In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2743–2748 (2003) 12. Koide, K., Yokozuka, M., Oishi, S., Banno, A.: Voxelized GICP for fast and accurate 3D point cloud registration. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 11054–11059 (2021) 13. Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (FPFH) for 3D registration. In: IEEE International Conference on Robotics and Automation, pp. 3212–3217 (2009) 14. Steder, B., Rusu, R.B., Konolige, K., Burgard, W.: Point feature extraction on 3D range scans taking into account object boundaries. In: IEEE International Conference on Robotics and Automation, pp. 2601–2608 (2011) 15. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model ﬁtting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)

168

M. Sun et al.

´ 16. Bustos, A.P., Chin, T.-J.: Guaranteed outlier removal for point cloud registration with correspondences. IEEE Trans. Pattern Anal. Mach. Intell. 40(12), 2868–2882 (2018) ´ 17. Bustos, A.P., Chin, T., Neumann, F., Friedrich, T., Katzmann, M.: A Practical Maximum Clique Algorithm for Matching with Pairwise Constraints, arXiv, vol. abs/1902.01534 (2019) 18. Thomas, H., Qi, C.R., Deschaud, J.E., Marcotegui, B., Goulette, F., Guibas, L.: KpConv: ﬂexible and deformable convolution for point clouds. In: IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6410–6419 (2019) 19. Tinchev, G., Penate-Sanchez, A., Fallon, M.: SKD: keypoint detection for point clouds using saliency estimation. IEEE Robot. Autom. Lett. 6(2), 3785–3792 (2021) 20. Zeng, A., Song, S., Nießner, M., Fisher, M., Xiao, J., Funkhouser, T.: 3DMatch: learning local geometric descriptors from RGB-D reconstructions. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 199–208 (2017) 21. Gojcic, Z., Zhou, C., Wegner, J.D., Wieser, A.: The perfect match: 3D point cloud matching with smoothed densities. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5540–5549 (2019) 22. Yuan, W., Eckart, B., Kim, K., Jampani, V., Fox, D., Kautz, J.: DeepGMR: learning latent gaussian mixture models for registration. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12350, pp. 733–750. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58558-7 43 23. Jian, B., Vemuri, B.: A robust algorithm for point set registration using mixture of gaussians. In: IEEE International Conference on Computer Vision (ICCV), vol. 2, pp. 1246–1251 (2005) 24. Bernreiter, L., Ott, L., Nieto, J., Siegwart, R., Cadena, C.: PHASER: a robust and correspondence-free global point cloud registration. IEEE Robot. Autom. Lett. 6(2), 855–862 (2021) 25. Clark, W., Ghaﬀari, M., Bloch, A.: Nonparametric continuous sensor registration. J. Mach. Learn. Res. 22(271), 1–50 (2021) 26. Jadidi, M.G., Clark, W., Bloch, A., Eustice, R., Grizzle, J.W.: Continuous direct sparse visual odometry from RGB-D images. In: Proceedings of Robotics: Science and Systems (2019) 27. Zhang, R., et al.: A new framework for registration of semantic point clouds from stereo and RGB-D cameras. In: IEEE International Conference on Robotics and Automation (ICRA), pp. 12214–12221 (2021) 28. Mallat, S., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41(12), 3397–3415 (1993) 29. Tancik, M., et al.: Fourier features let networks learn high frequency functions in low dimensional domains. In: Advances in Neural Information Processing Systems, pp. 7537–7547 (2020) 30. Sun, M., Pinosky, A., Abraham, I., Murphey, T.: Scale-Invariant Fast Functional Registration, arXiv (2022) 31. Aoki, Y., Goforth, H., Srivatsan, R.A., Lucey, S.: PointNetLK: robust & eﬃcient point cloud registration using PointNet. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019) 32. Agarwal, S., Mierle, K., et al.: Ceres Solver. http://ceres-solver.org 33. Yang, J., Li, H., Campbell, D., Jia, Y.: Go-ICP: a globally optimal solution to 3D ICP point-set registration. IEEE Trans. Pattern Anal. Mach. Intell. 38(11), 2241–2254 (2016)

Scale-Invariant Fast Functional Registration

169

34. Wu, Z., et al.: 3D shapenets: a deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015) 35. Rusu, R.B., Cousins, S.: 3D is here: point cloud library (PCL). In: IEEE International Conference on Robotics and Automation (2011)

Towards Mapping of Underwater Structures by a Team of Autonomous Underwater Vehicles Marios Xanthidis1 , Bharat Joshi2(B) , Monika Roznere3 , Weihan Wang4 , Nathaniel Burgdorfer4 , Alberto Quattrini Li3 , Philippos Mordohai4 , Srihari Nelakuditi2 , and Ioannis Rekleitis2 1

SINTEF Ocean, 7010 Trondheim, Norway [email protected] 2 University of South Carolina, Columbia, SC 29208, USA [email protected], {srihari,yiannisr}@cse.sc.edu 3 Dartmouth College, Hanover, NH 03755, USA {monika.roznere.gr,alberto.quattrini.li}@dartmouth.edu 4 Stevens Institute of Technology, Hoboken, NJ 07030, USA {wwang103,nburgdor,pmordoha}@stevens.edu Abstract. In this paper, we discuss how to eﬀectively map an underwater structure with a team of robots considering the speciﬁc challenges posed by the underwater environment. The overarching goal of this work is to produce high-deﬁnition, accurate, photorealistic representation of underwater structures. Due to the many limitations of vision underwater, operating at a distance from the structure results in degraded images that lack details, while operating close to the structure increases the accumulated uncertainty due to the limited viewing area which causes drifting. We propose a multi-robot mapping framework that utilizes two types of robots: proximal observers which map close to the structure and distal observers which provide localization for proximal observers and bird’s-eye-view situational awareness. The paper presents the fundamental components and related current results from real shipwrecks and simulations necessary to enable the proposed framework, including robust state estimation, real-time 3D mapping, and active perception navigation strategies for the two types of robots. Then, the paper outlines interesting research directions and plans to have a completely integrated framework that allows robots to map in harsh environments.

Keywords: Underwater Localization

1

· Multi-robot · Navigation · Mapping ·

Introduction

Underwater structure mapping is an important capability applicable to multiple domains: marine archaeology, infrastructure maintenance, resource utilization, security, and environmental monitoring. The underwater environment is challenging and dangerous for humans in many ways, while robotic operations face c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 170–185, 2023. https://doi.org/10.1007/978-3-031-25555-7_12

Shipwreck Mapping

171

additional challenges compared to the above-water ones. In particular, sensing and communications are restricted and planning is required in three dimensions based on limited information. Current approaches for underwater autonomous operations are limited to hovering at a distance [1–3], possibly resulting in occluded views and coarse models; operating autonomously near an underwater structure, thus obtaining high-resolution images, has been impossible so far. The overarching goal of this work is to create a 3D model of the underwater structure providing a high-resolution photo-realistic representation. To achieve this goal, we propose a framework that considers a team of robots operating in close cooperation. Some Autonomous Underwater Vehicles (AUVs), termed proximal observers, will be operating close Fig. 1. Aqua2 autonomous underwater to the underwater structure generating a vehicle exploring the top structure of dense vision-based 3D reconstruction of the stavronikita shipwreck, barbados. the observed surface; see Fig. 1 where an Aqua2 vehicle swims over the deck of a wreck. The rest of the robots, termed distal observers, will operate further out maintaining the global picture of the underwater structure and the pose of the proximal observers with respect to the structure. In this paper, we discuss the fundamental components we have developed that contribute to the realization of the above framework. The distal observer monitors the relative pose of the proximal observer utilizing a Cooperative Localization (CL) [4,5] scheme. The proximal observer maintains its current pose estimate even in the face of sensor failures. The visual data from the proximal observer are integrated into a 3D map, utilizing either real-time dense depth map fusion or a formulation of photometric stereo. Finally diﬀerent motion strategies are employed by the distal and proximal observers to maximize target visibility. Experimental results for each component from deployments over a shipwreck are presented together with simulations in a realistic 3D robotic simulator (Gazebo [6]). These results highlight the potential of the proposed approach and provide insights on interesting research directions for mapping in harsh environments and on future plans for full system integration.

2

Related Work

Many approaches utilize visual and visual/inertial data for estimating the pose of a robot [7–10]. However, evaluations on a variety of underwater datasets have demonstrated the challenges of the underwater domain [11,12]. Even when the integration of multiple sensors produces consistent estimations [13], operating around a 3D structure often results in loss of tracking when no unique visual features are present. An alternative to estimating the state of a moving robot is relative localization from another robot [4,5,14]. This is a challenging problem underwater due to limited visibility and potential occlusions.

172

M. Xanthidis et al.

Active perception, which was ﬁrst introduced in the context of exploration [15–17], enables robots to actively map the environment. Due to the challenges of the underwater domain, there are only few active perception applications aiming to minimize uncertainty during coverage [18,19] or explore interesting features [20] in simplistic environments with respect to obstacles. Advancements in deep learning have produced robotic systems that move freely while observing areas of interests [21–23] based on datasets collected by human operators. By construction, these systems are limited by the agility of the operator, the domain of the training set, and the excessive data needed for the production of such frameworks. Past works utilized well-established sampling-based techniques which provide strong guarantees [24] but are computationally expensive. On the other hand AquaVis [25], our previous work, proposed a lightweight realtime framework based on path-optimization that can navigate safely a complex 3D environment and at the same time observe multiple visual objectives using an arbitrary multi-sensor conﬁguration. Several active sensing approaches for 3D mapping require enumerating and simulating sensing from diﬀerent discrete 6-D pose hypotheses [26–29] at high computational cost; other approaches are limited to 2-D slices of constant depth or height [30,31] or they require the use of rough initial models [32,33]; in addition, most operate on occupancy grids reducing the resolution of the reconstructed surfaces drastically. Exploration strategies [34,35] that guide a vehicle towards frontier voxels without requiring sampling in pose space are closely related to our work, but they are limited to a single robot, and require a prior map. Multi-robot 3-D reconstruction methods have been presented [36], but robots are distributed in space to map independently without tight cooperation, and operate at distance to the target structure [37,38].

3

Proximal-Distal Mapping Framework

Our proposed mapping framework relies on proximal and distal observers to overcome the inherent challenges of the underwater domain. Figure 2 shows the full envisioned process. Here, we discuss each fundamental component that will enable the mapping by these two types of robots, highlighting the current results. For grounding our discussion, we refer to the speciﬁc underwater robots used, although the components and framework can be generalized. The main target vehicle is the Aqua2 AUV [39]. Aqua2 utilizes the motion of six ﬂippers, each one independently actuated by an electric motor, to swim. Aqua2 has 6 DOF, of which ﬁve are controllable: two directions of translation (forward/backward and upward/downward), along with roll, pitch and yaw. The robot’s primary sensor is vision, more speciﬁcally three iDS USB 3.0 UEye cameras: two facing forward and one in the back. Aqua2 also has a pressure sensor and an IMU which are used for controlling the motions and can be utilized for visual-inertial state estimation [8,13,40–42].

Shipwreck Mapping

173

Fig. 2. Two AUVs exploring a wreck. Inserts present the view of each observer: the distal observer, in purple, keeps a large portion of the wreck and the proximal observer in view. The proximal observer, in red, has a close-up view of the wreck.

A lower-cost robot considered is the BlueROV2, a thruster-vectored robot that has an inexpensive sensor suite composed of: a monocular camera, an IMU, and a pressure sensor. 3.1

Robust State Estimation

A major challenge underwater is robust robot state estimation, given the lack of global localization infrastructure. In addition to many underwater challenges such as lighting variations, limited visibility, and color absorption by distance, when AUVs operate around underwater structures they often encounter complete loss of visual tracking due to the ﬁeld of view facing only open water, or a featureless surface such as a sandy bottom. Utilizing a dataset collected over a ship- Fig. 3. The view of the Aqua2 AUV just wreck by an Aqua2 AUV we report aver- before traveling over the starboard side age time for loss of tracking on some of of the Stavronikita wreck. the most common VO/VIO software packages. As can be seen from Table 1, most of the packages get completely lost when the robot reaches the starboard side of the deck; see Fig. 3 for the view from the Aqua2 AUV about to travel over the railings at the starboard side of the deck; the image is dominated by blue water. There are two major components that are necessary for proximal and distal observers to coordinate: 1) relative pose estimation so that the local observations can be mapped into a global reference frame; 2) robust single AUV pose estimation so that each robot can localize.

174

M. Xanthidis et al.

Table 1. Performance of popular open-source VIO packages on the wreck dataset. The root mean squared ATE compared to COLMAP trajectory after se3 alignment. Algorithm

Time to ﬁrst Recovery? RMSE (in m) track loss (in sec)

OKVIS [7]

23.4

VINS-Fusion [43] SVIn2 [44]

Partial

5.199

23.6

Partial

53.189

23.4

Yes

1.438

Robust Switching Estimator N/A

Yes

1.295

Relative Pose Estimation. To locate robots in a common reference frame, we formulate a cooperative localization framework, where robots can estimate their relative pose. A major challenge underwater is the lack of ground truth, as setting up a motion capture system is prohibitively complicated. As a result, most learning based approaches face a shortage of training data. We employ a novel approach by Joshi et al. [45] utilizing a Generative Adversarial Network (GAN) [46] to train on simulated images, where the simulator provides the pose of the AUV, and then test on real images. Estimating a number of ﬁxed points on the AUV in conjunction with the vehicle’s geometry and a calibrated camera yields accurate estimates of the 3D pose of the observed AUV; for details please refer to [45]. Robust AUV Pose Estimation. In order to address this common challenge, a novel estimator robust to VIO failures outlined here. A model-based estimator is employed in conjunction with SVIn2 [44], an accurate VIO package in a robust switching estimator framework. The proposed estimator monitors VIO health based on the number of features tracked, their spatial distribution, feature quality, and their temporal continuity. When health deteriorates below a certain threshold, the model-based estimator is utilized, initialized at the last accurate pose of the VIO system. When VIO recovers and features are tracked again, the estimator switches back to SVIn2 which is itself initialized to the corresponding model-based estimator pose. The result is a sequence of segments (model-based and VIO) maintaining a consistent pose through the whole trajectory. In the event of loop closure, the corrections are propagated, through the pose graph, to the complete trajectory. Results: Some of the most popular open-source packages were tested on a sequence collected over a wreck. COLMAP [47,48] was used to estimate reference camera trajectory from the image sequence. The resulting reconstruction from COLMAP can be seen in Fig. 4(a). As can be seen in Table 1, the state-of-the art methods lost track when the AUV faced blue water with no visible structure. SVIn2 managed to recover due to loop closure, while OKVIS without loop closure drifted much further and VINS-fusion drifted even further, with respect to the COLMAP reference trajectory, at 53 meters. Our proposed switching estimator

Shipwreck Mapping

175

managed to keep track throughout the trajectory with the lowest RMSE error. Figure 4(a) presents the sparse reconstruction from COLMAP together with the estimated poses. When the AUV was facing blue water, the camera pose was not tracked. In contrast, the switching estimator managed to accurately track the trajectory throughout as can be seen in Fig. 4(b).

Fig. 4. (a) Wreck reconstruction using COLMAP together with the estimated camera poses [47, 48]. (b) Trajectory estimation utilizing a switching estimator using SVIn2 [44] and a model based estimator.

3.2

Photorealistic Reconstruction

With a robust state estimate, the robots can create a 3D map. Here, we discuss components that enable real-time dense 3D mapping within the proximal/observer framework: 1) using just a stereo camera; 2) using lights to have a more robust 3D map. Real-Time Dense 3D Mapping. Dense surface reconstruction relies on stereo matching across the left and right camera of the proximal observer and on fusing multiple depth maps to achieve improved robustness and accuracy. To achieve real-time performance, we decompose dense surface estimation in stereo matching and depth map fusion modules with constant computational complexity. The core of our dense 3D reconstruction pipeline is a binocular stereo matching module which estimates depth for the pixels of the left image from a rectiﬁed stereo pair of images. We are able to process image pairs at several frames per second on the CPU using a publicly available multi-threaded implementation.1 1

https://github.com/kbatsos/Real-Time-Stereo.

176

M. Xanthidis et al.

Fig. 5. (a)–(b): input images. (c)–(e): input depth maps for fusion. (f): fused depth map.

Stereo matching operates by assigning a cost or score to each possible disparity2 that can be assigned to a given pixel of the reference image, typically the left. (We will use cost in the remainder without loss of generality.) Cost is computed in square windows centered around the pixels under consideration. All matching costs are stored in a cost volume with dimensions equal to the width and height of the images and the number of disparity candidates for every pixel. (The maximum disparity corresponds to the minimum depth of interest, while minimum disparity can be set to 0.) The cost volume can be optimized to impose piece-wise smoothness and extract accurate depth maps via the widely used Semi-Global Matching algorithm (SGM) [49]. Here, we integrate the rSGM implementation of Spangenberg et al. [50] into the stereo matching code. Finally, disparity is converted to depth using the known baseline and focal length of the cameras. Sub-pixel precision is obtained by ﬁtting a parabola to the minimum cost and its two neighboring values [51]. To support the subsequent depth map fusion module we associate a conﬁdence value to each depth estimate. To this end, we adopt the PKRN measure [52], which is the ratio of the second smallest over the smallest cost for a given pixel after SGM optimization. An example of a pair of input images and the resulting depth map can be seen in Fig. 5. Depth maps estimated by the stereo matching module suﬀer from artifacts due to lack of texture, occlusion, and motion blur. Assuming that errors do not persist over multiple frames, we propose to improve the depth maps by fusing them. 2

Disparity is deﬁned as the diﬀerence between the horizontal coordinates of two potentially corresponding pixels in the same epipolar line (scanline) in the left and right image. Disparity is inversely proportional to depth.

Shipwreck Mapping

177

Fig. 6. Constraints used in depth map fusion. Points A, B and C are depth candidates either estimated for the reference view directly or rendered to it from other views. The solid orange polyline is the cross section of the surface estimated by view k. Left: point A is supported by the orange surface. Middle: point B is occluded by B’ which is in front of B in the ray of the reference view. Right: point C violates the free space of C’ on the ray of view k. (Color ﬁgure online)

The principle behind depth map fusion is that, as long as individual overlapping depth maps produce either relatively consistent 3D estimates for the same part of the scene or uncorrelated noisy estimates, measuring the consensus and conﬂicts among depth estimates allows us to improve the accuracy of the correct estimates and to reject outliers. To achieve real-time, scalable 3D reconstruction, our approach operates in sliding window fashion, keeping a small number of recent depth maps in memory at a given time. This decomposition allows the pipeline to operate at constant speed regardless of the size of the scenes and the number of frames that have been collected. At each time step, the middle depth map in the sliding window is used as reference and the remaining depth maps are rendered onto it along with the corresponding conﬁdence maps. In this paper, we adopt visibilitybased fusion from our previous work [53,54]. The setting here is more challenging since the input depth maps are estimated from two images only, and are thus more susceptible to occlusion. The input for computing a fused depth map for a given reference view is a set of Nf depth maps and the corresponding conﬁdence maps. The fusion process Fig. 7. Shipwreck’s partial point cloud begins by rendering the depth and confrom the stereo dense 3D mapping ﬁdence maps to the reference view yield- pipeline. ing a new set of Nf depth and conﬁdence maps from the perspective of the reference view. At the end of the rendering stage, we have at most Nf depth candidates per pixel of the reference view as depths may project out of bounds. For each depth candidate dj , we accumulate support and visibility violations. Support comes from other depth candidates for the same pixel that are within a small distance of dj . dj is then replaced by the conﬁdence-weighted average of the supporting

178

M. Xanthidis et al.

depths. The conﬁdence of the fused depth estimate dj is set equal to the sum of the supporting conﬁdences. See Fig. 6 (left). Visibility violations are of two types: occlusions and free space violations. An occlusion occurs when dj appears behind a rendered depth map from view k, Dkr on the ray of the reference view, as in Fig. 6 (middle), while a free space violation occurs when dj appears in front of an input depth map Dl on the ray of view l, as in Fig. 6 (right). For each detected violation, we penalize the conﬁdence of dj by subtracting the conﬁdence of the conﬂicting depth estimate. At the end, we assign to each pixel the depth candidate with the maximum fused conﬁdence, as long as it is positive. (We can also apply larger thresholds on the conﬁdence to remove noisy depths.) Because processing is independent per pixel, it can be performed in parallel, with the most computationally expensive step being rendering to the original depth maps to detect free space violations. An example is shown in Fig. 5, while a point cloud made of a several fused depth maps is shown in Fig. 7. Photometric Stereo Mapping. Below 20–30 m deep in the water column, the sun’s rays diminish signiﬁcantly. To suﬃciently illuminate the scene, AUVs are commonly equipped with independently controllable lights. Interestingly, what the AUVs perceive through their camera-imagery while their lights are on/oﬀ can provide information on the 3D structure and albedo of the scene. The problem of estimating the visible scene given images illuminated by light sources is called the photometric stereo (PS) problem [55,56]. The main principle of the PS algorithm is that a surface point’s albedo and normal can be recovered by modeling the changes in that surface point’s reﬂectance under various lighting source orientations. From previous work [57,58], four images and their light correspondences ensure that surface points are illuminated suﬃciently and uniquely. This conﬁguration can be a very simple model to solve with the assumption that the camera never moves, the light orientations are known, and the surface material of the object in focus is also known. However, this assumption does not hold since the AUV is traveling underwater. We expand the PS algorithm to address the in-water light behavior as well as the AUV non-stationarity characteristic, that the camera is never still. Our complete model allows the AUV, with at least one camera, to estimate high-resolution 3D models of the scene by ﬂicking on four diﬀerent lights in sequence and capturing their respective images. When light travels through water, it is continuously attenuated over distance as it collides with diﬀerent particles, characterized by the uniqueness of the water-body’s properties. Therefore, the ﬁnal image captured by the camera is composed of the direct signal – the light that traveled from a light source, interacted with the scene, and reﬂected back to the image sensor – and backscatter – the light that traveled from a light source and reﬂected back from particles not part of the scene. We refer to our prior work [59] for a deeper explanation of the image formation model. The attenuation parameters in the image formation model can be calibrated prior to deployment with a color chart or a black and white marker.

Shipwreck Mapping

179

How the light reﬂects from the scene depends on the object’s material. Most underwater PS works assume that the underwater scene is comprised of Lambertian-type surfaces. Simply, the amount of light reﬂected from the surface is independent of the viewing direction. It is only dependent on the incoming light’s intensity and the angle between the light’s direction and the surface normal. The PS objective function is referred to as photometric consistency [58]: NI (|Ii − Ii (n, Z)|) (1) o(n, Z) = i=1 NI As the name dictates, the goal is to minimize the diﬀerence between the predicted I and observed I pixel values of a scene point in the set of NI images, by estimating the scene point’s surface normal n and distance from camera Z. As the AUV is non-stationary, the cameras’ and collocated lights’ changing orientations need to be considered accordingly and fed into the PS model. If the AUV has only one camera, the PS framework requires additional pose estimation information that could be estimated using a SLAM system (e.g., Monocular ORB-SLAM [60]). In our case with Aqua2 equipped with two cameras, pose can be estimated as presented above in Sect. 3.1. The estimated depth maps provided by the stereo cameras can be utilized as initial guesses for the PS model – thus, the PS model can be used to improve the overall 3D scene modelling capabilities. By solving for the unknowns in the non-stationary PS model, one can derive the 3D scene or the distances of the scene points from the camera.

Fig. 8. Non-stationary photometric stereo model results given AUV with one camera and pose estimation and initial depth map from ORB-SLAM [60].

Results: Simple tests were performed with an AUV (BlueROV2) integrated with a single camera and four lights. Pose estimates and initial depth maps are provided via ORB-SLAM [60]. Figure 8 shows the non-stationary photometric stereo modeling results of a synthetic rock viewed at its edge. Eight images were captured, four of them with lights on and four corresponding lights oﬀ. As ambient light was present, the images with lights oﬀ are integral for subtracting the ambient illumination within the PS model. These results show the capability of using lights to improve the 3D modeling of scenes when initial depth maps and pose estimations are available.

180

M. Xanthidis et al.

3.3

AUV Navigation

With localization and mapping techniques, the next goal is to eﬀectively coordinate the proximal and distal observers for a fully integrated system. The two AUVs start at a location near the wreck, with the proximal observer in the front and the distal observer behind. Initially, the two robots convoy together [61] approaching a target structure such as a wreck; the proximal leads and the distal follows keeping the proximal inside its ﬁeld of view. When the two robots have reached the starting positions, they start mapping the target structure in a collaborative way by assuming two diﬀerent behaviors: the proximal observer focuses on covering the target structure in close proximity collecting high resolution observations, while the distal observer focuses on tracking the proximal observer from a larger distance maintaining a more informative and macroscopic perspective of the general structure and aiding the localization of the proximal robot simultaneously. Multi-robot operations in the proposed scheme could expand from reactive coordination for each robot independently by assuming absence or very limited explicit communication, to high-level deliberative collaboration with increased communication capabilities. Addressing the speciﬁcs of the communication (acoustic or light-based) is an interesting future research direction; here, we discuss how the coordinated navigation between proximal and distal observers can be adapted according to diﬀerent levels of communication. Proximal Observer Navigation. As a base case, we assume no communication from the distal to the proximal observer, thus the proximal observer will greedily attempt to cover the entirety of the target structure. Several exploration or mapping strategies could be deployed where the proximal observer actively decides which areas to visit with an informed global planner, but for the sake of simplicity in order to showcase the fundamental concept of the proximal observer, the robot will follow a predeﬁned lawnmower pattern in close proximity to the structure. The proximal observer will have to operate in close proximity to highly unstructured environments through a complex terrain, therefore, instead of blindly reaching local goals, it has to avoid obstacles. For this purpose, it can utilize AquaNav [62], a state-of-the-art real-time vision-based underwater navigation framework developed in our previous work that enabled underwater robots to navigate safely through challenging 3D environments. Though, since the main purpose of the proposed framework is to map and not just navigate challenging terrains, to maximize observations of the target structure, AquaVis [25], an extension of AquaNav that performs active perception on automatically extracted visual objectives, is employed instead. Distal Observer Navigation. The distal observer should move in a way that will keep a distance from both the target structure and the proximal observer while following the latter. To follow the proximal observer, the distal observer in absence of communication could utilize motion predictors. On the other hand,

Shipwreck Mapping

181

assuming limited communication, the proximal observer could publish the trajectory it plans to follow, allowing for more informed decisions. Then, the distal observer could employ the strong capabilities of AquaVis to track the proximal observer by processing as visual objectives the future positions of the target robot, while also avoiding collisions with other potential objects. At this stage, AquaVis was modiﬁed to consider only the expected position of the proximal observer at each corresponding state and produce a solution that ensures visibility of the target robot at all times, assuming no occlusions. Finally, for simplicity the distal observer will be moving towards global waypoints in a similar pattern as the proximal observer. More deliberative and informed policies will be investigated in the future. Results: Fig. 9 presents the trajectory of the distal (purple) and proximal (red) observers as they cover the deck of a USS YP-389 shipwreck [63] model (187.5 m × 63.6 m × 29.7 m) simulated in gazebo with seabed at depth 39.7 m. The dynamics of the AUVs together with the selection of viewpoints result in a variety of poses such that visual objectives (the wreck for the proximal, and the proximal AUV for the distal) are kept in the center of the ﬁeld of view. The proximal AUV utilizes a lawnmower pattern as the starting point for AquaVis [25] while the distal observer uses the states of the planned trajectory of the proximal as the objectives. The simulation takes about 10 min with both robots trying to maintain a forward velocity of 0.5 m/s.

Fig. 9. Two AUVs exploring a wreck. Multiple snapshots combined to illustrate the two trajectories. The distal observer, in purple, keeps a large portion of the wreck and the proximal observer in view hovering above. The proximal observer, in red, utilizes a lawnmower pattern to cover the top of the wreck.

4

Conclusions, Lessons Learned, and Future Directions

While each component is necessary to achieve photorealistic mapping of underwater structures, the results presented here provide insights for the integration and research questions that are interesting to pursue:

182

M. Xanthidis et al.

– What is the trade-oﬀ between real-time 3D reconstructions and accurate ones? The underwater robots need enough information to navigate safely around the underwater structures; at the same time the more details added will contribute to denser reconstructions. We will encode hybrid hierarchical representations that can be used from the diﬀerent components. – How to optimize the AUV navigation given the number of conﬂicting optimization criteria? Examples of criteria include ﬁne-grained coverage of the structure to ensure accurate reconstruction; minimization of overlaps; ensuring that the robots’ state estimates do not diverge. We will explore Paretooptimal decision framework. – How to optimize communication between robots? Here, we did not explicitly discuss the communication bandwidth available; in general, communication is very limited underwater – with a bandwidth in the order of kb/s. It is important to identify eﬃcient data representation of the 3-D reconstruction and on a cross-layer optimization for deciding when and how to share. Once we integrate these diﬀerent components, our plan is to deploy the system in an archaeological expedition to map large shipwrecks. In this work, a novel multi-robot approach was presented for mapping of large and challenging underwater structures, such as shipwrecks, or energy and aquaculture infrastructure. The main contribution of this work is twofold: (a) to present the main components necessary for enabling such mapping – i.e., 1) robust state estimation, where we presented a robust system that is able to switch between state estimators according to their reliability, 2) dense mapping, where we presented approaches that can run in real-time and with low-cost sensor conﬁguration, 3) team coverage with distal and proximal observers – and (b) discussing the insights and interesting research questions, towards a fullyintegrated system for underwater structure mapping. Acknowledgments. The authors would like to thank the National Science Foundation for its generous support (NSF 1943205, 1919647, 2024741, 2024541, 2024653, 2144624) and the Research Council of Norway for support through the project Resilient Robotic Autonomy for Underwater Operations in Fish Farms (ResiFarm, NO-327292). We also thank Halcyon Dive Systems for their support with equipment.

References 1. Mai, C., Pedersen, S., Hansen, L., Jepsen, K.L., Yang, Z.: Subsea infrastructure inspection: a review study. In: 2016 IEEE International Conference on Underwater System Technology: Theory and Applications (USYS), pp. 71–76. IEEE (2016) 2. Maurelli, F., et al.: The PANDORA project: a success story in AUV autonomy. In: OCEANS 2016-Shanghai, pp. 1–8. IEEE (2016) 3. Palomer, A., Ridao, P., Ribas, D.: Inspection of an underwater structure using point-cloud SLAM with an AUV and a laser scanner. J. Field Rob. 36(8), 1333– 1344 (2019) 4. Rekleitis, I., Dudek, G., Milios, E.E.: On multiagent exploration. In: Vision Interface, pp. 455–461. Vancouver, BC, Canada, June 1998

Shipwreck Mapping

183

5. Kurazume, R., Hirose, S.: An experimental study of a cooperative positioning system. Auton. Robot. 8, 43–52 (2000) 6. Koenig, N., Howard, A.: Design and use paradigms for Gazebo, an open-source multi-robot simulator. In: Proceedings IROS, pp. 2149–2154 (2004) 7. Leutenegger, S., Lynen, S., Bosse, M., Siegwart, R., Furgale, P.: Keyframe-based visual-inertial odometry using nonlinear optimization. Int. J. Rob. Res. 34(3), 314–334 (2015) 8. Mourikis, A.I., Roumeliotis, S.I.: A multi-state constraint Kalman ﬁlter for visionaided inertial navigation. In: Proceedings ICRA, pp. 3565–3572. IEEE (2007) 9. Campos, C., Elvira, R., Rodr´ıguez, J.J.G., Montiel, J.M., Tard´ os, J.D.: ORBSLAM3: an accurate open-source library for visual, visual-inertial and multi-map SLAM. IEEE Trans. Rob. 6, 1874–1890 (2021) 10. Rosinol, A., Abate, M., Chang, Y., Carlone, L.: Kimera: an open-source library for real-time metric-semantic localization and mapping. In: ICRA. IEEE (2020) 11. Joshi, B., et al.: Experimental comparison of open source visual-inertial-based state estimation algorithms in the underwater domain. In: Proceedings IROS, pp. 7221– 7227 (2019) 12. Quattrini Li, A., et al.: Experimental comparison of open source vision based state estimation algorithms. In: Proceedings ISER (2016) 13. Rahman, S., Quattrini Li, A., Rekleitis, I.: SVIn2: an underwater SLAM system using sonar, visual, inertial, and depth sensor. In: Proceedings IROS (2019) 14. Roumeliotis, S.I., Rekleitis, I.: Propagation of uncertainty in cooperative multirobot localization: analysis and experimental results. Auton. Robot. 17(1), 41–54 (2004) 15. Aloimonos, J., Weiss, I., Bandyopadhyay, A.: Active vision. Int. J. Comput. Vis. 1(4), 333–356 (1988) 16. Bajcsy, R.: Active perception. Technical Report MSCIS-88-24, Un. of Pennsylvania Department of Computer and Information Science (1988) 17. Feder, H.J.S., Leonard, J.J., Smith, C.M.: Adaptive mobile robot navigation and mapping. Int. J. Rob. Res. 18(7), 650–668 (1999) 18. Frolov, S., Garau, B., Bellingham, J.: Can we do better than the grid survey: optimal synoptic surveys in presence of variable uncertainty and decorrelation scales. J. Geophys. Res. Oceans 119(8), 5071–5090 (2014) 19. Chaves, S.M., Kim, A., Galceran, E., Eustice, R.M.: Opportunistic sampling-based active visual SLAM for underwater inspection. Auton. Rob. 40, 1245–1265 (2016). https://doi.org/10.1007/s10514-016-9597-6 20. Girdhar, Y., Giguere, P., Dudek, G.: Autonomous adaptive exploration using realtime online spatiotemporal topic modeling. Int. J. Rob. Res. 33(4), 645–657 (2014) 21. Karapetyan, N., Johnson, J., Rekleitis, I.: Coverage path planning for mapping of underwater structures with an autonomous underwater vehicle. In: MTS/IEEE OCEANS - Singapore. Singapore (Virtual) (2020) 22. Karapetyan, N., Johnson, J., Rekleitis, I.: Human diver-inspired visual navigation: towards coverage path planning of shipwrecks. Mar. Technol. Soc. J. “Best of OCEANS 2020” 55(4), 24–32 (2021) 23. Manderson, T., et al.: Vision-based goal-conditioned policies for underwater navigation in the presence of obstacles. Rob. Sci. Syst. (2020) 24. Vidal, E., Palomeras, N., Isteniˇc, K., Gracias, N., Carreras, M.: Multisensor online 3D view planning for autonomous underwater exploration. J. Field Rob. 37(6), 1123–1147 (2020) 25. Xanthidis, M., et al.: Aquavis: a perception-aware autonomous navigation framework for underwater vehicles. In: Proceedings IROS, pp. 5387–5394 (2021)

184

M. Xanthidis et al.

26. Wenhardt, S., Deutsch, B., Angelopoulou, E., Niemann, H.: active visual object reconstruction using D-, E-, and T-optimal next best views. In: Proceedings CVPR (2007) 27. Potthast, C., Sukhatme, G.S.: A probabilistic framework for next best view estimation in a cluttered environment. J. Vis. Commun. Image Represent. 25(1), 148–164 (2014) 28. Kim, A., Eustice, R.M.: Active visual SLAM for robotic area coverage: theory and experiment. Int. J. Rob. Res. 34(4–5), 457–475 (2015) 29. Daudelin, J., Campbell, M.: An adaptable, probabilistic, next-best view algorithm for reconstruction of unknown 3-D objects. IEEE Rob. Autom. Lett. 2(3), 1540– 1547 (2017) 30. Fraundorfer, F., et al.: Vision-based autonomous mapping and exploration using a quadrotor MAV. In: Proceedings IROS, pp. 4557–4564. IEEE (2012) 31. Vidal, E., Hern´ andez, J.D., Isteniˇc, K., Carreras, M.: Online view planning for inspecting unexplored underwater structures. IEEE Rob. Autom. Lett. 2(3), 1436– 1443 (2017) 32. Hepp, B., Nießner, M., Hilliges, O.: Plan3D: viewpoint and trajectory optimization for aerial multi-view stereo reconstruction. ACM Trans. Graph. (TOG) 38(1), 4 (2018) 33. Smith, N., Moehrle, N., Goesele, M., Heidrich, W.: Aerial path planning for urban scene reconstruction: a continuous optimization method and benchmark. In: SIGGRAPH Asia, p. 183 (2018) 34. Shade, R., Newman, P.: Choosing where to go: complete 3D exploration with stereo. In: Proceedings ICRA, pp. 2806–2811 (2011) 35. Hollinger, G.A., Englot, B., Hover, F.S., Mitra, U., Sukhatme, G.S.: Active planning for underwater inspection and the beneﬁt of adaptivity. Int. J. Rob. Res. 32(1), 3–18 (2013) 36. Golodetz, S., Cavallari, T., Lord, N.A., Prisacariu, V.A., Murray, D.W., Torr, P.H.: Collaborative large-scale dense 3D reconstruction with online inter-agent pose optimisation. IEEE Trans. Visual Comput. Graphics 24(11), 2895–2905 (2018) 37. Kapoutsis, A.C., et al.: Real-time adaptive multi-robot exploration with application to underwater map construction. Auton. Rob. 40(6), 987–1015 (2016). https://doi.org/10.1007/s10514-015-9510-8 38. Paull, L., Seto, M., Leonard, J.J., Li, H.: Probabilistic cooperative mobile robot area coverage and its application to autonomous seabed mapping. Int. J. Rob. Res. 37(1), 21–45 (2018) 39. Dudek, G., et al.: A visually guided swimming robot. In: Proceedings IROS, pp. 1749–1754 (2005) 40. Huang, G.: Visual-inertial navigation: a concise review. In: Proceedings ICRA (2019) 41. Shkurti, F., Rekleitis, I., Scaccia, M., Dudek, G.: State estimation of an underwater robot using visual and inertial information. In: Proceedings IROS, pp. 5054–5060 (2011) 42. Rahman, S., Quattrini Li, A., Rekleitis, I.: Sonar visual inertial SLAM of underwater structures. In: IEEE International Conference on Robotics and Automation, pp. 5190–5196. Brisbane, Australia, May 2018 43. Qin, T., Cao, S., Pan, J., Shen, S.: A general optimization-based framework for global pose estimation with multiple sensors. arXiv preprint arXiv:1901.03642 (2019) 44. Rahman, S., Quattrini Li, A., Rekleitis, I.: SVIn2: a multi-sensor fusion-based underwater SLAM system. Int. J. Rob. Res. 41, 1022–1042 (2022)

Shipwreck Mapping

185

45. Joshi, B., et al.: DeepURL: deep pose estimation framework for underwater relative localization. In: IROS, pp. 1777–1784. Las Vegas, NV (2020) 46. Goodfellow, I., et al.: Generative adversarial nets. In: NeurIPS (2014) 47. Sch¨ onberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4104–4113 (2016) 48. Sch¨ onberger, J.L., Zheng, E., Frahm, J.-M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 501–518. Springer, Cham (2016). https:// doi.org/10.1007/978-3-319-46487-9 31 49. Hirschm¨ uller, H.: Stereo processing by semiglobal matching and mutual information. PAMI 30(2), 328–341 (2008) 50. Spangenberg, R., Langner, T., Adfeldt, S., Rojas, R.: Large scale semi-global matching on the CPU. In: IEEE Intelligent Vehicles Symposium Proceedings, pp. 195–201 (2014) 51. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV 47(1–3), 7–42 (2002). https://doi.org/10.1023/ A:1014573219977 52. Hu, X., Mordohai, P.: A quantitative evaluation of conﬁdence measures for stereo vision. PAMI 34(11), 2121–2133 (2012) 53. Merrell, P., et al.: Real-time visibility-based fusion of depth maps. In: ICCV (2007) 54. Hu, X., Mordohai, P.: Least commitment, viewpoint-based, multi-view stereo. In: 3DIMPVT, pp. 531–538 (2012) 55. Woodham, R.J.: Photometric method for determining surface orientation from multiple images. Opt. Eng. 19(1), 139–144 (1980) 56. Drbohlav, O., Chaniler, M.: Can two specular pixels calibrate photometric stereo? In: ICCV, vol. 2, pp. 1850–1857 (2005) 57. Tsiotsios, C., Angelopoulou, M.E., Kim, T.K., Davison, A.J.: Backscatter compensated photometric stereo with 3 sources. In: CVPR (2014) 58. Tsiotsios, C., Kim, T., Davison, A., Narasimhan, S.: Model eﬀectiveness prediction and system adaptation for photometric stereo in murky water. Comput. Vis. Image Underst. 150, 126–138 (2016) 59. Roznere, M., Quattrini Li, A.: Real-time model-based image color correction for underwater robots. In: Proceedings IROS (2019) 60. Mur-Artal, R., Montiel, J.M.M., Tard´ os, J.D.: ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Trans. Robot. 31(5), 1147–1163 (2015) 61. Shkurti, F., et al.: Underwater multi-robot convoying using visual tracking by detection. In: Proceedings IROS, pp. 4189–4196 (2017) 62. Xanthidis, M., et al.: Navigation in the presence of obstacles for an agile autonomous underwater vehicle. In: Proceedings ICRA, pp. 892–899 (2020) 63. NOAA National Marine Sanctuaries: Shipwrecks. https://monitor.noaa.gov/ shipwrecks/. Accessed 14 Apr 2022

Grasping and Manipulation

Contact-Implicit Planning and Control for Non-prehensile Manipulation Using State-Triggered Constraints ¨ un Onol ¨ 1 , Philip Long2 , and Ta¸skın Padır1 Maozhen Wang1(B) , Aykut Ozg¨ 1

Institute for Experiential Robotics, Northeastern University, Boston, MA, USA [email protected] 2 Atlantic Technological University, Galway, Ireland Abstract. We present a contact-implicit planning approach that can generate contact-interaction trajectories for non-prehensile manipulation problems without tuning or a tailored initial guess and with high success rates. This is achieved by leveraging the concept of state-triggered constraints (STCs) to capture the hybrid dynamics induced by discrete contact modes without explicitly reasoning about the combinatorics. STCs enable triggering arbitrary constraints by a strict inequality condition in a continuous way. We ﬁrst use STCs to develop an automatic contact constraint activation method to minimize the eﬀective constraint space based on the utility of contact candidates for a given task. Then, we introduce a re-formulation of the Coulomb friction model based on STCs that is more eﬃcient for the discovery of tangential forces than the well-studied complementarity constraints-based approach. Last, we include the proposed friction model in the planning and control of quasistatic planar pushing. The performance of the STC-based contact activation and friction methods is evaluated by extensive simulation experiments in a dynamic pushing scenario. The results demonstrate that our methods outperform the baselines based on complementarity constraints with a signiﬁcant decrease in the planning time and a higher success rate. We then compare the proposed quasi-static pushing controller against a mixed-integer programming-based approach in simulation and ﬁnd that our method is computationally more eﬃcient and provides a better tracking accuracy, with the added beneﬁt of not requiring an initial control trajectory. Finally, we present hardware experiments demonstrating the usability of our framework in executing complex trajectories in real-time even with a low-accuracy tracking system.

Keywords: Contact modeling

· Manipulation planning · Optimization

¨ Onol—Equal ¨ M. Wang and A. O. contribution. M. Wang—Currently at Amazon Robotics, MA, USA. ¨ Onol—Currently ¨ A. O. at Toyota Research Institute, Cambridge, MA, USA. Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25555-7 13. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 189–204, 2023. https://doi.org/10.1007/978-3-031-25555-7_13

190

1

M. Wang et al.

Introduction

Non-prehensile manipulation will be a key capability for robots in both home and industrial environments [40]. However, there is still a lack of reliable methods that can compose contact-rich motions given only a high-level goal. The main challenge is that the planning and control of contact interactions require discrete decisions concerning the time and location of contact mode transitions that impose switching constraints altering the evolution of the system dynamics. To avoid predeﬁned contact schedules, a hybrid problem using mixed-integer programming (MIP) may be solved, however, explicit modeling can be prohibitive as the complexity grows exponentially with discrete variables [33]. Fortunately, many discrete elements can be expressed in terms of continuous variables – such as the distance, velocity, and force in the contact frame – by using complementarity constraints (CCs) which provide an eﬃcient way of including bi-directional conditions in programs without the combinatorial complexity of explicitly doing so. Thus, contact dynamics is deﬁned as a smooth optimization problem with CCs in many physics engines and planning algorithms [33,38,49]. Yet, describing the rich discrete aspects of contactrelated problems using only bi-directional statements is restrictive and may be ineﬃcient. We hypothesize that a continuous model that implies uni-directional if conditions would be useful as it would provide a more modular building block. Hence, we propose the use of state-triggered constraints (STCs), ﬁrst put forward by Szmuk et al. [43], and analyze their performance for contact-implicit planning (CIP) through non-prehensile manipulation examples. 1.1

Contributions

The main contributions of our work are: – An automatic contact constraint activation (CA) method that triggers only the contact constraints useful for given task, thus reducing the problem’s sensitivity to the number of contact pairs and enabling automatic contact candidate assignment, i.e., all perceived surfaces can be assigned as contact candidates as they are handled eﬃciently in the optimization. Extensive simulations show that the proposed CA method improves the computational eﬃciency and the success rate. – We reformulate the Coulomb friction model using STCs and compare it to a baseline based on CCs for discovering tangential forces to rotate a box to random goal orientations. The proposed model is signiﬁcantly faster and more robust than the baseline. – The STC-based friction model is used to model the contact modes of planar quasi-static pushing and compared to a MIP-based controller. Our method outperforms the baseline both in speed and accuracy. Moreover, the method can plan contact-interaction trajectories given only a desired state trajectory without any heuristics. Although our method may run slower than the learning-based variant [15], we demonstrate experimentally that our STCbased control can plan and track complex trajectories at a frequency high enough for real-time applications.

Contact-Implicit Planning and Control of Pushing Using STCs

191

To the best of our knowledge, this is the ﬁrst usage of STCs in contact modeling. We show that STCs hold immense promise for modeling discrete elements in this domain and can mitigate the need for explicit combinatorics and heuristics. Our proposed CA and friction methods serve as example applications for STCs. The extensive simulation and hardware experiments prove the eﬃcacy of the proposed methodology. 1.2

Related Work

Mixed-Integer Programming. The planning and control of contact-rich motions can be achieved by building a hybrid problem with explicit discrete variables and then solved by a MIP approach. In [1,7], MIP-based planners are developed and applied to locomotion on uneven terrains. In manipulation, MIP [2,14] and exhausted tree search [8] have been used to achieve desired object behaviors. Nevertheless, these approaches typically require simpliﬁcations and/or heuristics to make them computationally tractable. [9,15,22] introduced various initialization strategies to reduce the computational burden, mostly with data-driven techniques. Contact-Implicit Trajectory Optimization. Alternatively, contactinteraction trajectories can be planned by incorporating a diﬀerentiable contact model into a trajectory optimization framework, i.e., contact-implicit trajectory optimization (CITO). [25,26] have shown complex contact-rich behaviors can be synthesized for animated characters using convex optimization with soft constraints by sacriﬁcing physical realism. In robotics applications, where motions must be physically accurate, CCs are widely used to model inelastic rigid-body contacts with Coulomb friction. In [49] the non-smooth trajectory optimization problem with impacts and discontinuities are transcribed into a bi-level program with CCs. Posa et al. [33] solved this problem simultaneously using direct collocation. Similar methods have been proposed for planar manipulation [10] and dynamic pushing [37]. In [21,24], hierarchical strategies using warm-starting to improve the computational eﬃciency are presented, while [19,32] propose methods to improve the integration accuracy of such numerical schemes. Other works have focused on replacing CCs for more eﬃcient direct programs, e.g., [6,39,48]. The optimization problem can be solved faster than direct optimization via Hessian-approximating diﬀerential dynamic programming (DDP) variants, such as iterative linear quadratic regulator (iLQR) [17], typically with smooth contact models as these approaches cannot easily handle constraints. In [27,44], smoother fragments of CCs are used with DDP variants, and the potential to run CITO as model predictive control (MPC) is demonstrated. [28] utilized an explicit smooth contact model to enable real-time MPC for highly-dynamic quadruped motions. However, these methods usually require either a good initial guess and/or tedious tuning due to the unreliable convergence of DDP methods. Recently, we proposed a CITO framework based on a variable smooth contact model [30] that decouples the relaxation from the contact model by injecting smooth virtual forces into the underactuated dynamics with frictional contacts. Hence, both nonlinear programming and DDP variants can be used to solve the problem, as shown by

192

M. Wang et al.

[20,29,31,46]. While the methods developed here are applicable to our previous work, we build upon a CC-based CITO framework similar to [33] as it is more widely used in the related literature and thus a more relevant comparison. Planar Pushing. The mechanics of planar pushing in the presence of friction was ﬁrst studied in [23] which focuses on the rotational direction and the object’s center. In [18], the mechanics for stable pushing is studied. More recently, [50] have shown that with a sticking contact, Dubins path can plan trajectories eﬃciently. In [14] a hybrid MPC framework is proposed to track a pushing trajectory without limiting the contact mode to sticking only, instead the contact mode for each time step is deﬁned as a discrete variable. To avoid the combinatorial problem’s computational, convex problems are solved for a fraction of potential contact schedules for the prediction horizon, and the solution is determined comparatively. In [15], an aggregated contact mode selection method is proposed to reduce the complexity of MIP and to eliminate the need for a predeﬁned contact mode schedule. They proposed a learning method to predict the proper contact mode to achieve a higher control frequency. However, this approach is not very versatile as it requires re-training for object shape variations. State-Triggered Constraints. STCs were ﬁrst introduced in [43] as a more modular, uni-directional alternative to CCs and were used to avoid MIP and enable real-time performance for a rocket landing task with state-dependent constraints. In [42], this concept is extended into a more general form called compound STCs showing that more articulate trigger and constraint conditions can be obtained by applying Boolean logic operations. STCs have been successfully applied to real-time rocket landing [34,35] and quadrotor path planning [41].

2 2.1

Background Contact Model

Following [38], a contact impulse due to a frictionless inelastic collision between two rigid bodies can be modeled by 0 ≤ φ(q) ⊥ λn ≥ 0; where the non-negativity condition for the signed distance φ(q) prevents interpenetration between bodies, and the non-negativity condition for the normal contact force λn ensures that the contact bodies can only push each other. The frictional forces λt ∈ R2 can be modeled using the relative tangential (slip) velocity in the contact frame ˙ the friction coeﬃcient μ and by splitting tangential force into directional ψ(q, q), − components λ+ t , λt ≥ 0: − 0 ≤ (μλn − λ+ t − λt ) ⊥ γ ≥ 0,

˙ ⊥ 0 ≤ (γ + ψ(q, q)) ˙ ⊥ 0 ≤ (γ − ψ(q, q))

λ+ t λ− t

(1a)

≥ 0,

(1b)

≥ 0,

(1c)

Contact-Implicit Planning and Control of Pushing Using STCs

193

where γ is an auxiliary variable corresponding to the absolute value of the slip velocity. It is noteworthy that this formulation is based on the maximization of the dissipation of kinetic energy subject to pyramidical friction cone constraints, see [19] for a derivation. However, the stationarity condition is neglected, and this leads to not being able to counteract external forces to prevent slippage when the slip velocity is zero. Nevertheless, these constraints are suﬃcient to ensure that the tangential forces selected by the planner are bounded by the friction polyhedron when sticking and counteracting at the boundary when slipping. 2.2

Contact-Implicit Trajectory Optimization

The CIP problem can be transcribed into a nonlinear program (NLP) for N time steps of size h using a backwards Euler discretization of the underactuated dynamics for an nv na +nu degrees-of-freedom (DOF) system with na actuated and nu unactuated DOF: minimize X,U,F,S

N

2

w1 q˙ i 2 + w2 si 1

(2a)

i=1

subject to: qi+1 = qi + hq˙ i+1 , Mi+1 (q˙ i+1 − q˙ i ) = h(Sa τ i + JTc,i+1 λc,i − ci+1 ) ∀i,

(2b)

τ L ≤ τ i ≤ τ U , xL ≤ xi ≤ xU ∀i, x1 = xinit , xN +1 = xgoal , φ(qi ), λn,i , si ≥ 0 ∀i,

(2c) (2d)

φ(qi ) · λn,i ≤ si ∀i,

(2e)

where q ∈ Rnq and q˙ ∈ Rnv are the generalized positions and velocities of ˙ ∈ Rnv denote the mass matrix and the the system; M(q) ∈ Rnv ×nv and c(q, q) Coriolis, centrifugal, and gravitational terms; τ ∈ Rna is the vector of generalized joint forces; Sa ∈ Rnv ×na is the selection matrix for the actuated DOF; and λc ∈ R3nc is the vector of contact forces for nc contacts that are projected onto the joint space through the contact Jacobian Jc (q) : Rnq → R3nc ×nv . Representing the state by x [qT q˙ T ]T ∈ Rn , X [x1 , ..., xN +1 ], U [τ 1 , ..., τ N ], F [λ1 , ..., λN ], S [s1 , ..., sN +1 ] are the state, control, force, and slack trajectories; and τ L , τ U , xL , and xU are the lower and upper control and state bounds. The slack variables s are introduced to relax the CCs (including those for frictional forces) and penalized in the cost by an exact penalty function, as described in [3,19]. The constraints are evaluated element-wise [33]. 2.3

Quasi-static Planar Pushing

In this work, we consider a single point pusher-slider system under the assumptions of quasi-static motion, Coulomb friction, and uniform pressure distribution over the support plane by following the formulation in [14]. We model the dynamics of planar pushing using an ellipsoidal approximation of the limit surface [12].

194

M. Wang et al.

We consider three contact modes between the pusher and the slider: sticking, sliding left, and sliding right. Each mode imposes a diﬀerent set of friction constraints on the system. The planning problem can then be formulated as an MIP by linearizing the dynamics given desired state and control trajectories and a contact mode schedule. This framework is used to control the system in a receding horizon fashion such that the deviations from the nominal trajectories are minimized. We refer the reader to [14,15] for details.

3 3.1

Approach State-Triggered Constraints

An STC consists of a strict inequality trigger condition g(x, u) < 0 and an inequality constraint condition c(x, u) ≤ 0 such that g(x, u) < 0 ⇒ c(x, u) ≤ 0.

(3)

That is, the constraint c(x, u) ≤ 0 becomes active if the trigger condition g(x, u) < 0 is met. To include this logical implication in a smooth optimization problem, it must be represented using continuous variables, by introducing an auxiliary variable σ ∈ R++ and the following constraints 0 ≤ σ ⊥ (g(x, u) + σ) ≥ 0,

(4a)

σc(x, u) ≤ 0.

(4b)

Given (x, u), (4a) deﬁnes a linear complementarity problem in σ which has a unique solution that can be obtained analytically as σ ∗ = −min(g(x, u), 0). Plugging σ ∗ into (4), we can obtain an STC that is continuous in (x, u) – i.e., a continuous STC (cSTC) – as −min(g(x, u), 0) · c(x, u) ≤ 0., see [43] for details. 3.2

Automatic Contact Constraint Activation

In the CC-based CITO, the constraint space grows with the number of contact pairs. Yet, at any instance, only a fraction of contact pairs are useful. Thus, we propose an automatic CA mechanism based on STCs that eliminates unnecessary contact pairs. This concept, based on a simple heuristic for planar pushing, is intended to showcase the versatility of STCs in this domain. Figure 1 illustrates the CA mechanism for a planar pushing scenario where d is the vector from the box’s current location to the desired. In this case, contact candidates are typically deﬁned by pairing each end eﬀector with each surface of the manipuland. However, the contact pairs that cannot generate normal forces to aid task completion can be deactivated by bounding the corresponding normal forces to zero. This is achieved by replacing the CCs in (2e) by the following STCs g1 NT d > 0 ⇒ φ(q) · λn ≤ s,

(5a)

g2 N d < 0 ⇒ λn = 0,

(5b)

T

Contact-Implicit Planning and Control of Pushing Using STCs

195

Fig. 1. Automatic contact constraint activation: the activated and deactivated contact pairs are indicated by green and red complementarity conditions, respectively. Contact pairs become active if the distance is below a threshold, e.g., the ﬁfth pair in this case.

where N [n1 , ..., nnc ] ∈ R3×nc and n denotes the surface normal. Note that when the normal force is set to zero, the tangential forces are also suppressed. In addition to selectively activating contact pairs for manipulation, this mechanism can be used to activate contact constraints eﬃciently in a cluttered environment. For example, the distance between the manipuland and the environment being less than a threshold, dthresh , can be used as the trigger condition to activate important contact pairs on the ﬂy, i.e., g3 φ(q) < dthresh ⇒ φ(q) · λn ≤ s,

(6a)

g4 φ(q) > dthresh ⇒ λn = 0,

(6b)

This is somewhat similar to the collision constraint relaxation used in [36] when the distance is above a safety margin. It should be noted that the activation of constraints based on task utility, i.e., the addition of (5) into (2), is performed only for the contact pairs between the robot and the manipuland. Whereas, the activation of contact pairs that have a distance less than the threshold, i.e., the addition of (6) into (2), is performed for all contact pairs. Thus, contact interactions are enabled between all contact pairs throughout the optimization but in a computationally-eﬃcient way.1 3.3

A Reformulation of Coulomb Friction

The Coulomb friction model can be expressed in terms of if-then conditions as ˙ = 0 ⇒ ||λt || ≤ μλn , ψ(q, q) ˙ = 0 ⇒ λt = −sign(ψ(q, q))μλ ˙ ψ(q, q) n. 1

(7a) (7b)

Experiments demonstrating the distance-triggered contact pairs can be found in the accompanying video.

196

M. Wang et al.

Such logical implications can be conveniently expressed using the following STCs + ˙ > 0 ⇒ λ− g5 ψ(q, q) t = μλn and λt = 0,

(8a)

λ− t

(8b)

˙ 0 is a coeﬃcient for an elementlocal damping term called Rayleigh damping, where K = ∂fe /∂x is called the stiﬀness matrix. The parameter γm ∈ R>0 is a coeﬃcient for a global viscous damping term called mass-proportional damping. Constitutive Model. The choice of the function fe determines the elastic behavior of the system, and is equivalent to the choice of a constitutive model. The choice of such a model for a given object is a ﬁeld in itself [13]. In this work, we select the Neohookean model described in [4], primarily for the fact that compressing an element to zero volume requires inﬁnite energy. We emphasize that this choice is based on the real-world properties of the particular object used in our experiments, and our method is compatible with all major viscoelastic constitutive models. The Neohookean constitutive model is deﬁned by the energy function 1 1 ψ(C) = λ(ln J)2 − μ lnJ + μ(tr C − 3) 2 2 where λ and μ are material parameters called the Lam´e parameters, and are eﬀectively a nonlinear change of variables from Poisson’s ratio and Young’s modulus. For a local deformation gradient F representing the linearized deformation across a single element,

244

D. Millard et al.

Fig. 2. An example state trajectory

C = FT F and J = det(F). C is called the right Cauchy-Green deformation tensor. As mentioned, the ln J terms result in an inﬁnite amount of energy required to compress an element to zero volume, and thus an inﬁnite restitutive force away from such compression. Boundary Conditions. Manipulation problems where the robot can be assumed to have a rigid and unbreakable grasp of the object may be modelled via time-dependent Dirichlet boundary conditions. In the description of a PDE problem, boundary conditions represent the behavior of valid solutions at the edges of the domain. In particular, a Dirichlet boundary condition speciﬁes the value of the solution for regions of the boundary. For the solid mechanics problems considered here, this corresponds to specifying the position and velocity of speciﬁc nodes on the boundary of the mesh, representing the rigid grasp of the robot manipulator. By varying the states speciﬁed by the boundary condition with time, we represent the motion of the robot, and excite motion in the rest of the mesh. In our problem formulation, we assume that the robot has rigid control over a subset of “grasped” nodes with indices Ig ⊆ {1, . . . , n}, and call the state of those nodes xg = (xi )i∈Ig . We model the grasp as a time-varying 6-DOF rigid transform ut ∈ SE(3) of these nodes, resulting in the constraints ¯i xt,i = ut x

∀i ∈ Ig , ∀t ∈ {1, . . . , T },

¯ i is the undeformed position of node i. where x Time Integration. The second-order FEM dynamics (3.1) can be solved by any ODE solver by introducing variables v = x˙ and lowering to a ﬁrst order system. Alternatively, there exist tailored ODE solvers for second-order systems with mass and damping matrices. For general forward simulation tasks, we use the second order implicit Newmark integrator [24], which enables stable timestepping for timesteps on the order of 10 ms.

Parameter Estimation for Deformable Objects

245

Measurement Model. We assume that in the real system, we have the ability to measure the positions of the subset of mesh vertices with indices Io ⊆ {1, . . . , n}. This is denoted by y = h(x) = (xi )i∈Io . 3.2

Nonlinear Programming for Parameter Estimation

The model in Sect. 3.1 describes an entire family of dynamical systems. Collecting free variables from Eqs. (3.1) and (3.1), we see that this family of systems is parameterized by a choice of (M, λ, μ, γr , γm ). In this work we assume that the mass matrix M is known. In experiments, we compute M by weighing the deformable object, assuming uniform density throughout the mesh, and dividing the resulting mass of each tetrahedron equally among its vertices. (This is known as a lumped mass matrix.) We denote the remaining parameters as θ = (λ, μ, γr , γm ). The goal of parameter estimation is to pick system parameters θ∗ which best describe a set of collected real-world observations. We collect data from the real system by supplying a predetermined discretetime input signal and recording the corresponding output signal. To simplify notation, we present our method using only one pair of input/output signals. However, the method is easily extended to a dataset of multiple input/output signal pairs. Let U = (u1 , . . . , uT ) ∈ SE(3)T denote the input signal. We execute U with zero-order hold using a constant time interval of Δt, yielding the output ˆ T ). We will also use the notation X = (x1 , . . . , xT ) to signal Yˆ = (ˆ y1 , . . . , y denote a sequence of mesh states. Broadly, parameter estimation can be described as the following optimization problem: θ∗ , X ∗ =

argmin L(X, θ) X,θ

subject to

T

2

ˆ yt − h(xt )2

t=1

(1)

X consistent with Equation (3.1) and (3.1).

The choice of squared Euclidean norm to quantify error is arbitrary, and could be replaced by other divergences or metrics. Equality Constraints. Given a candidate state trajectory X = (x1 , . . . , xT ) and candidate parameters θ, we wish to evaluate how well they satisfy Equation (3.1). At each timestep t ∈ {1, . . . , T }, we introduce an equality constraint function ct (X, θ) fe (xt ; θ) + fd (xt , vt ; θ) + fext − M(θ)at = 0, where

xt − xt−1 xt − 2xt−1 + xt−2 and at = . Δt Δt2 In our experiments the system begins in its rest state, so it is reasonable to set v1 = 0 and for t ≤ 2, at = 0. vt =

246

D. Millard et al.

Fig. 3. Structure of the matrices in the computation of ∂c/∂x . Left: The overall block-diagonal structure of ∂c/∂x . The ﬁrst block column is the parameter Jacobians ∂ct /∂θ . The rest of the matrix is a block structure with a diagonal and two subdiagonals, coming from the time diﬀerencing in the calculation of vt and at . Right: An example symmetric sparsity pattern of one of the diagonal subblocks. This sparsity pattern comes from the connectivity of the nodes in the FEM mesh.

Equation (3.2) applies only for nodes which are not grasped by the manipulator. For the grasped nodes xg we introduce a separate equality constraint which enforces the Dirichlet constraint in Equation (3.1), and supersedes Equation (3.2): ¯i. cgt (X, θ) xt,i − ut x In realistic FEM problems, a mesh may have thousands of nodes, and a trajectory may have thousands of timesteps. Many powerful local optimization algorithms rely on the Jacobian of the constraint vector to choose a search direction for improvement. As such, evaluation of the combined equality constraint c and its Jacobian require special care for eﬃcient computation. Here we state the equations for ∂ct /∂X and ∂ct /∂θ , with some comments on performant implementation. ∂ct 1 γr γm K(xt ) + γr H(xt ) · vt + M− = 1+ M ∂xt Δt Δt Δt2 ∂ct 1 2 = − (γr K(xt ) + γm M) + M ∂xt−1 Δt Δt2 ∂ct 1 = − 2M ∂xt−2 Δt where the order-3 tensor H(xt ), called the Hessian stiﬀness tensor, has elements Hijk = ∂Kij /∂xtk , and A · b = Aijk bk is a tensor contraction along the last dimension. In practice, this term is computed eﬃciently as the directional derivative Dv K(x).

Parameter Estimation for Deformable Objects

247

As shown in Fig. 3, notice that the structure of ∂c/∂x is block sparse, and block lower triangular, and that each block may be computed independently and in parallel. Additionally, each block is itself a sparse matrix arising from the structure of the FEM mesh. Storing ∂c/∂x using a sparse matrix represenation is crucial, as the uncompressed matrices can approach terabytes of storage of almost all exact zeros. Quadratic Penalty Methods. Equation (3.2) is diﬃcult to optimize. While the objective is convex, there are a large number of nonlinear equality constraints, so the feasible set is highly nonconvex. Furthermore, the problem is very high dimensional. Satisfying hundreds of simultaneous nonlinear equality constraints is challenging, and we ﬁnd that many methods reach very suboptimal local minima quite quickly. Relaxing the constraints into a penalty allows optimization algorithms to explore outside of the strictly physically feasible set of trajectories by transforming the constrained Equation (3.2) into the unconstrained problem μ 2 ct (X, θ)2 , argmin L(X, θ) + 2 X,θ t where μ > 0, which we solve using a nonlinear conjugate gradient method [10]. µ-Scheduling. Equation (3.2) leaves a free parameter μ which weights the relative contribution of the equality constraints c to the output matching loss function L. As is common in the optimization literature, we ﬁnd that by exponentially increasing the value of μ during optimization (by a factor of 10) improves optimizer performance. Initial X Guess. Additionally, the initial guess of the state trajectory X strongly inﬂuences optimization convergence. Given an initial guess θ0 for the parameters, we compute an initial state trajectory as a numerical solution of the dynamics (3.1) where the input u is determined by the zero-order hold of the sequence U and the initial state x1 is set to the solution of the rest-state problem for θ0 , as detailed below. Since X0 satisﬁes Equation (3.1) for θ0 , it also satisﬁes ct (x, θ0 ) = 0 for all timesteps t, up to small diﬀerences caused by the ODE integration scheme. Therefore, it is an approximately feasible point for the optimization problem (3.2). Rest State Problem. The rest state problem refers to ﬁnding an equilibrium point of the FEM dynamics under the Dirichlet boundary condition, that is, solving the nonlinear system of equations fe (x) = −fext ,

xi = u¯ xi ∀i ∈ Ig

for x with u and θ ﬁxed. When solving this system numerically, we must take precautions to ensure that the solver does not query FD for a state x with very large restorative forces caused by the ln det(F) term in the Neo-Hookean

248

D. Millard et al.

Fig. 4. A motivating testbed problem setup for our proposed method. Left: Franka Emika Panda arm rigidly grasping a foam cylinder, with attached Vicon markers. Right: Visualization of a simulated FEM mesh used for the methods described in this paper. Table 1. Comparison of ground-truth and estimated values for the material parameters θ in our experiment. (λ, μ): Lam´e parameters. (γr , γm ): Rayleigh damping and mass-proportional damping constants. Parameter Ground truth Estimated λ μ γr γm

200.0 300.0 0.009 1.0

143.8 347.3 0.011 0.91

constitutive model (3.1). In our experiments it was suﬃcient to use Newton’s method with a backtracking line search to ensure that the stiﬀness matrix K remains positive deﬁnite.

4

Results

We implement a FEM simulation environment in the Julia language [5], and use the Optim [22] software suite’s implementation of a nonlinear conjugate gradient algorithm to optimize Equation (3.2). To store the sparse Jacobian ∂c/∂x , we use the common Compressed Sparse Column (CSC) format. We are able to evaluate the Jacobian of a trajectory of 100 timesteps, with a mesh of 176 3dimensional nodes in less than 100 ms (minimum 57 ms, median 82 ms, mean 98 ms). Given an initial parameter guess θ0 , we are able to identify a set of parameters and state variable which achieve a mean optimization loss L(X, θ) of 1× 10−3 m mean Euclidean error. However, it is important to note that due to the penalty

Parameter Estimation for Deformable Objects

249

method presented, there is still slack in the dynamics constraints during optimization, and that these parameters diverge during forward simulation, where an individual point on the distal tip of the mesh has a mean 1.1 cm tracking error. This points towards possible issues with our choice of a penalty-based optimization algorithm or towards a diﬀerence between the implicit Newmark integrator and the ﬁnite-diﬀerence approximations enforced in the dynamics constraints ct (X, θ). This provides motivation for ongoing work.

5

Conclusion and Ongoing Work

We have proposed a method to estimate the material parameters of a deformable object by observing the object’s motion under non-destructive robotic manipulation. Our method is intended to improve the ﬁdelity of ﬁnite-element models in simulation environments for complex manipulation tasks. Unlike approaches based on the adjoint method or shooting, our method is similar to collocationstyle trajectory optimization and estimates the unobserved trajectory of the ﬁnite-element mesh along with the material parameters. We believe that this method will show an improved set of initial guesses which will converge to a correct system parameter identiﬁcation, compared to other methods in the literature. Our ongoing work focuses on three main eﬀorts. First, we intend to compare the accuracy and computational cost of a wider range of constrained nonlinear optimization algorithms for the collocation-style problem setup. Second, we will investigate the possibility of exploiting the problem structure for faster or more accurate results, such as solving a series of estimation problems over increasing time horizons. Third, we will compare our method against adjoint- and shooting-based parameter estimation algorithms. We are especially interested in comparing stability-related properties such as sensitivity to initial guess.

References 1. Akkaya, I., et al.: Solving Rubik’s Cube with a Robot Hand (2019) 2. Armstrong, B., Khatib, O., Burdick, J.: The explicit dynamic model and inertial parameters of the PUMA 560 arm. In: Proceedings of the 1986 IEEE International Conference on Robotics and Automation, vol. 3, pp. 510–518. IEEE 3. Barbiˇc, J., James, D.L.: Real-time subspace integration for St. Venant-Kirchhoﬀ deformable models. In: ACM Transactions on Graphics (TOG), vol. 24, no. 3, pp. 982–990 (2005) 4. Belytschko, T., et al.: Nonlinear Finite Elements for Continua and Structures. John Wiley & Sons, Hoboken (2014) 5. Bezanson, J., et al.: Julia: a fresh approach to numerical computing. In: SIAM Review, vol. 59, no. 1, pp. 65–98 (2017). https://doi.org/10.1137/141000671 6. Coumans, E., Bai, Y.: PyBullet, a Python module for physics simulation for games, robotics and machine learning (2016). http://pybullet.org 7. de Avila Belbute-Peres, F., et al.: End-to-end diﬀerentiable physics for learning and control. In: Advances in Neural Information Processing Systems, vol. 31 (2018)

250

D. Millard et al.

8. Elmqvist, H., Mattsson, S.E., Otter, M.: Modelica-a language for physical system modeling, visualization and interaction. In: Proceedings of the 1999 IEEE International Symposium on Computer Aided Control System Design (Cat. No. 99TH8404), pp. 630–639. IEEE 9. Ha, H., Song, S.: Flingbot: the unreasonable eﬀectiveness of dynamic manipulation for cloth unfolding. In: Conference on Robot Learning. PMLR, pp. 24–33 (2022) 10. Hager, W.W., Zhang, H.: Algorithm 851: CGD ESCENT, a conjugate gradient method with guaranteed descent. In: ACM Transactions on Mathematical Software, vol. 32, no. 1, pp. 113–137 (2006). ISSN: 0098-3500. https://doi.org/10.1145/ 1132973.1132979 11. Hahn, D., et al.: Real2sim: visco-elastic parameter estimation from dynamic motion. In: ACM Transactions on Graphics (TOG), vol. 38, no. 6, pp. 1–13 (2019) 12. Heiden, E., et al.: Disect: a diﬀerentiable simulation engine for autonomous robotic cutting. In: Robotics: Science and Systems (2021) 13. Holzapfel, G.A.: Nonlinear Solid Mechanics: A Continuum Approach for Engineering. Wiley. ISBN: 978-0-471-82304-9 14. Hu, Y., et al.: DiﬀTaichi: diﬀerentiable programming for physical simulation. In: ICLR (2020) 15. Koenig, N., Howard, A.: Design and use paradigms for gazebo, an open-source multi-robot simulator. In: 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (IEEE Cat. No. 04CH37566), vol. 3, pp. 2149–2154. https://doi.org/10.1109/IROS.2004.1389727 16. Kutz, J.N., et al.: Dynamic Mode Decomposition: Data-Driven Modeling of Complex Systems. SIAM, Philadelphia (2016) 17. Levine, S., et al.: Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. http://arxiv.org/abs/1603.02199 18. Lim, V., et al.: Planar robot casting with real2sim2real self-supervised learning (2021) 19. MacNeal, R.H., McCormick, C.W.: The NASTRAN computer program for structural analysis. In: Computers & Structures, vol. 1, no. 3, pp. 389–412 (1971) 20. Mahnken, R.: Identiﬁcation of material parameters for constitutive equations. In: Encyclopedia of Computational Mechanics 2nd edn. pp. 1–21. John Wiley & Sons, Ltd. ISBN: 978-1-119-17681-7. https://doi.org/10.1002/9781119176817.ecm2043 21. Mitrano, P., McConachie, D., Berenson, D.: Learning where to trust unreliable models in an unstructured world for deformable object manipulation. In: Science Robotics, vol. 6, no. 54, p. eabd8170 (2021). https://doi.org/10.1126/scirobotics. abd8170, https://www.science.org/doi/abs/10.1126/scirobotics.abd8170 22. Mogensen, P.K., Riseth, A.N.: Optim: a mathematical optimization package for Julia. J. Open Source Softw. 3(24), 615 (2018). https://doi.org/10.21105/joss. 00615 23. M¨ uller, M., et al.: Detailed rigid body simulation with extended position based dynamics. In: Computer Graphics Forum, vol. 39, pp. 101–112. Wiley Online Library 24. Nathan, M.: Newmark. American Society of Civil Engineers, A Method of Computation for Structural Dynamics (1959) 25. Preiss, J.A., et al.: Tracking fast trajectories with a deformable object using a learned model. In: IEEE Conference on Robotics and Automation (2022) 26. Reece, A.R.: The fundamental equation of earth-moving mechanics. In: Proceedings of the Institution of Mechanical Engineers, Conference Proceedings, vol. 179, no. 6, pp. 16–22. SAGE Publications, London (1964)

Parameter Estimation for Deformable Objects

251

27. Sharma, A., Azizan, N., Pavone, M.: Sketching curvature for eﬃcient out-ofdistribution detection for deep neural networks. In: de Campos, C., Maathuis, M.H. (ed.) Proceedings of the 37th Conference on Uncertainty in Artiﬁcial Intelligence, vol. 161. Proceedings of Machine Learning Research. PMLR, 27–30 July 2021, pp. 1958–1967. https://proceedings.mlr.press/v161/sharma21a.html 28. Slotine, J.J.E., Li, W., et al.: Applied Nonlinear Control, vol. 199. Prentice Hall Englewood Cliﬀs, New Jersey (1991) 29. Tassa, Y., Erez, T., Todorov, E.: Synthesis and stabilization of complex behaviors through online trajectory optimization. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 4906–4913. IEEE 30. Todorov, E., Erez, T., Tassa, Y.: Mujoco: a physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE 31. Wang, B., et al.: Deformation capture and modeling of soft objects. ACM Trans. Graph. 34(4), 1–94 (2015) 32. Wang, B., Zheng, M., Barbiˇc, J.: Adjustable constrained soft-tissue dynamics. In: Computer Graphics Forum, vol. 39, no .7, pp. 69–79 (2020). ISSN: 14678659. https://doi.org/10.1111/cgf.14127, https://onlinelibrary.wiley.com/doi/abs/ 10.1111/cgf.14127 33. Weng, T., et al.: FabricFlowNet: bimanual cloth manipulation with a ﬂow-based policy. In: Conference on Robot Learning (2021)

Place-and-Pick-Based Re-grasping Using Unstable Placement Jiaming Hu(B) , Akanimoh Adeleye, and Henrik I. Christensen University of California San Diego, La Jolla, CA 92093, USA {jih189,akadeley,hichristensen}@eng.ucsd.edu https://www.cogrob.org Abstract. Place-and-pick-based re-grasping is an in-hand object manipulation strategy designed for a simple gripper. In this paper, we address the problem of place-and-pick-based re-grasping on objects without stable intermediate placements. We propose a modiﬁed place-and-pick-based strategy by using unstable placements as intermediate placements for regrasping. The robot can place the object in an unstable pose where the gripper can re-grasp the object while using the ﬁngertips to constrain the object from moving. We demonstrate the experimental validity of the proposed re-grasping strategy in both simulation and the real world.

Keywords: Manipulation

1

· Simple gripper · Extrinsic dexterity

Introduction

Picking up an object and re-grasping it is a common skill in robotics manipulation. This is because the initial grasp on the object is often not the grasp needed for the next desired manipulation task. For example, when picking up a hammer from a toolbox, it is unlikely that the “pick-up” grasp is the same position needed to swing the hammer eﬀectively. Frequently, re-grasping, moving from an initial grasp to a better one, will be necessary after an initial pick up of an object. A single-arm with a parallel gripper is a common platform in robotics because of its simplicity. However, the simplicity implies a lack of intrinsic dexterity. Thus, instead of considering only intrinsic dexterity, extrinsic dexterity has became more popular in re-grasping using a parallel gripper. Since the 1990s,s, place-and-pick-based re-grasping [1,4,10,11] is a common strategy for re-grasping using a parallel gripper. In place-and-pick-based re-grasping, the planner utilizes an object model, searches for force-closure grasps, computes the intermediate placements of the object, and constructs a re-grasp graph. Then, the planner ﬁnds a sequence of re-grasp actions to reorient the object by exploring the shortest path through the re-grasp graph. Given a supporting surface, a robot with a parallel gripper can reorient an object from one pose to another by using a sequence of place-downs and pick-ups. Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25555-7 17. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 252–267, 2023. https://doi.org/10.1007/978-3-031-25555-7_17

Place-and-Pick-Based Re-grasping Using Unstable Placement

253

Fig. 1. Re-grasp the book by sliding the ﬁngertips over the surface.

Even though those place-and-pick-based re-grasping methods have exhibited robustness and eﬃciency, they have limitations. That is, each re-grasping action requires the object to be graspable in a stable placement. If the object is not graspable in a stable placement, this placement cannot be used as an intermediate placement. We present a modiﬁed place-and-pick-based re-grasping method by incorporating a re-grasping action without using stable placement. Between each place-down and pick-up, the robot can place an object in a speciﬁc unstable placement, called “semi-stable placement”, and use the ﬁngertips to constrain it from moving. Rather than leaving the object on a supporting surface, the parallel gripper can slide its ﬁngertips over the object’s surface, so it can re-grasp and constrain the object simultaneously. For example, as shown in Fig. 1, when a book is placed on a table vertically, the gripper can slide its ﬁngertips from the top to the side while holding it. Re-grasping without using stable placements then becomes feasible. The structure of the paper is as follows. First, in Sect. 2, we review the extrinsic dexterity works on re-grasping by using a parallel gripper. In Sect. 3, we present the modiﬁcations in our re-grasping system for improvement. Finally, in Sect. 4, we run the experiment of searching re-grasping solution in the simulation and test our modiﬁed re-grasping system in the real-world task for showing its feasibility.

2 2.1

Related Work Place-and-Pick-Based Re-grasping

Place-and-pick-based re-grasping approaches use a gripper to reorient an object from one placement to another by using a sequence of place-and-pick actions. Signiﬁcant works [1,2,4,11] have been reported over the last three decades, yet, the main idea remains the same today. First, compute both the stable placement pose set and the collision-free grasp set of the object. From these, build a regrasping graph. Generally nodes of this graph represent stable placements. Two nodes are connected if the robot can pick and place the object between those two placements. After inserting initial grasp and target grasp to the re-grasping graph, a sequence of place-and-pick actions can be computed for re-grasping the object via any graph-based planning algorithm. Building such a re-grasping

254

J. Hu et al.

graph can be expensive because of its reliance on solving IK (Inverse kinematics) for each step and checking for collisions (CD). A key goal has been to improve the eﬃciency of re-grasping planning. In 2000, [11] developed a re-grasping planner, which uses a probabilistic roadmap framework with two levels. The ﬁrst level of the planner is a re-grasping graph. The second level of the planner consists of actual motion planning for pick and place actions. In the ﬁrst level, each edge is annotated with a feasible probability computed by a low cost local motion planner. This probability represents the planner’s conﬁdence that the grasp edge is feasible, and does not necessarily guarantee success. Therefore, the second level of the planner executes motion planning for the pick-place edges with high probability ﬁrst to reduce the number of IK solving and CD. Later in 2015, [4] provided another two-level re-grasp planner, which can reduce the IK solving and CD further. At the ﬁrst level, the algorithm uses the standard re-grasping graph conﬁguration outlined above. During planning, this level will produce a sequence of placements for re-grasping. The second level will extract the sub-graph containing only the sequence of placements from the ﬁrst level planner, and search for a path. If the sub-graph does not contain the solution, it will go back to the ﬁrst level and remove the edge that failed. Then, it will extract a new sub-graph from the ﬁrst level and search for another solution. Finally, in 2019, [12] used the work of [4] as the midlevel connector in a robotics manipulating system and tested its performance in tasks like packing, object assembly, and using tools in the real world. 2.2

Re-grasping with Support from the Environment

Place-and-pick-based re-grasping depends on accessible environmental locations to stably place the object between grasps. Prior works make the assumption that the object has many diﬀerent ways of being placed on a surface, and each placement has many valid grasps to pick it back up. However, some objects may have only a few stable placements, like a cylinder, or stable placements that do not support any valid grasp at all, like a book placed ﬂat on a table. Thus, to address the challenge, some researchers tried to have new kinds of intermediate placement like using a supporting pin [16]. Nevertheless, placing an object on a support pin may not be stable enough, so later they also added an external support gripper [17] in the environment to provide a more stable intermediate placement. However, this solution requires multiple assumptions about the environment and will not be applicable in the real-world situations we consider in this work.

3

Methodology

The main goal of this paper is to extend the place-and-pick-based re-grasping by including a new re-grasping action. In a place-and-pick action, we can leverage a set of semi-stable placements as the intermediate placement where the gripper can re-grasp by sliding over the object’s surface. For incorporating this

Place-and-Pick-Based Re-grasping Using Unstable Placement

255

Fig. 2. Red dot: mass center. Light blue face: the supporting face of a placement. Yellow line: the projection line with length dmass between the mass center and the supporting face. Purple line: the shortest line with length dboundary between the mass center projection on the supporting face and the supporting face boundary. Orange line: the closest boundary edge to the mass center projection. (Color ﬁgure online)

new re-grasping action into the re-grasping system, there are three modiﬁcations. The ﬁrst modiﬁcation is searching for semi-stable placements to include as part of the intermediate placement set. The second modiﬁcation is planning the gripper motion to slide over the object’s surface in a semi-stable placement. Finally, the third modiﬁcation is re-designing the re-grasp graph construction for incorporating the new re-grasping action. 3.1

Intermediate Placement for Re-grasping

Intermediate placement is an indispensable part of the place-and-pick-based re-grasping method. In previous place-and-pick-based re-grasping methods [4, 11],the generally used intermediate placements are stable placement which have following two requirements: 1. The mass center projection along the supporting face’s normal falls into its interior. 2. The ratio dmass /dboundary must be less than a threshold thstable . where dboundary is the shortest distance between the mass center projection on the supporting face and the supporting face boundary; dmass is the shortest distance between the mass center and the supporting face. When the ratio dmass /dboundary is too high, the object becomes a lever system around the closest edge (the orange line in Fig. 2) of the supporting face boundary to the mass center projection. The placement is stable if only the following equation is hold: m · g · dboundary ≥ dmass · Fdisturbance where m is the object mass and Fdisturbance is a possible disturbance force applied to the object. Based on the equation, if the ratio dmass /dboundary is greater , the object may fall over the than the stable threshold thstable = Fdisturbance m·g closest boundary edge (orange line in Fig. 2) of the supporting face. Given that

256

J. Hu et al.

Fig. 3. The pipeline of placement classiﬁcation.

a placement is stable, the object must also be graspable with multiple collisionfree grasp for the placement to be considered an intermediate placement. On the other hand, semi-stable placement is a type of placement satisfying the only ﬁrst requirement of stable placement. For avoiding possible falling, the semi-stable placement requires the gripper to constrain the object. As shown in Fig. 1, the book is in a semi-stable placement where it contacts with the table using its side face. Due to the short distance between the mass center projection on the supporting face and the supporting face boundary (Fig. 4), the book will be unstable once the gripper moves away. Besides this, because the mass center projection is inside the interior of the supporting face, no force should exist from the book to the ﬁngertips. As long as the gripper does not clamp the book but holds it, it can slide its ﬁngertips over the book surface in semi-stable placement with little ﬁction. Therefore, the gripper can re-grasp an object in a semi-stable placement as an intermediate placement. 3.2

Stable and Semi-stable Placement Searching

The ﬁrst modiﬁcation in our re-grasping system is searching for both semi-stable and stable placements as intermediate placements. First, we compute the object’s convex hull and consider each convex hull face as a contact face between the object and the supporting surface. Then, from those possible contact faces, we use the ﬂowchart shown in Fig. 3 to classify the supporting faces of both stable and semi-stable placement and determine which of them are the intermediate placement. For convenience, we call the contact faces to support the object in stable and semi-stable placement as “stable placement face”’ and “semi-stable placement face” respectively. Similarly, the contact face of an intermediate placement is called the “intermediate placement face.” With a given contact face, we calculate the projection of the mass center on it along its normal. If the mass center projection is inside of the contact face’s

Place-and-Pick-Based Re-grasping Using Unstable Placement

257

Fig. 4. Blue face: the semi-stable placement face of the book; Orange arrow: the possible falling direction; Red dot: the mass center.

boundary, then this contact face is a placement face used to support the object in a placement. Then, we calculate the ratio Rmass/boundary between dmass and dboundary of a placement face, so we can determine what type of placement this placement face belongs to. Given a ratio threshold thstable , if the ratio Rmass/boundary of a placement face is less than thstable , then this placement face is considered as a stable placement face. Otherwise, it is an semi-stable placement face. To determine whether a placement is an intermediate placement or not, we need to verify whether the object is re-graspable in that placement. Furthermore, because the re-grasping action is used diﬀerently in stable and semi-stable placements, the veriﬁcation methods in those two placements are distinct. To determine whether a stable placement is an intermediate placement, we need to determine whether the object is graspable with multiple grasp poses in that stable placement. Given a stable placement Pstable , if multiple collisionfree grasp poses exist over the object’s surface, then the object is re-graspable in Pstable . Hence, we can consider the stable placement Pstable an intermediate placement. On the other hand, a semi-stable placement is an intermediate placement where requires the gripper to constrain the object for re-grasping. This paper only considers the gripper sliding motion where the ﬁngertips axis, which is the axis passing through two ﬁngertips, is parallel with the supporting surface. Given an semi-stable placement face Fsemi−stable , the direction Vf all (Fig. 4) from the mass center projection on Fsemi−stable to the closest face boundary is the possible falling direction of Fsemi−stable . That is, when the object is in a semi-stable placement with the supporting surface Fsemi−stable , the object may fall in the direction of Vf all due to uncertain disturbance. Hence, we search for grasp poses where the gripper’s ﬁngertips axis is parallel to the Vf all so that the ﬁngertips can constrain the possible falling. Furthermore, even though the ﬁngertips constrain the object

258

J. Hu et al.

Fig. 5. Yellow objects are in semi-stable placement, while orange objects are in stable placement. Because there is not way to pick up the L shape object in stable placement, there is no valid grasps for it.

motion in a plane perpendicular to Vf all , the falling motion in the constrained plane is still possible. Thus, we must verify whether the object is stable with the supporting of the ﬁngertips. After projecting the mesh model along with Vf all as shown in the right of Fig. 4, the contact face becomes a line (green line in the right of Fig. 4) in 2D. If the mass center projection (orange dot) on that line is close to one of the contact line’s endpoint, the object may fall over that endpoint. For this reason, we ﬁnd the closest distance dendpoint between the mass projection and the contact line’s endpoints. If the dmass /dendpoint is less then thstable , the object is stable with constraint of the ﬁngertips. The next step is checking whether gripper sliding motion is possible or not in the semi-stable placement. We will discuss this in Sect. 3.3 which demonstrates how to plan the gripper sliding motion in a semi-stable placement. If the gripper sliding motion exists in the semi-stable placement, we consider this semi-stable placement as an intermediate placement. For demonstration, we visualize ﬁve stable and semi-stable placements of an L-shape plate in Fig. 5. The orange plates are in stable placements, but no grasp poses exist. The yellow objects are in semi-stable placement, and those green grippers are the collision-free grasp poses on the object. Then, similar to [4], for each computed intermediate placement, we can continue to sample more intermediate placement poses by rotating it around the normal of the table’s normal. 3.3

Gripper Sliding Planning

The second modiﬁcation of the system is the gripper sliding motion planning for re-grasping the object in semi-stable placement. We propose to build a “sliding graph” used for planning the gripper sliding motion over the object’s surface in a semi-stable placement. The sliding graph is a disconnected undirected graph for a semi-stable placement. Each node in the graph is a collision-free grasp pose, while each edge represents a viable sliding motion between two grasp poses. The ﬁrst step of sliding graph construction is generating a set of collisionfree grasp poses over the object’s surface. Then, given semi-stable placement Psemi−stable , we ﬁnd a set of collision-free grasp poses where the gripper’s ﬁn-

Place-and-Pick-Based Re-grasping Using Unstable Placement

259

Fig. 6. The sliding graphs in diﬀerent semi-stable placements of H-shape.

gertips axis is parallel with the possible falling direction of Psemi−stable ’s contact face. These grasp poses are regarded as sliding graph nodes. The second step of construction is verifying the connections between sliding graph nodes by checking the feasible movement between them. Because the gripper can only slide in plane motion, we only verify the connection between two grasp poses sharing the same ﬁngertips plane which is a plane between two ﬁngertips and perpendicular to the ﬁngertips axis. For nodes whose grasp poses shares the same ﬁngertips plane, we use Voronoi diagrams to ﬁnd the possible connections among them based on their grasp pose’s position. From those possible connections, we consider two nodes are connected if the gripper can use one of the following ways to move from one to another without collision. 1. Translation in ﬁngertips plane. During translation, the gripper’s orientation remains constant, and the ﬁngertips must contact with the object’s surface. 2. Rotation around ﬁngertips axis. Because of the limitations of the sliding motion, the sliding graph of a semi-stable placement usually contains multiple disconnected components {G1 , G2 , ...}. Each component has a set of grasp poses where the gripper can re-grasp among them by sliding. As in Fig. 6, where the partial visualization of the sliding graph network for two semi-stable placements of a H-shape is shown. On the left placement, the gripper cannot slide through the gap of the H-shape, there are two disconnected components in one side. Because the parallel gripper cannot ﬂip its pose to reverse its ﬁngertips positions, there are four total disconnected components in the left placement. Similarly, there are six disconnected components in the right placement. When planning a sliding gripper motion from the start grasp pose gi to the target grasp pose gt in a semi-stable placement, we ﬁnd common component Gc where gi and gt can move to by using translation or rotation mentioned above. If Gc exists, we add both gi and gt into Gc , then search the shortest path between them, so we have the solution by extracting the grasp poses on the path in order. Furthermore, given a semi-stable placement, we can verify whether the existence of the sliding motion between two grasp poses by searching the common component.

260

3.4

J. Hu et al.

Re-grasp Graph Construction

The third modiﬁcation of the system is taking the new re-grasping action into account in the re-grasp graph construction. Because our system considers two types of intermediate placements, using the intermediate placements as nodes makes the re-grasp graph complex. Thus, we treat each collision-free grasp pose as a node for simplicity. Two grasp pose nodes are connected by an intermediate placement if the robot can use one grasp pose to place the object in that placement and pick it up with another grasp. Given an placement P , if P is stable, the gripper can re-grasp between all collision-free grasp poses of P by releasing and grasping. Thus, we connect each pair of collision-free grasp poses in P . On the other hand, if P is semi-stable, the gripper can only re-grasp between grasps of the same component of P ’s sliding graph. Thus, we connect each two grasp poses if they are in the same component. As a result, each edge between two grasp pose nodes should contain a set of intermediate placements for re-grasping between them. 3.5

Re-grasp Action Motion Planning

Re-grasp action motion planning involves both action path searching and action motion planning. The action path searching is searching a sequence of re-grasping actions without considering motion planning. The action motion planning is computing the arm motion trajectory involving obstacle avoidance for accomplishing each action of the sequence computed by action path searching. Action Path Searching. We leverage the re-grasp graph and search for a sequence of re-grasping actions for re-orienting the object pose in hand from the initial grasp pose to the target grasp pose. First, we need to insert both initial and target grasp poses as nodes into the re-grasp graph. We search for intermediate placements where the initial and target grasp poses are collisionfree. In each intermediate placement P where the initial grasp is collision-free, we connect the initial grasp pose to grasp poses where the gripper can re-grasp from the initial grasp pose in P . We add the target grasp pose to the graph in a similar way. Second, we search for the shortest re-grasp action path, as shown in the part (b) of Fig. 7, from the initial node to the target node by using the Dijkstra search [26] on the re-grasp graph, as shown in the part(a) of Fig. 7. Action Motion Planning. After planning the re-grasp action path, we plan the arm motion for accomplishing each re-grasp action in the path. First, given the action path {ginit , ...., gtarget }, we extract the intermediate placements on the edges and expand them for generating a action graph AG as shown in the part (c) of Fig. 7. Between each pair of grasp pose nodes in part (c), all directed edges are the intermediate placements used for re-grasping from the last grasp pose to the next one.

Place-and-Pick-Based Re-grasping Using Unstable Placement

261

Fig. 7. (a) re-grasp graph. (b) planned action path. (c) action graph expanded from the action path. gi is a collision-free grasp pose, while pj is a placement.

For each two consecutive grasp pose nodes, we search for an intermediate placements between them for re-grasping. If the placement is stable during searching, we have two motion planning tasks. First, place down the object with the last grasp pose. Second, grasp the object with the next grasp pose. Diﬀerently, when the placement is semi-stable, MPEP (Motion Planning along End-Eﬀector Paths) [23] is required. Given the initial and target grasp poses, we plan a gripper motion trajectory by searching the sliding graph, then plan the arm motion trajectory to make the gripper follow it. Once the solution is found, we can claim the re-grasping is feasible between them. If none of intermediate placements can be used for re-grasping between two grasp pose nodes, a re-planing must be done after removing the edge between those two nodes.

4

Experiments

We validate our improved place-and-pick-based re-grasping system in a set of re-grasping tasks in simulation and the real-world with Fetch robot [27]. 4.1

Re-grasping in Simulation

Plate-Like Object Re-grasping. The goal of these experiments are regrasping a set of thin plate-like objects, as shown in Fig. 8, which are not regraspable with previous re-grasping system. For each object, we randomly select

262

J. Hu et al.

Book

H-shape

L-shape

O-shape

U-shape

Cut

Triangle

Random

Fig. 8. All plate-like objects used in simulation experiment.

(a) No possible motion to move out from the gap. (b) Two gaps causes no motion from one side to another.

Fig. 9. The failure cases of gripper sliding motion planning.

100 pairs of grasps and plan to re-grasp from the one grasp to another. However, because the parallel gripper cannot ﬂip its pose by reversing its ﬁngertips position, the 100 re-grasping task does not include tasks to reverse the ﬁngertips position. Furthermore, because our proposed work does not focus on motion planning, we expect that the motion planning algorithm returns a solution. Table 1. Success rate of re-grasping on each object. Objects

Book H-shape L-shape O-shape U-shape Cut

Success rate 100% 52%

100%

21%

81%

Triangle Random

100% 100%

100%

Table 1 shows that our proposed re-grasping system can successfully re-grasp most thin plate-like objects. However, not all re-grasping tasks have a 100% success rate. For example, during re-grasping the U-shape object, shown in the left of Fig. 9a, failure happens when one grasp is in the gap of the U while another is not. In this case, when the gripper is in the gap, it is constrained by the sides of the gap. As a result, there is no solution for sliding out from it. When re-grasping the H-shape object, as shown in the right of Fig. 9b, once

Place-and-Pick-Based Re-grasping Using Unstable Placement

263

Fig. 10. The set of objects for comparison.

the sampled grasp pairs are on diﬀerent sides, no sliding solution exists because of the upper and lower gaps of the H where the gripper cannot slide through. During re-grasping O-shape object, most of its sides are round, so it has only one semi-stable placement. If one grasp of the sampled pair is not collision-free in that placement, the re-grasping task will fail. Based on the analysis above, we conclude two main reasons causing failures. First is a gap on the object impeding the sliding motion. Because the translation movement requires the ﬁngertips to hold the object, sliding over the gap will break the requirement. The second is the lack of semi-stable placement. The goal of this work is increasing the possible intermediate placement for improving the connectivity of the re-grasp graph. If the object does not have enough intermediate placements, the connectivity of the re-grasp graph cannot be guaranteed and the re-grasping system may fail in these cases. Comparison with Original Place-and-Pick-Based Re-grasping. For showing the improvement, this section compares our modiﬁed re-grasping system with the original place-and-pick-based re-grasping system. We select nonthin objects with a low success rate with the original re-grasping system. These objects have at least one semi-stable placement used for re-grasping. Thus, we select an L-pipe, bottle, and can object, as shown in Fig. 10. Those objects have some semi-stable placements where the gripper can constrain their motion to roll around during re-grasping. Next, we randomly sampled 100 pairs of collision-free grasps for each object without further constraints over the object’s surface. Then, we ran both our modiﬁed system and the original system to plan the re-grasping solution from one grasp to another in each sampled pair. The right of Fig. 11 shows the success rate of the two diﬀerent re-grasping systems on each object. Based on the table, our modiﬁed re-grasping system has almost twice the success rate on each object compared to the original re-grasping system. Success Rate with Considering Motion Planning. In both previous experiments, we did not consider the performance of the motion planning algorithm, so this section analyzes how the performance of the motion planning system impacts the success rate of both re-grasping systems. In this experiment, we want to compare two systems, so we select the can that can be re-grasped in both systems. We ran both re-grasping systems and measured their success rates

264

J. Hu et al.

Fig. 11. The success rate.

with diﬀerent motion planning success rates. During testing, we ran re-grasping between 100 collision-free grasp pairs with each motion planning success rate. On the left of Fig. 11, the horizontal axis is the probability of the motion planning ﬁnding solution, while the vertical axis is the success rate of the re-grasping system. Based on the result, we have two conclusions. First, as long as the motion planning has more than 28% success rate, our re-grasping system can always return a solution to re-grasp the can. Thus, with the motion planning system having a low success rate, our re-grasping system can still ﬁnd a solution for the can task. Furthermore, the success rate of our re-grasping system was consistently higher or equivalent to the original system. 4.2

Re-grasping in Real World

The goal of our real-world experiment is to demonstrate the feasibility of our re-grasping system. Thus, in this section, we tested our re-grasping system on the Fetch in two diﬀerent tasks for demonstration. The ﬁrst task is grasping a book with its side. Given the RGBD image, we ran MaskRCNN [20] and Super4PCS [21] to detect the book and estimate its pose on the table, so the robot could plan to grasp it with the target grasp pose. However, the robot realized that the target grasp pose was not achievable with the current conﬁguration. Therefore, the robot ﬁrst picked the book with a achievable grasp and used our re-grasping system to plan a sequence of re-grasp actions for reorienting the object from the current grasp to the target grasp. Later, the robot found an open space on the table as the supporting surface by performing plane ﬁtting on the detected point cloud. After that, as shown in Fig. 12, the robot placed the book with its side face on the table. Next, while holding the book, the gripper slid its ﬁngertips to the target grasp. Finally, the robot can pick up the book with the target grasp. The second task is grasping an almond can from the top side. Compared to the last task, the almond can with a cylinder shape has stable and semi-

Place-and-Pick-Based Re-grasping Using Unstable Placement

265

Fig. 12. After picking up the book, re-grasp by sliding ﬁngertips over its surface.

Fig. 13. Re-grasping the can with both semi-stable and stable placement.

stable placements for re-grasping. After estimating the object’s pose, the robot realized that a re-grasping is necessary in the task. Therefore, our re-grasping system planed two actions for re-grasping the object with the target grasp pose. The ﬁrst action is sliding in a semi-stable placement, as shown in the upper of Fig. 13. After ﬁnding the open area on the table, the robot ﬁrst placed the almond can with its side face on the table. Then, after opening the gripper, the robot used its ﬁngertips to constrain the rolling motion of the can. At the same time, the gripper slid to the next grasp pose. The second action is re-grasping by releasing and grasping in a stable placement, as shown in Fig. 13. After the ﬁrst action, the robot lifted the can and placed it into the subsequent intermediate placement. Then, the robot opened the gripper and left the can in a stable placement. Finally, the robot moved the gripper to the target grasp pose and grasped the can.

5

Conclusion

This paper points out the limitation of the previous place-and-pick-based regrasping system and improves it by adding a new kind of re-grasp action. For incorporating the new re-grasping, this paper provides three modiﬁcations. First,

266

J. Hu et al.

we introduce a new type of intermediate placement called “semi-stable placement” and a pipeline to search for them. Second, we include the re-grasp gripper motion planning by sliding the gripper’s ﬁngertips over the surface while holding it in semi-stable placement. Third, we propose a modiﬁed re-grasp graph and utilize it for planning. Furthermore, in both simulation and the real-world, we have tested our modiﬁed system in a set of re-grasping tasks and show its ability to improve re-grasping compared to the original re-grasp system.

References 1. Tournassoud, P., Lozano-Perez, T., Mazer, E.: Regrasping. In: Proceedings 1987 IEEE International Conference on Robotics and Automation (1987) 2. Rohrdanz, F., Wahl, F.M.: Generating and evaluating regrasp operations. In: Proceedings of International Conference on Robotics and Automation (1997) 3. Stoeter, S.A., Voss, S., Papanikolopoulos, N.P., Mosemann, H.: Planning of regrasp operations. In: Proceedings 1999 IEEE International Conference on Robotics and Automation (1999) 4. Wan,W., Mason, M.T., Fukui, R., Kuniyoshi, Y.: Improving regrasp algorithms to analyze the utility of work surfaces in a workcell. In: 2015 IEEE International Conference on Robotics and Automation (2015) 5. Cruciani, S., Smith, C., Kragic, D., Hang, K.: Dexterous Manipulation Graphs. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (2018) 6. Pfanne, M., Chalon, M., Stulp, F., Ritter, H., Albu-Sch¨ aﬀer, A.: Object-level impedance control for dexterous in-hand manipulation. IEEE Robot. Autom. Lett. 5, 2987–2994 (2020) 7. Bai, Y., Liu, C.K.: Dexterous manipulation using both palm and ﬁngers. In: 2014 IEEE International Conference on Robotics and Automation (2014) 8. Chavan-Daﬂe, N., Mason, M.T., Staab, H., Rossano, G., Rodriguez, A.: A twophase gripper to reorient and grasp. In: 2015 IEEE International Conference on Automation Science and Engineering (2015) 9. Daﬂe, N.C., et al.: Extrinsic dexterity: in-hand manipulation with external forces. In: 2014 IEEE International Conference on Robotics and Automation (2014) 10. Lozano-Perez, T., Jones, J.L., O’Donnell, P.A., Mazer, E.: HANDEY: a robot task planner. MIT Press (1992) 11. Nielsen, C.L., Kavraki, L.E.: A two level fuzzy PRM for manipulation planning. In: Proceedings 2000 IEEE/RSJ International Conference on Intelligent Robots and Systems (2000) 12. Wan, W., Igawa, H., Harada, K., Onda, H., Nagata, K., Yamanobe, N.: A regrasp planning component for object reorientation. Auton. Robot. 43(5), 1101–1115 (2019) 13. Sintov, A., Shapiro, A.: Swing-up regrasping algorithm using energy control. In: IEEE International Conference on Robotics and Automation (2016) 14. Vi˜ na B.F.E., Karayiannidis, Y., Pauwels, K., Smith, C., Kragic, D.: In-hand manipulation using gravity and controlled slip. In: 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (2015) 15. Shi, J., Woodruﬀ, J.Z., Umbanhowar, P.B., Lynch, K.M.: Dynamic in-hand sliding manipulation. IEEE Trans. Rob. 33(4), 778–795 (2017)

Place-and-Pick-Based Re-grasping Using Unstable Placement

267

16. Cao, C., Wan, W., Pan, J., Harada, K.: Analyzing the utility of a support pin in sequential robotic manipulation. In: IEEE International Conference on Robotics and Automation. IEEE (2016) 17. Cao, C., Wan, W., Pan, J., Harada, K.: An empirical comparison among the eﬀect of diﬀerent supports in sequential robotic manipulation. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (2016) 18. Haustein, J.A., Cruciani, S., Asif, R., Hang, K., Kragic, D.: Placing objects with prior in-hand manipulation using dexterous manipulation graphs. In: 2019 IEEERAS 19th International Conference on Humanoid Robots (2019) 19. Bhatt, A., Sieler, A., Puhlmann, S., Brock, O.: Surprisingly robust in-hand manipulation: An empirical study. In: Robotics: Science and Systems (2021) 20. He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask R-CNN. In: 2017 IEEE International Conference on Computer Vision (2017) 21. Mellado, N., Aiger, D., Mitra, N.J.: Super 4pcs fast global pointcloud registration via smart indexing. Comput. Graph. Forum 33(5), 205–215 (2014) 22. Coleman, D., Sucan, I.A., Chitta, S., Correll, N.: Reducing the barrier to entry of complex robotic software: a moveit! case study. CoRR, vol. abs/1404.3785, 2014 23. Oriolo, G., Ottavi, M., Vendittelli, M.: Probabilistic motion planning for redundant robots along given end-eﬀector paths. IEEE/RSJ International Conference on Intelligent Robots and Systems (2002) 24. Kavraki, L.E., Svestka, P., Latombe, J.C., Overmars, M.H.: Probabilistic roadmaps for path planning in high-dimensional conﬁguration spaces. IEEE Trans. Robot. Autom. 12(4), 566–580 (1996) 25. Raessa, M., Wan, W., Harada, K.: Planning to repose long and heavy objects considering a combination of regrasp and constrained drooping. Assem. Automat. 41(3), 324–332 (2021) 26. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numer. Math. 1(1), 269–271 (1959) 27. Wise, M., Ferguson, M., King, D., Diehr, E., Dymesich, D.: Fetch & Freight: standard Platforms for service robot applications’.workshop on autonomous mobile service Robots, held at the 2016 International Joint Conference on Artiﬁcial Intelligence, NYC, July (2016)

Safe, Occlusion-Aware Manipulation for Online Object Reconstruction in Confined Spaces Yinglong Miao, Rui Wang, and Kostas Bekris(B) Rutgers University, New Brunswkick, NJ 08901, USA [email protected] Abstract. Recent work in robotic manipulation focuses on object retrieval in cluttered spaces under occlusion. Nevertheless, the majority of eﬀorts lack an analysis of conditions for the completeness of the approaches or the methods apply only when objects can be removed from the workspace. This work formulates the general, occlusion-aware manipulation task, and focuses on safe object reconstruction in a conﬁned space with in-place rearrangement. It proposes a framework that ensures safety with completeness guarantees. Furthermore, an algorithm, which is an instantiation of this abstract framework for monotone instances is developed and evaluated empirically by comparing against a random and a greedy baseline on randomly generated experiments in simulation. Even for cluttered scenes with realistic objects, the proposed algorithm signiﬁcantly outperforms the baselines and maintains a high success rate across experimental conditions. Keywords: Interactive perception · Manipulation · Task and motion planning · Object rearrangement · Object reconstruction

1

Introduction

Robotic manipulation has wide applications in manufacturing, logistics, and domestic domains. Unstructured scenes, such as logistics and home environments, require perception (mostly vision) for scene understanding [3]. Thus, uncertainty in perception, such as occlusion, complicates the manipulation process. This work aims to address this issue; speciﬁcally, to safely reconstruct unknown objects online in a conﬁned space and fully reveal the scene through in-hand sensing and in-place relocation, given a static RGB-D camera. Such online object reconstruction is valuable functionality for manipulation in realistic scenarios where novel and unknown objects can appear frequently and challenge model-based methods [22]. Most applications, such as object retrieval in clutter and object rearrangement (e.g., setting up a dinner table), may require reconstructing only a subset of the objects in the scene, however, it is essential Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25555-7 18. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 268–284, 2023. https://doi.org/10.1007/978-3-031-25555-7_18

Safe, Occlusion-Aware Manipulation

269

to understand how to reconstruct objects in these unstructured scenes with occlusion safely. Hence this work focuses on the case where all objects need to be reconstructed safely to completely reveal the scene while avoiding potential collisions. This approach contributes a foundational understanding of the object reconstruction domain and the conditions under which safety and probabilistic completeness can be achieved. Addressing this task, however, can be challenging for a variety of reasons. For instance, robot reachability is restricted in conﬁned spaces, such as shelves; this complicates viable object placements and planning safe robot movements. Reachability constraints introduce dependencies between objects, which restrict the order with which objects can be manipulated safely [30]. On the other hand, occlusion among objects occurs given a single-camera view of the conﬁned space. In such cases, it is hard to extract occluded objects safely, and thus poses another constraint on the order of object movements. These two challenges are interleaved and need to be solved simultaneously. The majority of work in robot manipulation has assumed known object models [10,20,24,33]. More recently, the focus has shifted to cases where object models are unavailable. Nevertheless, most of these methods do not provide theoretical guarantees, either due to the dependence on machine learning [2,8,15,16], or assumptions that are valid only in limited scenarios [14]. Learning-based methods bring great promise to achieve eﬃciency and are highly desirable if they can be applied within a safe and complete framework. This work aims to provide such a framework on top of which more eﬃcient solutions can be built. In summary, this work ﬁrst categorizes problems in object manipulation under occlusion and surveys past work in this ﬁeld. Then, the task of safe object reconstruction in a conﬁned space with in-place relocation is formalized. This formalization geometrically deﬁnes occlusion and the incurred object relationships. This leads to a probabilistically complete algorithmic framework with accompanying proof. The framework is general and can have diﬀerent instantiations based on the implementation of primitives. An instantiation for monotone problems is proposed, which empirically shows signiﬁcant improvements against a random and a greedy baseline in terms of computation time and number of actions taken, with a high success rate1 .

2

Parameters of Occlusion-Aware Manipulation

This section considers parameters that deﬁne the diﬃculty of occlusion-aware manipulation problems. Uncertainty in Perception. Continuous uncertainty arises mainly due to noise in perception, while discrete uncertainty arises due to occlusions. Most eﬀorts have focused on continuous uncertainty by introducing a probability distribution (e.g., a Gaussian PDF) to represent a belief state. In contrast, [32] explicitly models the uncertainty as a discrete set of hypotheses and constructs 1

Code and videos: manipulation.

https://sites.google.com/scarletmail.rutgers.edu/occlusion-

270

Y. Miao et al.

a motion planner for a picking task. Discrete uncertainty can be modeled via geometric reasoning. This work focuses on this direction and assumes perfect sensing without continuous noise. Alternatives in the literature do not explicitly model visual uncertainty and often apply learning-based methods [28]. Workspace Constraints: Conﬁned Spaces. Conﬁned spaces, such as shelves, which are the focus of this work, are more constraining than a tabletop, which is a setup widely studied [26,27,33], since lateral access limits grasp poses. External Buﬀers. Problem diﬃculty also depends on whether external buﬀers for object placement exist. When unlimited external buﬀers are provided, objects can be gradually extracted. Otherwise in-place rearrangement is necessary, which requires ﬁnding placement poses [1,5,11]. Object Model Knowledge. The level of knowledge about objects can range from known models, to category-level shape knowledge to unknown object models. With known object models, objects are often assumed to be perfectly recognized once a part becomes visible [10]; or with an error that is modeled probabilistically [20]. Category-level knowledge can provide hints to estimate object models. For instance, when objects are cylinders, the pose and size can be estimated from visible parts [24]. When object models are unknown, objects may need to be reconstructed [14,22]. When the number of hidden objects is given, their poses can be sampled inside the occlusion region [2,20,33]. Otherwise, the scene needs to be fully revealed when it is unknown in the worst case. Clutter and Object Relationships. The level of clutter aﬀects the ability to grasp objects and place objects freely [1,5]. Furthermore, objects may interact in non-trivial ways. For instance, objects can be stacked on top of each other, which constrains the sequence of grasps [18,28]. Alternatively, objects can be physically unstable in unstructured scenes, such as heaps and piles [8,19]. Camera Placement. The camera, when attached to the robot arm, can be moved to reconstruct the scene [2]. On the other hand, when the camera is static, the robot must manipulate objects to place them in the camera’s ﬁeld of view, which requires solving motion planning problems under visibility constraints. Action Type. A mobile-based robot has higher reachability than a ﬁxed base one. Non-prehensile actions allow more ﬂexibility, such as moving groups of objects, than prehensile actions but increase the level of uncertainty. Manipulation Among Movable Obstacles [29] often employs such non-prehensile operations. Combinatorial Aspects. A monotone rearrangement problem is one where objects need to be moved at most once. Non-monotone problems are harder instances of rearrangement [31]. This work aims to extend notions of monotonicity in the context of manipulation under occlusion and visibility constraints.

Safe, Occlusion-Aware Manipulation

3

271

Related Work

Object manipulation are generally solved by Task and Motion Planning (TAMP) [12], where past works have focused on object rearrangement in cluttered and conﬁned spaces [30,31]. Interactive Perception focuses on the interplay of perception with manipulation [3,13]. Model-Based Object Retrieval in a Shelf. Many works in the literature focus on object search and retrieval from a shelf with known object models or through learning methods [2,10,15,16,20,24,33]. In this setting, the perception module is assumed to fully recognize the object once a suﬃcient part is observed, and ignores partial occlusions. Prior work uses a graph to encode object occlusion and reachability dependencies and constructs an algorithm to solve the problem optimally [10]. An alternative encodes the traversability of objects through a graph named T-graph, which limits to cylinders [24]. Both of these eﬀorts remove objects from the scene, in which case completeness is trivial to satisfy. POMDP-Based Approaches. The POMDP framework has been used to formulate the problem as “Mechanical Search” [8], which has been applied in diﬀerent settings [15,16]. Various works have relied on POMDP formulations [2,20,33], which are solved by standard solvers or Reinforcement Learning (RL) algorithms. In this context, the trustworthiness of object recognition has been explicitly modeled based on the occlusion ratio [20]. Learning-based methods have been proposed to sample target poses [16], which are then used to guide a tree-search framework with an RL policy [2]. Some of the eﬀorts use learningbased estimators to predict the target object pose in 1D along the image width, which then guides the planner to ﬁnd the policies, but is incomplete in solving the problem compared with baselines that have full knowledge of the scene [16] [15]. The most related eﬀort to this work [14] constructs a multi-step look-ahead algorithm that completely explores occluded regions in a grid-world scenario. While it extends the algorithm to realistic settings, the proposed perception module is ad-hoc. Meanwhile, the planner relies on a sampling of actions to move valid objects instead of analyzing the dependency structure, which may provide better properties for more eﬃcient planning. Furthermore, the method does not reconstruct objects to retain information and better represent the objects. This Work. A complete solution can be easily constructed with known object models by combining prior work. This work considers a complicated task with unknown object models and focuses on safe object reconstruction with in-place relocation. At the same time, this work introduces a set of assumptions for theoretical analysis. In particular, objects are assumed to be convex, speciﬁcally cylinders or rectangular prisms, and do not stack on top of each other. This work adopts a TSDF representation [25], similar to prior eﬀorts [22], to reconstruct object models. To ensure safety, it models the object by a conservative

272

Y. Miao et al.

volume, and an optimistic volume following prior work [23]. In-place object rearrangement is a hard problem since the simpliﬁed version of circle packing in 2D is already NP-hard [9]. Recent works [1,5] sample placements and evaluate them based on reachability conditions, but limit to cylindrical objects. Instead, this work applies to general-shaped objects by a novel convolution-based method to sample collision-free placements, similar to the method used in [15].

4

Problem Formulation

A workspace W is deﬁned as the volume inside a shelf with a supporting surface where objects can be placed stably. At any moment, it is composed of free, collision, and occlusion space, given the robot’s knowledge. An object model is deﬁned as the set of points occupied by the corresponding object given an object pose. The ground-truth object model is denoted as Xi (si ), which indicates the set of points occupied by object oi at pose si . As only an estimate of the object model is available during execution, two approximations are used: • a conservative volume Xicon (si ) that contains the space occluded or occupied by object i; • an optimistic volume Xiopt (si ) that contains the revealed surfaces of object i. At any time, these subsets satisfy that Xiopt (si ) ⊆ Xi (si ) ⊆ Xicon (si ). The conservative volume is used during motion planning for collision checking to ensure safety (discussed later in this paper); the optimistic volume is used to compute a picking pose for a suction gripper. s0 (i) denotes the initial pose of object oi . A robot is represented by its geometry R(q), which is the set of points occupied by the robot at the robot state q (e.g., the set of the robot’s joints’ angles). A suction cup is attached to the robot arm and can perform two actions: attach and detach. After attaching, the object is assumed to have a ﬁxed transformation relative to the gripper until detached. The prehensile action of picking and placing is parameterized as (oi , si , τi ), which encodes moving object i to pose si via the robot trajectory τi . Occlusion. A pinhole camera is ﬁxed relative to the robot base, with a pose denoted as c. Perception is performed at discrete-time steps after each manipulation action. The occlusion space due to object oi at time t is then represented as: Oi (si , t) ≡ {c + α(x − c) : α ∈ (1, ∞), x ∈ Xi (si )}, which arises from the RGBD image captured by the camera c at time t. The total occlusion from all objects at time t is deﬁned as O(t) = i Oi (si , t). The variable t will be often omitted if its value is obvious or unimportant, resulting in the occlusion space Oi (si ) for oi up to time an object pose si of object oi . The occlusion space due to object t−1 t given the observation history is then Oi (si , 1..t) ≡ Oi (si , t) ∩ k=1 O(k). The t total occlusion space up to time t is k=1 O(k). The visible surface of an object oi at pose si can be extracted from the image: Fi (si ) = {x : φ(x) ∈ φ(Xi (si )), d(x) = depth(φ(x))}. Here φ(·) denotes

Safe, Occlusion-Aware Manipulation

273

the projection of the 3D geometry on the 2D image; d(x) denotes the depth of point x relative to the camera; and depth(φ(x)) denotes the value of the depth image at the projected pixel φ(x). Due to occlusion, an object may not be fully visible given an image. An object is deﬁned to be fully revealed at time t if all points on its visible surface have been revealed up to time t. It is partially revealed if some points on the visible surface of its model have been seen but not all yet. Otherwise, the object is hidden, and the camera has not seen any part of the visible surface of its model. Safety. The notion of safety considered here imposes two requirements. First, the conservative model of an object should not collide with other objects or the scene. Meanwhile, an object should not collide with hidden objects or parts, which requires a careful analysis of the extraction motion2 . Extraction motions can be found for speciﬁc object shapes, such as cylinders and rectangular prisms, which are assumed by this work. For such cases, this work uses an extraction direction perpendicular to the shelf opening in its reasoning to guarantee safety. Thus, the problem is to ﬁnd a sequence of actions [(oi , si , τi )] that allows to fully reconstruct all objects (i.e., remove the occlusion volume) safely.

5

Algorithms and Analysis

This section deﬁnes a data structure to represent object dependencies and then illustrates a theoretical framework that attains probabilistic completeness. A concrete instantiation is introduced later to solve monotone tasks. 5.1

Occlusion Dependency Graph (ODG)

Our algorithm depends on a graph which represents visibility constraints among objects. To construct the graph, the occlusion relationship between objects needs to be deﬁned. Recall that the occlusion space of an object i is deﬁned to be Oi (si ). Then for objects i and j at poses si and sj , the part of j occluded by i can be deﬁned as Oij (si , sj ) ≡ Oi (si ) ∩ Xj (sj ). Following this, object i at pose si is deﬁned to occlude (directly or indirectly) another object j at pose sj if Oij (si , sj ) = ∅. As objects may jointly occlude a region, it is helpful to separate the occlusion contribution of each object. To achieve this, the direct occlusion ˜ i (si ) ≡ Oi (si ) ∩ ( Oj (sj )), space of an object i at pose si can be deﬁned: O j=i where Oj (sj ) denotes the complement of set Oj (sj ). In other words, it is the region where the camera ray does not pass through other objects after object i. Then the part of object j directly occluded by object i can be deﬁned as ˜ ij (si , sj ) ≡ O ˜ i (si ) ∩ Xj (sj ). Similar as before, object i at pose si is deﬁned O ˜ ij (si , sj ) = ∅. The diﬀerences between to directly occlude j at pose sj if O occlusion space and direct occlusion space of an object are illustrated in Fig. 1. 2

Please refer to the Appendix for an analysis of object extraction. Appendix can be found at: https://arxiv.org/abs/2205.11719.

274

Y. Miao et al.

Building on the above-deﬁned notions, a directional graph can be built in order to model object extraction dependencies, which is named the Occlusion Dependency Graph (ODG). In this graph, each node i represents the object ˜ ij (si , sj ) = ∅, i.e. if object i. Two nodes i and j are connected by edge (j, i), if O i directly occludes j. An example of the ODG is shown in Fig. 1, where a node C is included to represent the camera, and an edge (i, C) exists if part of object i is visible from the camera. Safety constraints require the topological order of ODG to be followed so that only fully revealed objects are extracted ﬁrst. The existence of such an order is guaranteed when the ODG is a Directed Acyclic Graph (DAG), the condition of which is shown in Lemma 13 . Lemma 1. Define the relation A ≺ B if there exists a on the 2D projection of object A and b on the projection of B such that (c, a, b) forms a straight line, where c indicates the camera 2D projection. Then if the top-down projections of objects are convex and do not intersect with each other, objects are acyclic with the defined relation. If object A occludes B, it must hold that A ≺ B. Thus, if the ODG contains a cycle, there exists a cycle for the relation “≺”. Hence for objects whose top-down projections are convex and do not intersect, the ODG is a DAG. In practice, the ODG can only be gradually revealed from partial observations but not fully constructed from the initial observation. Nevertheless, the topological order of ODG can still be followed by image cues even without ODG4 . Having a topological order is not suﬃcient, as other objects can block extraction, and in-place relocation5 can modify the scene. To jointly tackle these challenges, while following

5

4

3

4

3

2 1

4

5

5 2

2

3

1

1

C (a)

(b)

(c)

Fig. 1. Illustration of ODG. (a): Occlusion region and direct occlusion region of object 1. The dark red region indicates the direct occlusion region, while the occlusion region includes the light and the dark regions. (b): Direct occlusion region of each object. Each occlusion region is shown using a similar color as the corresponding object. (c): ODG of the objects. An edge (i, j) exists if object j directly occludes object i. C indicates the camera. 3 4 5

The proof can be found in the Appendix: https://arxiv.org/abs/2205.11719. See Appendix for a detailed discussion: https://arxiv.org/abs/2205.11719. In-place relocation refers to rearrangement of objects within the workspace, i.e., not utilizing buﬀer space for the rearrangement that is external to the workspace.

Safe, Occlusion-Aware Manipulation

275

the topological order of the ODG, a theoretical framework is introduced in the following section. 5.2

A Probabilistically Complete Algorithm

The above challenges can be decoupled into the reachability and visibility challenges. The former involves ﬁnding valid object poses and motion planning of the robot, while the latter involves reconstructing objects and revealing occlusion. The following primitives can be introduced to tackle them individually: 1. reconstruct(i): this primitive extracts fully-revealed object i, reconstructs it, and places it back to its initial pose. 2. reach rearrange(V, i): this primitive rearranges objects in set V so that object i becomes reachable, and can be safely extracted. 3. vis rearrange(V, O): this primitive rearranges objects in set V so that part of the occlusion space O becomes revealed, and returns the revealed space. These primitives are general and can be implemented diﬀerently in practice. Building upon them, a framework can be obtained as shown in Algorithm 1. This algorithm only reconstructs fully revealed objects and relocates objects after they are reconstructed. Hence it follows the topological order of ODG and ensures safety. The algorithm ensures an object i is extractable by reach rearrange(V ∗ , i) where V ∗ is the set of reconstructed objects. It makes sure the direct occluded space of object i is revealed by ˜ i (s0 )), where O ˜ i (s0 ) can be determined based on images. vis rearrange(V ∗ , O i i The algorithm postpones failed actions and attempts them again when more space is revealed. It terminates when all actions fail, indicating that the problem is unsolvable, and avoiding inﬁnite loops that make no progress. Lemma 2. The algorithm defined in Algorithm 1 is probabilistically complete (and interchangeably resolution-complete). Proof. If the algorithm ﬁnds a solution, the solution is correct, as it follows the topological order of ODG and reveals objects and occlusion safely. Conversely, if a solution exists, it can be converted to a sequence of the primitives mentioned above. Denote (ij , sj , τj ) as the j-th action in the solution, and denote the index of the ﬁrst action applied to object i as π(i). Since the solution follows the topological order of ODG, it can be reformulated as: [(iπ(1) , sπ(1) , τπ(1) ), (iπ(1)+1 , sπ(1)+1 , τπ(1)+1 ), . . . , (iπ(2) , sπ(2) , τπ(2) ), (iπ(2)+1 , sπ(2)+1 , τπ(2)+1 ), . . . , (iπ(n) , sπ(n) , τπ(n) )] As the solution is safe, {iπ(k)+1 , iπ(k)+2 , . . . , iπ(k+1)−1 } are objects that are reconstructed before object iπ(k+1) , hence only includes objects in {iπ(1) , . . . , iπ(k) }. Notice that (iπ(k) , sπ(k) , τπ(k) ) can be converted into two actions: 1 2 ), (iπ(k) , sπ(k) , τπ(k) ), (iπ(k) , s0 (iπ(k) ), τπ(k)

276

Y. Miao et al.

Algorithm 1: Target Extraction Under Occlusion Framework 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

init action queue Q = [], revealed objects V , and reconstructed objects V ∗ = ∅ update action queue: Q = [reconstruct(i), ∀i ∈ V ] while |V ∗ | = n do If all actions in Q failed, then terminate as an inﬁnite loop occurs, and the problem is unsolvable action ← Q.pop() if action == reconstruct(i) then if reach rearrange(V ∗ , i) ==FALSE then Q.push(reconstruct(i)) else reconstruct(i) V ∗ ← V ∗ ∪ {i} ˜ i (s0i ))) Q.push(vis rearrange( , O if action == vis rearrange( , O) then O ← vis rearrange(V ∗ , O) if O == ∅ then Q.push(vis rearrange( , O)) else V ←obtain newly revealed objects with revealed occlusion space Q.push(reconstruct(i)) ∀i ∈ V V ←V ∪V Q.push(vis rearrange( , O \ O ))

which puts object iπ(k) back to its initial pose after reconstruction, and then to pose sπ(k) . The ﬁrst action is equivalent to primitive reconstruct(iπ(k) ). Hence 1 2 [(iπ(k) , sπ(k) , τπ(k) )] −→ [reconstruct(iπ(k) ), τπ(k) ), (iπ(k) , sπ(k) , τπ(k) )].

After the conversion, as an object must be reachable before extracted, a dummy action reach rearrange(V ∗ , iπ(k) ) can be inserted before each reconstruct(iπ(k) ) action. The action is dummy as the object is already reachable and nothing needs to be done. The object set V ∗ , as argued before, includes all reconstructed objects in {iπ(1) , iπ(2) , . . . , iπ(k−1) }. Hence [reconstruct(iπ(k) )] −→ [reach rearrange(V ∗ , iπ(k) ), reconstruct(iπ(k) )]. ˜ i (s0 ) for some object i can be changed to Then any action that reveals part of O i ∗ ˜ 0 ∗ vis rearrange(V , Oi (si ), where V includes fully reconstructed objects at that time. For simplicity of analysis, it is safe to assume reconstruct(i) only reconstructs object i but does not reveal occlusion, adding an extra action afterward. ˜ i (s0 )) must be after reconstruct(i). In such cases, vis rearrange(V ∗ , O i After applying the above conversions, there may still be unconverted actions that do not reveal any part of occlusion. Since these actions must be applied to a reconstructed object at that time, it is safe to merge them to the next reach rearrange(·, ·) or vis rearrange(·, ·) primitive. After the above steps, the

Safe, Occlusion-Aware Manipulation

277

solution is changed to a sequence of primitives, which can be found by Algorithm 1 probabilistically and asymptotically as the number of iterations increases. Hence the algorithm is probabilistically complete. The framework 1 is probabilistically complete but depends on two rearrangement primitives, which may run inﬁnitely. The time complexity of the resolutioncomplete version of the framework is included in the Appendix section. A monotone algorithm is introduced in the following section to handle practical cases. 5.3

A Monotone Algorithm

There are two primitives which may require repeated moves of some objects: vis rearrange(·, ·) when multiple calls are required to reveal an occluded region; and when object rearrangement requires repeated moves of objects. The monotonicity can thus be deﬁned where each directly occluded region can be revealed by one vis rearrange(·, ·) action, and object rearrangement moves each object at most once. Algorithm 2 solves the case when both subproblems are monotone. The algorithm rearranges reconstructed objects so that the object to be

4 1

4

5 2

3

(a) Initial scene

1

4

5 2

3

(b) Intermediate scene

1

(c) Rearrangement

5 2

3

(d) After reconstruction

Fig. 2. One iteration of the monotone algorithm. The 2D projection of objects and the ODG are illustrated. In (a), objects 1, 2, and 3 are fully revealed and highlighted in ODG. 5 is hidden and hence is dashed. 4 is partially occluded. After reconstruction of 1, 2, and 3, in (b), they are marked by double circles in ODG. 4 and 5 become fully revealed and hence highlighted. Before extracting 4, the sweeping volume shown in red in (c) and the vision cone shown in blue are visualized. The blocking objects 1 and 2 are rearranged by sampling placements. Then 4 is extracted and reconstructed, with the occlusion reduced as in (d).

278

Y. Miao et al.

extracted and reconstructed next is not blocked or visually occluded by others. Thus, the direct occlusion space can be fully revealed after extracting the object. As both reachability and visual occlusion are tackled in the rearrangement primitive, the algorithm deﬁnes a primitive rearrange(i) that combines reach rearrange(V, i) and vis rearrange(V, O). As illustrated in Fig. 2 step (c), during rearrangement, the algorithm computes the sweeping volume and the vision cone in front of the object to check blocking objects and visual occlusion. If objects that have not been reconstructed exist in this region, then rearrange(i) fails as they cannot be safely moved. Otherwise, the primitive rearranges all reconstructed objects in this region. It may also move other reconstructed objects outside this region to open space for placements. After that, It samples and veriﬁes placements to avoid the computed region. Finally, a trajectory is found by motion planning and executed. The primitive reconstruct(i) decomposes the trajectory into a suction phase, an extraction phase, a sensing phase, and a placement phase. The suction phase computes a trajectory to move the manipulator to a valid suction pose. The extraction phase then extracts the object out of the shelf. The sensing phase completely reconstructs the object by a sequence of in-hand moves. Finally, the placement phase places the object back to its initial pose and resets the ˜ i (s0 ) is revealed when object i is reconrobot. The algorithm guarantees that O i structed. Then following the topological order of ODG, it reconstructs all objects safely and reveals the entire scene. The algorithm is probabilistically complete for monotone cases following a similar proof. However, experiments show it can solve many practical tasks and perform better than the constructed baselines.

Algorithm 2: Target Extraction Under Occlusion Monotone Algorithm 1 2 3 4 5 6 7 8 9 10 11 12 13

initialize reconstructed objects V¯ = ∅, failure count NF = 0 while |V¯ | = n do ˆ ← perception() visible objects S, revealed objects V , occlusion region O ¯ i ← (V \ V )[0] rearrange(i) ˆ updated status, newly revealed object set V˜ , updated occlusion region O, ˆ object model Xi ← reconstruct(i) if status == FAILURE then V ← V [1 :] ∪ {i}, NF ← NF + 1 else reset failure count NF ← 0 V ← V ∪ V˜ , V¯ ← V¯ ∪ {i} if NF == |V \ V¯ | then return FAILURE

Safe, Occlusion-Aware Manipulation

6

279

System Design

The system setup uses the left arm of a Yaskawa Motoman robot with a suction gripper. It is an 8 DOF kinematic chain including the torsional joint, with a static RGBD camera. Voxels are used to model the workspace. Each voxel can be free, occupied, or occluded. The occlusion space of each object is computed given ˆ i (si , t) = {x : φ(x) ∈ φ(Xi (si )), d(x) ≥ the depth and segmentation images by: O depth(φ(x))}. The net occlusion up to time t is obtained through the intersection of occlusion at each time. Each object is modeled via voxels and TSDF for reconstruction [25]. Lastly, objects’ pixel positions and depth values are used to determine their occlusion relationship, which is suﬃcient for convex objects6 . The occlusion caused by the robot is ﬁltered out using the ground-truth segmentation image obtained from the simulator and is not used for occlusion update. To sample viable placements in rearrange(i), the system uses a novel method by convolution. It ﬁrst projects the collision space and object geometry to 2D maps, then computes a convolution of them to obtain collision-free 2D placement poses. Samples are then drawn from this set and validated by IK and collision check. The system reconstructs an object by moving it to several waypoints outside the shelf. Waypoints are sampled such that the information gain (infogain), computed as the volume of unseen space of the object, is maximized, similar to [17]. Figure 3 illustrates the process, where the pose with the max info-gain is selected as the next waypoint. The system uses the PyBullet physics engine [7] to model the scene. The RGBD and segmentation images captured by the simulated camera are used for perception. Motion planning is achieved through MoveIt [6].

(a) Sample and select pose with max info-gain

(b) After reconstruction

Fig. 3. Object reconstruction by sampling poses and computing “info-gain”. Sensing is done at several waypoints, each obtained through the process shown in the ﬁgure. Potential poses are ﬁrst sampled with info-gain computed. Then the pose with the max info-gain is used as the next waypoint for sensing.

6

The details are included in the Appendix: https://arxiv.org/abs/2205.11719.

280

7 7.1

Y. Miao et al.

Experiments Experimental Setup

A random and a greedy baseline are constructed for comparison. The random baseline uniformly samples objects to pick and locations to place them. The greedy one follows prior work [14] by sampling choices of objects and placements and picking the action with the max info-gain, which is the volume of revealed occlusion. Both methods ensure safety by moving only fully revealed or reconstructed objects and reconstructing objects before other actions. The greedy baseline prioritizes reconstructing unreconstructed objects if it is safe to do so. As the methods can be running for an inﬁnite amount of time due to randomness, they are terminated once all objects are reconstructed or a timeout is reached. Hence the number of objects is given only in both baselines but not available in the proposed algorithm. The three algorithms are compared using a PyBullet simulated environment with random rectangular prisms and cylinders on the shelf, similar to the setting in prior work [15]. Experiments include three scenarios with 5, 8, or 11 objects. Each scenario has three diﬃculty levels deﬁned according to the size of objects. Objects are considered small if the width and length scale is 4 cm or 5 cm, medium if 7 cm or 8 cm, and large if 10 cm or 11 cm. Heights are randomly picked for each object from 9 cm, 12 cm, and 15 cm. The easy level includes all small objects; the medium level includes 60% small, 30% medium, and 10% large objects; the hard level includes 25% small, 50% medium, and 25% large objects. Each scenario contains 10 randomly generated problem instances with at least one fully-hidden object. The instances are ﬁxed across algorithms to allow fair comparison. A total of 5 trials of each algorithm is executed for each instance to account for the randomness in the algorithm. Hence in total, there are 90 instances, with 450 total trials for each algorithm. Easy, medium, and hard levels are denoted by indices 1, 2, and 3. Experiments are denoted as n-k for n objects in the scene with diﬃculty level k. Due to a large number of trials, the timeout is chosen to be 250 s for 5 objects; 350 s, 400 s, and 450 s for 8-1, 8-2, and 8-3; and 500 s, 600 s, and 700 s for 11-1, 11-2 and 11-3. In particular, it increases linearly as the diﬃculty level increases. Each trial is considered a success if all objects are reconstructed before timeout and failure otherwise. The random and greedy algorithms fail to reveal the occlusion volume even when all objects are reconstructed. In contrast, the proposed algorithm fully explores the environment at termination in most trails, which is illustrated in Table 1. Success rate across scenarios. For tasks with rectangular prisms and cylinders, a total of 450 trials are run for each algorithm. For tasks with YCB objects, a total of 100 trials are run for each algorithm. Algo

5-2

5-3

8-1

8-2

8-3

11-2

11-3 ycb-5 ycb-8

Random 68%

5-1

68%

54%

26%

16%

16 % 0 %

11-1

2%

8%

46%

10%

Greedy

86%

68%

70%

38%

40%

36%

14%

24% 84%

46%

Ours

100% 100% 100% 100% 100% 94% 100% 100% 98% 96% 90%

22%

Safe, Occlusion-Aware Manipulation

281

the remaining occlusion volume shown in Fig. 5. Experiments also include YCB objects [4] obtained from prior work [21] with 5 or 8 random objects. As the objects are larger than objects in simple geometry experiments, the scene with 11 objects is more cluttered and harder to solve, taking signiﬁcantly more time. Hence YCB experiments with 11 objects are omitted and left for future work after the system is optimized. For each number of objects, there are 10 instances, and each instance is executed for 5 trials7 . All experiments were executed on an AMD Ryzen 7 5800X 8-Core Processor with 31.2 GB memory. 7.2

Results and Analysis

As shown in Table 1, the proposed algorithm achieves a near 100% success rate in all scenarios, consistently better than random and greedy algorithms given a large number of trials. The running time and number of actions of the three algorithms are provided in Fig. 5. The number of actions is at least the number of objects in each scene but may be more if objects need multiple moves for rearrangement. The proposed method needs less time to solve problems and fewer actions. It fully reveals the scene, consistently outperforming the two baselines. The proposed algorithm’s implementation is not optimized and is left as future work. The latency of each component, including perception, motion planning, pose sampling, and rearrangement has been measured. Rearrangement takes more time as diﬃculty increases, with its time ratio ranging from 9% on scenario 5-1 to 30% on scenario 11-3. The remaining time is spent mainly on motion planning and ROS communication, both taking roughly 30%. Pose generation and perception take 23% and 17% of the time without rearrangement. The number of calls to each component increases as the number of objects increases.

Fig. 4. Example setup for 8 YCB objects. 7

Code and videos: manipulation.

https://sites.google.com/scarletmail.rutgers.edu/occlusion-

282

Y. Miao et al.

(a) Time (s)

(b) Number of actions

random greedy ours

(c) Remaining occlusion volume

Fig. 5. Box plots showing running time (a), number of actions (b), and the remaining occlusion volume (c) of the algorithms. The red line in (a) indicates the predetermined timeout for each scenario to compute success ratio in Table 1.

Given this, the average time of each component can be derived, with 0.18 s on average for perception, 0.86 s for motion planning, 0.78 s for pose generation, and 0.13 s for ROS communication. Depending on the number of objects, each rearrangement call takes 10 s to 37 s across the experiments. Failure cases are often caused by timeout due to unsuccessful motion planning queries and rearrangement routines. Therefore, more eﬃcient implementations of the motion planning and rearrangement components will improve performance.

8

Conclusion and Future Work

This work ﬁrst provides a classiﬁcation of occlusion-aware manipulation tasks and then focuses on the safe object reconstruction in conﬁned spaces with inplace rearrangement. This work deﬁnes the notion of monotonicity for this challenge. It proposes a probabilistically complete framework with an eﬃcient algorithm instantiation for monotone instances. Experiments in simulation show that the proposed method attains a high success rate with less computational time and fewer actions than a random and a greedy baseline. As the current work focuses on a foundational understanding of the domain, the optimization of each primitive component is left for future work. A more informed search can also optimize the algorithm to reduce the number of actions needed. Furthermore,

Safe, Occlusion-Aware Manipulation

283

this work will be integrated with perception primitives for object pose tracking and reconstruction to address real-world platforms.

References 1. Ahn, J., Lee, J., Cheong, S.H., Kim, C., Nam, C.: An integrated approach for determining objects to be relocated and their goal positions inside clutter for object retrieval. In: ICRA, pp. 6408–6414 (2021) 2. Bejjani, W., Agboh, W.C., Dogar, M.R., Leonetti, M.: Occlusion-aware search for object retrieval in clutter. In: IROS, pp. 4678–4685 (2021) 3. Bohg, J., et al.: Interactive perception: leveraging action in perception and perception in action. IEEE TRO 33(6), 1273–1291 (2017) 4. Calli, B., Singh, A., Walsman, A., Srinivasa, S., Abbeel, P., Dollar, A.M.: The VCB object and model set: towards common benchmarks for manipulation research. In: ICAR, pp. 510–517 (2015) 5. Cheong, S.H., Cho, B.Y., Lee, J., Kim, C., Nam, C.: Where to relocate?: object rearrangement inside cluttered and conﬁned environments for robotic manipulation. In: ICRA, pp. 7791–7797 (2020) 6. Coleman, D., Sucan, I., Chitta, S., Correll, N.: Reducing the barrier to entry of complex robotic software: a moveit! case study. arXiv:1404.3785 (2014) 7. Coumans, E., Bai, Y.: Pybullet: a python module for physics simulation for games, robotics and machine learning, pp. 2016–2021. https://pybullet.org/wordpress/ 8. Danielczuk, M., et al.: Mechanical search: multi-step retrieval of a target object occluded by clutter. In: ICRA, pp. 1614–1621 (2019) 9. Demaine, E.D., Fekete, S.P., Lang, R.J.: Circle packing for origami design is hard. arXiv:1008.1224 (2010) 10. Dogar, M.R., Koval, M.C., Tallavajhula, A., Srinivasa, S.S.: Object search by manipulation. Auton. Robot. 36(1), 153–167 (2014) 11. Gao, K., Lau, D., Huang, B., Bekris, K.E., Yu, J.: Fast high-quality tabletop rearrangement in bounded workspace. In: ICRA (2022) 12. Garrett, C.R., et al.: Integrated task and motion planning. Annual Rev. Control Robot. Auton. Syst. 4, 265–293 (2021) 13. Garrett, C.R., Paxton, C., Lozano-P´erez, T., Kaelbling, L.P., Fox, D.: Online replanning in belief space for partially observable task and motion problems. In: ICRA, pp. 5678–5684 (2020) 14. Gupta, M., R¨ uhr, T., Beetz, M., Sukhatme, G.S.: Interactive environment exploration in clutter. In: IROS, pp. 5265–5272 (2013) 15. Huang, H., et al.: Mechanical search on shelves using a novel “bluction” tool. In: 2022 International Conference on Robotics and Automation (ICRA), pp. 6158– 6164. IEEE (2022) 16. Huang, H., et al.: Mechanical search on shelves using lateral access x-ray. In: IROS, pp. 2045–2052 (2020) 17. Huang, K., Hermans, T.: Building 3D object models during manipulation by reconstruction-aware trajectory optimization. arXiv:1905.03907 (2019) 18. Kumar, K.N., Essa, I., Ha, S.: Graph-based cluttered scene generation and interactive exploration using deep reinforcement learning. In: ICRA (2022) 19. Kurenkov, A., et al.: Visuomotor mechanical search: learning to retrieve target objects in clutter. In: IROS, pp. 8408–8414 (2020)

284

Y. Miao et al.

20. Li, J.K., Hsu, D., Lee, W.S.: Act to see and see to act: POMDP planning for objects search in clutter. In: IROS, pp. 5701–5707 (2016) 21. Liu, Z., et al.: OCRTOC: a cloud-based competition and benchmark for robotic grasping and manipulation. IEEE RA-L 7(1), 486–493 (2021) 22. Lu, S., Wang, R., Miao, Y., Mitash, C., Bekris, K.: Online model reconstruction and reuse for lifelong improvement of robot manipulation. In: ICRA (2022) 23. Mitash, C., Shome, R., Wen, B., Boularias, A., Bekris, K.: Task-driven perception and manipulation for constrained placement of unknown objects. IEEE RA-L 5(4), 5605–5612 (2020) 24. Nam, C., Cheong, S.H., Lee, J., Kim, D.H., Kim, C.: Fast and resilient manipulation planning for object retrieval in cluttered and conﬁned environments. IEEE TRO 37(5), 1539–1552 (2021) 25. Newcombe, R.A., et al.: KinectFusion: real-time dense surface mapping and tracking. In: ISMAR, pp. 127–136 (2011) 26. Novkovic, T., Pautrat, R., Furrer, F., Breyer, M., Siegwart, R., Nieto, J.: Object ﬁnding in cluttered scenes using interactive perception. In: ICRA (2020) 27. Price, A., Jin, L., Berenson, D.: Inferring occluded geometry improves performance when retrieving an object from dense clutter. In: ISRR (2019) 28. Qureshi, A.H., Mousavian, A., Paxton, C., Yip, M.C., Fox., D.: NeRP: neural rearrangement planning for unknown objects. In: RSS (2021) 29. Stilman, M., Schamburek, J.-U., Kuﬀner, J., Asfour, T.: Manipulation planning among movable obstacles. In: ICRA, pp. 3327–3332 (2007) 30. Wang, R., Gao, K., Yu, J., Bekris, K.: Lazy rearrangement planning in conﬁned spaces. In: ICAPS (2022) 31. Wang, R., Miao, Y., Bekris, K.E.: Eﬃcient and high-quality prehensile rearrangement in cluttered and conﬁned spaces. In: ICRA (2022) 32. Wang, R., Mitash, C., Lu, S., Boehm, D., Bekris, K.E.: Safe and eﬀective picking paths in clutter given discrete distributions of object poses. In: IROS (2020) 33. Xiao, Y., Katt, S., Ten Pas, A., Chen, S., Amato, C.: Online planning for target object search in clutter under partial observability. In: ICRA (2019)

Systems and Design

Cooperation of Assistive Robots to Improve Productivity in the Nursing Care Field Yasuhisa Hirata(B) , Jose Victorio Salazar Luces, Ankit A. Ravankar, and Seyed Amir Tafrishi Department of Robotics, Tohoku University, Sendai 980-8579, Japan [email protected] https://srd.mech.tohoku.ac.jp/ Abstract. This paper introduces an overview of the “Adaptable AIenabled Robots to Create a Vibrant Society” project, which is part of the “Moonshot R&D Program” led by the Cabinet Oﬃce of Japan. We also introduce CARE, Cooperation of Ai-Robot Enablers, which are being researched and developed to improve productivity in the nursing care ﬁeld. The importance of building an educational system for the successful use of advanced technologies will also be presented, and then we propose a nursing care motion guidance system using AR glasses that allows non-expert caregivers to learn appropriate nursing care. Keywords: Moonshot program · CARE coordination · Nursing care guidance

1

· Multiple robots

Introduction

The authors are currently working on a project under the “Moonshot R&D Program” led by the Cabinet Oﬃce of Japan. The moonshot R&D program is a generally 5-years (maximum 10-years) program in which the government sets ambitious goals (moonshot goals) that attract people and promotes challenging R&D to address important social issues such as a super-aging society and global warming [1]. Moonshot Goal 3 is to “Realization of AI robots that autonomously learn, adapt to their environment, evolve in intelligence and act alongside human beings, by 2050”, and one of the sub-targets is to develop “AI robots that humans feel comfortable with, have physical abilities equivalent to or greater than humans, and grow in harmony with human life” [2]. To achieve this goal, the authors are working on a project titled “Adaptable AI-enabled Robots to Create a Vibrant Society” [3]. The outline of this project in 2050 is to create a vibrant society with multiple AI robots installed in various public facilities such as commercial, cultural, tourist, sports, nursing, hospital, childcare, etc., and maintained as social infrastructure. The AI robots we are developing can change their shape and form according to the location of use and the user’s condition, automatically select the appropriate AI robot from a group of heterogeneous AI robots with various c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 287–294, 2023. https://doi.org/10.1007/978-3-031-25555-7_19

288

Y. Hirata et al.

functions, and provide support and services to the user alone or with multiple AI robots. This will enable anyone to use AI robots anytime, anywhere, and will create a vibrant society in which all people can actively participate in society, in other words, a “Smarter Inclusive Society”. The research results of this project are expected to be used in a wide range of ﬁelds, including nursing care, childcare, sports, and education. In the initial phase of the project, we envision the use of AI robots especially in the nursing care ﬁeld, aiming to solve social issues such as the shortage of nursing care workers. The introduction of AI robots at nursing care facilities will not only reduce the burden on nursing care workers but also improve the quality of care by appropriately combining multiple AI devices and nursing care robots to accommodate the various disabilities of care recipients as shown in Fig. 1. In addition, we are collaborating with the “Platform Project for the Development, Demonstration, and Dissemination of Nursing Care Robots” being promoted by the Ministry of Health, Labour and Welfare. In this project, we aim not only to present new and innovative solutions that can be used in the future as a moonshot project but also to propose realistic solutions that can be adapted in the near future by utilizing existing AI devices, sensors, nursing care robots, etc. Through these projects, the authors are developing several AI robots that can improve productivity in the nursing care ﬁeld. In particular, this paper introduces CARE -Cooperation of Ai-Robot Enablers-, an IoT-based assisted care technology that links existing AI devices, sensors, and nursing care robots via the Internet. In addition, we introduce a nursing care motion guidance system using AR (Augment Reality) glasses to enable anyone to provide appropriate care in the midst of the shortage of caregivers.

Development of actuators and mechanisms with adjustable softness, firmness, and flexibility, and development of robots that can change shape according to the purpose

Year 2050 AI robots as social infrastructure that can provide various types of support without limiting the places of application

Development of a group of AI robots to accompany individuals in their lives Development of assistive technology that does not provide excessive support but makes use of people's remaining and latent abilities Make people think that they can do it by themselves with this robot

Development of simulation technology to automatically determine which AI robots will assist when, where, who, what, and how

Fig. 1. Project concept for nursing care in 2030

Cooperation of Assistive Robots to Improve Productivity

2

289

CARE -Cooperation of Ai-Robot Enablers-

In the current nursing care ﬁeld, various robots and monitoring sensors are already in use. However, most of these robots and sensors are used as standalone devices. We have also proposed many stand-alone assistive robots such as sit-to-stand assist robot [4], robotic walker [5], cycling wheelchair robot [6], etc. While there are some sites where they can be used successfully, there are many sites where they have been introduced but are not being used. In order to introduce robots and other advanced technologies to nursing care facilities, we believe that (1) various devices must be able to be used in cooperation with each other, and their combinations must be freely changeable according to various forms of support, (2) they must be easy to use without specialized knowledge, and (3) it must be possible to verify whether they can be used in the ﬁeld before they are actually introduced. To overcome these kinds of issues, other relevant projects that have used socially assistive robots for long-term support in homes or care facilities have been proposed [7,8]. Our project particularly focuses on providing appropriate support to the user by distributing tasks to an IoT-based multi-robot control system where each robot can do a speciﬁc task. Unlike a single robot system, our proposed system is less complex, and task distribution and allotment between the robots can be eﬃciently handled without hindering the overall task at hand. In order to solve the above issue (1), the authors have developed a common control framework using ROS, Robot Operating System that connects devices in order to cooperatively use commercially available robots, assistive devices, interfaces, sensors, etc. The ROS is a common operating system that provides software developers with libraries and tools to support the creation of robot applications [9]. In recent years, not only university researchers but also companies have been adopting ROS for robot control and sensor measurement, making it possible to link various robots and sensors. On the other hand, even if engineers can utilize robots and sensors using ROS, there are many people who are not familiar with these devices and ROS programming in the actual nursing care ﬁeld. Therefore, it is necessary not only to make it easy to use these devices but also to appropriately recombine multiple devices depending on the situation and the target person to solve the above issue (2). The authors have constructed a system that can easily operate in cooperation with existing IoT (Internet of Things) sensors/actuators, AI speakers, and ROS-compatible robots by connecting them to the Internet. These systems also enable the construction of robot and sensor operation interfaces via IFTTT, If This Then That [10]. IFTTT is a web-based service that uses APIs related to web-based services to trigger something actions. It is said that there are more than 700 IFTTT-compatible services, enabling users to choose the interface that best suits their needs. Furthermore, if there is a system that can verify in advance the cooperative operation of multiple robots and sensors in response to changes in the support scenarios, robot devices, sensors, and environment, it will be possible to conﬁrm the degree to which these systems can be used at the actual nursing care site before introducing them to the site. This is an issue shown above (3). To address

290

Y. Hirata et al.

Fig. 2. Living lab simulator

this issue, the authors have developed a virtual environment called the living lab simulator as shown in Fig. 2, which can verify in advance the cooperative operational conﬁguration of multiple robots and sensors and their operational algorithms in response to changes in support scenarios, robots, sensors, environments, and so on. Using this simulator, it is possible to realize a digital twin of the nursing care ﬁeld and to study the most suitable operation method of nursing care robots and sensors for each nursing care ﬁeld. In addition, the living lab simulator can help determine the form of support provided by the multi-robots in real-time. The project is developing a micro-level simulator for human musculoskeletal simulation, a macro-level simulator for determining how an entire group of robots should behave, and so on. By combining these several levels of simulators, the project aims to be able to determine which robot should provide assistance, when, where, to whom, and how.

3

Verification Experiment for CARE in the Living Lab

The authors have established the “Aobayama Living Lab” as shown in Fig. 3 at Tohoku University’s Aobayama Campus as a center for research and development of nursing care robots, ﬁnding both new and innovative solutions for the future in 2050 and realistic solutions that can be applied in the near future. The Aobayama Living Lab is a 250 m2 space that simulates both nursing home and home care environments and is equipped with various sensors and nursing care robots necessary for research and development. In this section, we show an example of a supported scenario in which multiple robots and sensors are operated in a cooperative manner using IoT sensors/actuators, AI speakers, and ROS-enabled robots installed in the Aobayama Living Lab. Although there are many possible scenarios in which robots can be used in a nursing care facility, as the ﬁrst eﬀort in this study, a scenario was constructed from the perspective of

Cooperation of Assistive Robots to Improve Productivity

291

Fig. 3. Aobayama living lab at Tohoku university

morning activities. In this scenario, the patient gets up from bed, drinks water, then moves to the living room using a walker-type robot, and then goes for a walk outside using a wheelchair-type robot. For example, (1) the back of the bed is raised and adjusted to an appropriate height for standing up by calling “Good morning” to the AI speaker, (2) the object transport robot automatically brings water based on the call “I’m thirsty”, (3) a walker-type robot automatically comes to the user when the user calls out “I want to go to the living room”, and (4) a wheelchair-type robot picks the user up at the front door when the user calls out “I want to go for a walk”. The scenario is shown in Fig. 4. The authors’ research and development in the Moonshot project aim to realize human-centered support that encourages the user’s independent movement rather than providing excessive support or services to the user. Therefore, this project focuses on “self-eﬃcacy,” which is the recognition of one’s own potential, and aims to improve “self-eﬃcacy” by accumulating users’ success experiences between users and AI robots. In other words, we aim to design the behavior of the AI robot so that the user feels that he/she can do it by himself/herself if he/she works with the AI robot. Currently, the above scenarios are created in advance, and the robots operate according to the scenarios. In the future, we intend to develop simulation technology that automatically determines which AI robot will assist when, where, who, what, and how, and technology to improve self-eﬃcacy by making use of a person’s remaining and latent abilities without excessive assistance.

292

Y. Hirata et al.

t = 0:01

t = 0:20

t = 1:59

t = 2:20

t = 0:35

t = 0:45

t = 2:36

t = 2:53

t=

t = 1:15

t=

t = 3:59

0:57

3:44

(1) & (2)

(3)

t = 4:14

t = 4:38

t = 5:04

t = 5:18

t=

t = 5:34

5:25

(4)

Fig. 4. Validation of CARE in morning activities scenario

4

Nursing Care Guidance Using AR Glasses

As indicated in the previous section, various cutting-edge technologies will be introduced into the nursing care ﬁeld in the future. However, for those who have little experience with such technologies, it is not easy to introduce them into nursing care ﬁeld. This does not simply mean that technicians should enter the nursing care ﬁeld, but people who understand nursing care well should be able to use these advanced technologies. This will make it possible to respond to individual needs by combining various sensors, actuators, robots, AI speakers, etc., in accordance with the characteristics of the site and the patient being cared for. If we can educate such human resources, we will be able to provide many individualized supports even with current technology, without waiting for the development of high performances robots such as a humanoid robots. In addition, in the future, more and more people, especially in Japan, will want to

Cooperation of Assistive Robots to Improve Productivity

293

Fig. 5. Guidance using AR glasses for novice in the nursing Care

care for at home, and this will create problems that caregivers not only of young families but also of the elderly have to provide care for the family members. In such cases, family members with little care-giving experience will need to provide various types of support, but it is likely that many situations will arise in which they will not know what type of support is appropriate. To solve these educational problems, the authors are developing a system using AR glasses, which can project a virtual environment such as people, objects, text, etc. onto the real scene to show what care actions should be performed depending on the situation as shown in Fig. 5. For example, when transferring a patient from a bed to a wheelchair, the system can show the caregiver where the patient should lie in bed, what position the arms should be in, how the legs should be bent, etc., according to the actual condition of the person receiving care. Also, in the case of transferring a patient from an endoccupant position to a wheelchair, it is possible to conﬁrm in real-time where and at what angle the wheelchair should be positioned and how to support the transfer operation from there. We believe that this will help to realize nursing care movements that are appropriate for each individual in a place, rather than using instructional books or video materials. In addition, the system can warn the caregiver if he or she is about to perform an incorrect or inappropriate care-giving action or if the caregiver is in a potentially dangerous position if he or she continues with the action. This allows the caregivers themselves to perform care-giving actions with peace of mind. When caregivers are able to perform care-giving without warnings, they can feel that their own skills have improved, and as a result, their sense of self-eﬃcacy can be improved.

294

5

Y. Hirata et al.

Conclusions

This paper introduced the Moonshot project that the authors are currently working on, and presents, in particular, CARE -Cooperation of Ai-Robot Enablers-, which is a cooperative assistance technology that responds to the individuality of care-receivers by cooperatively using multiple AI robots, assistive devices, sensors, interfaces, and other devices. We also focused on the importance of building an educational system for the successful use of advanced technologies. For such education we proposed a nursing care motion guidance system that enables caregivers to provide appropriate nursing care even if they are not skilled caregivers. We hope that the active use of these technologies will help improve the productivity of nursing care facilities in the future. Acknowledgements. This work was partially supported by JST [Moon Shot Type Research and Development Project] Grant No. [JPMJMS2034].

References 1. Moonshot R&D program in Japan. https://www8.cao.go.jp/cstp/english/ moonshot/top.html. Accessed 12 May 2022 2. Moonshot goal 3. https://www.jst.go.jp/moonshot/en/program/goal3/index.html. Accessed 12 May 2022 3. Moonshot project homepage by authors. https://srd.mech.tohoku.ac.jp/ moonshot/en/. Accessed 12 May 2022 4. Hatsukari, T., et al.: Self-help standing-up method based on quasi-static motion. In: IEEE International Conference on Robotics and Biomimetics (ROBIO 2008), pp. 342–347 (2009) 5. Hirata, Y., et al.: Motion control of intelligent walker based on renew of estimation parameters for user state. In: IEEE International Conference on Intelligent Robots and Systems (IROS 2006), pp. 1050–1055 (2006) 6. Hirata, Y., et al.: Regenerative brake control of cycling wheelchair with passive behavior. In: IEEE International Conference on Robotics and Automation (ICRA 2013), pp. 3873–3879 (2013) 7. Vogel, J., et al.: An ecosystem for heterogeneous robotic assistants in caregiving: core functionalities and use cases. IEEE Rob. Autom. Mag. 28(3), 12–28 (2020) 8. Di Nuovo, A., et al.: The multi-modal interface of robot-era multi-robot services tailored for the elderly. Intel. Serv. Robot. 11(1), 109–126 (2018) 9. Quigley, M., et al.: ROS: an open-source robot operating system. In: ICRA Workshop on Open Source Software, vol. 3, no. 3.2 (2009) 10. IFTTT homepage. https://ifttt.com/. Accessed 12 May 2022

Adaptive Radiation Survey Using an Autonomous Robot Executing LiDAR Scans in the Large Hadron Collider Hannes Gamper1,2(B) , David Forkel1,3 , Alejandro D´ıaz Rosales1,4 , Jorge Play´ an Garai1 , Carlos Veiga Almagro1,3 , Luca Rosario Buonocore1 , Eloise Matheson1 , and Mario Di Castro1 1

CERN - European Organization for Nuclear Research, Meyrin, Switzerland 2 Institute of Robotics, Johannes Kepler University Linz, Linz, Austria [email protected] 3 Department of Engineering and Architecture, Jaume I University, Castell´ on de la Plana, Spain 4 Department of Cognitive Robotics, Delft University of Technology, Delft, The Netherlands Abstract. At CERN, radiation surveys of equipment and beam lines are important for safety and analysis throughout the accelerator complex. Radiation measurements are highly dependent on the distance between the sensor and the radiation source. If this distance can be accurately established, the measurements can be used to better understand the radiation levels of the components and can be used for calibration purposes. When surveys are undertaken by the Train Inspection Monorail (TIM) robot, the sensor is at a constant distance from the rail, which means that it is at a known distance and height from the centre of the beam line. However, the distance of the sensor to the closest surface of the beam line varies according to what kind of equipment is installed on the beam line at this point. Ideally, a robotic survey would be completed with online adaption of the sensor position according to the equipment present in the LHC. This new approach establishes a scan of the surface with a 2D LiDAR while moving along the tunnel axis in order to obtain a 3D scan of the environment. This 3D scan will be used to generate online trajectories that will allow the robot to accurately follow the beam line and thus measure the radiation levels. Keywords: Hazardous environment LiDAR · Radiation

1

· Autonomous inspection ·

Introduction

CERN operates the world’s largest accelerator complex, providing high energy particle beams to an international community of physicists. The 4 Large Hadron Collider (LHC) [8] experiments (ATLAS, CMS, LHC-b and ALICE) at the socalled collision points of the LHC proton beams oﬀer researchers the opportunity H. Gamper, D. Forkel, A. D. Rosales and J. P. Garai—Equal contribution of authors. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 295–303, 2023. https://doi.org/10.1007/978-3-031-25555-7_20

296

H. Gamper et al.

to study the basic constituents of matter. Besides the injectors LINAC4, Proton Synchrotron (PS) and Super Proton Synchrotron (SPS), the 26.7 km long LHC represents the largest accelerator of the complex. Due to the planned dumping, collimation and collisions of the beam particles as well as accidental degraded beam transmission, a loss of beam particles occurs in the accelerators. These collide with other particles or matter, producing radioactive isotopes by various nuclear processes. As a result, accelerator components, as well as liquids and air in the close environment can become radioactive. The largest source of radiation exposure of the personnel during maintenance and repair of the complex is residual radiation, which results from the radioactive decay of the induced radioactive isotopes [1]. The main objective of Radiation Protection (RP) at CERN focuses on minimizing the exposure of individuals to ionizing radiation. In addition, the reduction of the radiological impact on the surrounding environment is assigned a major role. The execution of radiation surveys of the accelerators are an essential part of CERN’s ALARA (as low as reasonably achievable) approach. In the general survey, a continuous radiation measurement is carried out in the sectors of the LHC. This has two main purposes: 1. Establishment of a radiological risk assessment and dose planning for all works carried out by CERN personnel in the accelerator complex. 2. Identiﬁcation of beam-loss locations and optimization of transmission. General radiation surveys in the sectors of the LHC have historically been performed by RP personnel after a short cool down period of the machine after a beam stop. They measure the radiation dose throughout the LHC on foot or by bike. Based on this procedure, radiation levels were measured at non constant distance from the beam line and the clustered environment as well as non linear behavior of the radiation levels with respect to the distance, made it very diﬃcult to derive accurate results. Thus, robots increase the eﬃciency of the operation by oﬀering a precise and repeatable survey with respect to the sensor distance from the beam line. Furthermore, the overall downtime of the accelerator can be decreased not only by faster measurements but also due to the reason that a robot can start the scan immediately after a beam stop. Most importantly though, robots can reduce the radiation dose of personnel. In the LHC, a robotic system specialized in inspection and measurement called the Train Inspection Monorail (TIM) has been developed and installed on a ceiling mounted monorail. The main purpose of this robot consists of the functional test of Beam Loss Monitors (BLM), the visual inspections, and the execution of the general radiation survey [2,3]. TIM represents a multi-functional and eﬃciency-enhancing tool to perform inspections and radiation measurements in an hazardous environment. By attaching a radiation sensor to the end-eﬀector of the robot arm present on a TIM wagon, a radiation scan can be achieved under a standardized measurement condition and timing. This paper presents the system to perform a full robotic radiation scan of the LHC, with adaptive trajectories that optimize the positional measurements of radiation along the beam line. In Fig. 1 the cross section of the LHC tunnel including TIM is shown.

Robotic Radiation Survey

297

Fig. 1. Cross section of the LHC tunnel with TIM, see also [8].

2

Methodology

Currently, when TIM performs a survey, the radiation sensor is ﬁxed with respect to the rail. Thereby, the distance between the sensor and the beam line central axis can be calculated at every point of the tunnel. However, the distance between the sensor and the closest surface of the beam line varies according to which kind of equipment is installed on the beam line. This variation is not currently taken into account by the robotic system. Future scans should instead vary the position of the sensor in order to stay at a constant distance from the closest surface of the beam line equipment. This problem can be approached in a number of ways. As the beam line equipment installation points are pre-known, an oﬄine trajectory can be generated from a Look-Up table. This method would be a robust solution in an ideal scenario where the position of the robot in the LHC is perfectly known. Furthermore this intervention is conducted in a semi-structured environment and hence not all obstacles (e.g. ladders, maintenance equipment or cable, etc.) are known a-priori, see Fig. 2. Instead, an online trajectory can be computed based on feedback of the environment. The success and safety of this trajectory depends on the selection of adequate sensors to measure the environment, the robot arm state and an algorithm that can run in real time on the robot. This can be achieved through a set of diﬀerent instruments such as radars, LiDARs, RGB-D cameras that are further described in Sect. 3. On the upper right in Fig. 1 TIM is visualized including the LiDAR that lies in the origin of the reference coordinate system with axes x and y. The dashed blue lines indicate the range (alpha = 75◦ ) in which the LiDAR records data. The red line visualizes the recorded data representing

298

H. Gamper et al.

Fig. 2. The TIM robot in the LHC (left), the robotic wagon when the arm is deployed (middle) and obstacles that may not be modeled (right).

the proﬁle of the installation in the tunnel and the dashed green lines indicate the Region of Interest (ROI) with the hight roiHeight and width roiW idth. The vector rP represents the point of the installation proﬁle with the highest x value, called Point of Interest (POI). The radiation sensor will be positioned at a distance d = 400 mm from the POI in direction x-axis, represented by rE.

3

Robotic System

The TIM robot consists of multiple wagons, each responsible for diﬀerent aspects of the system shown in Fig. 3. For the radiation survey, the robotic wagon, equipped with a robotic manipulator and gripper, is used. For complex trajectory generation, a wagon consisting of a custom 3 DOF rotation and translation R Jaco 2 robot arm has been developed. stage that holds a commercial Kinova This allows for a 9 DOF arm that has the workspace to reach the beam line with suﬃcient motion complexity to control the end eﬀector pose in 3D space [7].

Fig. 3. Configuration of the TIM robot with the LiDAR at the front [2].

The robotic wagon is equipped with pan and tilt cameras, static cameras and RGB-D cameras. However to accurately measure the distance to the correct surface of the beam line from the radiation sensor, these cameras may not have the necessary resolution or range. For this reason, to image the beam line, a LiDAR has been used to map the environment. Through the uni-dimensional movement of the TIM base, a 3D scan of the environment can be achieved using a 2D LiDAR as in [5,6], placed in a vertical structure of the train perpendicular to the train movement as presented in Fig. 3. The chosen LiDAR is a 360◦ C omnidirectional laser range scanner placed upstream of the robot arm, in a static position on the TIM. The robotic arm holds

Robotic Radiation Survey

299

R the radiation sensor, a dosimetric detection unit type BDKG24 from Atomtex that measures gamma radiation in the range from 30 nSv/h–1 Sv/h with a measuring frequency 50 Hz at a baud rate of 19200 bps. To acquire and process the sensor information, an onboard PC is present, which also runs the control algorithms for the robotic arm on the wagon.

4

Data Acquisition and Processing

To test the feasibility of the described method, a scan of an LHC tunnel section was performed. The robotic platform was moving at a constant speed vR while recording the raw data from the LiDAR scans. The maximum speed of the robotic platform vR depends on the LiDAR frequency and the maximum displacement of the LiDAR during one turn. Thus, the maximum train speed can be written as vR,max < dmax fLiDAR ,

(1)

with the maximum train speed vR,max , the maximum LiDAR displacement per turn dmax and the LiDAR frequency fLiDAR . For the feasibility test, the TIM ran a section of 3167 m at an average velocity of 1 m/s. Data was recorded every second. This results in approximately 3200 points of raw data, visually depicted in Fig. 4.

Fig. 4. Raw data of 100 m of the LHC tunnel.

After the raw data was recorded with the LiDAR, the data undergoes a linear and rotational transformation in order to compensate oﬀsets due to the mechanical placement of the sensor. Then a next step of ﬁltering removes the outliers present in the raw data. Outliers are points that exceed the known diameter of the tunnel. This is only possible if the LiDAR sensor is not exactly directed perpendicular to the tunnel walls. A rotation of the sensor can not be detected and thus the recorded points would suggest an increase in tunnel diameter, which is not the case in the whole complex. These outliers and all data points recorded during the same revolution of the LiDAR sensor will be removed. This ensures a reduction of noise even for the circular area with a diameter less then the tunnel diameter. Later, a gaussian ﬁlter is applied to reduce noise in the remaining data. In a ﬁnal step, the data will be reduced to a set of points that lie in a Region of Interest (ROI), deﬁned by a rectangle as shown in Fig. 5.

300

H. Gamper et al.

Fig. 5. Obtaining the ROI of the section. This process reduces computational cost and focuses on the necessary parts of the scan.

Once the ROI is obtained, the relevant control points for the trajectory are extracted. These must be 40 cm from the closest surface of the beam line, at the beam line height.

5

Online Trajectory Planning and Control

The data points identiﬁed in the ROI are split up along the beam line longitudinal axis in equidistant segments of length SL, see Fig. 6. Within each segment the point with the most signiﬁcant impact to the radiation will be identiﬁed. Then, a smooth path, using a sinusoidal method, see [4,9], will be deﬁned. This method creates a smooth response to a step input that is diﬀerentiable until the fourth derivative and thus ideal for mechanical system with gear elasticity. The maximum ﬁrst derivative of the sinusoid path νS can be found considering two other constraints, the maximum end-eﬀector velocity of the robotic arm vE and the maximum speed of the robotic platform vR , to ensure the trajectory can be achieved by the robotic system. This can be written in the form νS =

vE . vR

(2)

Considering all constraints it is now clear that the survey time per meter τ is a function of vR , vE and the segment length SL and thus can be written as τ = f (vR , vE , SL).

(3)

To follow a step of e.g. 0.7 m with an end-eﬀector speed of 0.25 m/s, it was found that a train speed of 0.2 m/s was required, which is achievable with the robotic platform. The path according to the above mentioned parameters and constraints is visualized in Fig. 6. The control algorithm to ensure that the arm accurately follows the generated trajectory is presented in Fig. 7. Since there is a time disparity between the

Robotic Radiation Survey

301

Fig. 6. Recorded online trajectory over 100 m. The green lines indicate the segments with length SL.

collection of the data and when it will be required by the arm, the block diagram is divided into 2 diﬀerent frames: t and t + k, where k is the time for the train to move the arm to where the LiDAR recorded the data. The calculated trajectory is added to the execution list and carried out when the arm reaches the start of the trajectory. Previously executed trajectories are deleted. The time k must be suﬃcient to allow the system to do the necessary calculations and extract the trajectory before arriving to the execution position. If the necessary time for the calculations is bigger than k, the results can be appended to the full trajectory. From here, they can be fetched when the train arrives to the exact position, and executed through the control loop.

Fig. 7. Block diagram of the full control system.

The arm control loop is a velocity feed-forward [10] type. After the calculation of the trajectory in t + k, the system waits until t to execute it. The desired next point of the trajectory (qd , q˙d , q¨d ) is extracted, and sent to the control loop to follow.

6

Conclusion

This paper presents the concept of a completely automated and adaptive system for robotic radiation surveys along the LHC tunnel. The rail guided system, combined with a six degree of freedom robotic arm, establishes a constant distance

302

H. Gamper et al.

to the closest beam line surface, allowing for precise surveys to be conducted. The use of a 2D LiDAR on the robotic system can be meshed to create a 3D scan of the tunnel environment which is used to generate an online trajectory planner that adapts to changes in the semi structured environment of the LHC tunnel. The calculations for simple sinusoid path planners are feasible in the required control loop time frames using embedded PCs, and create smooth trajectories for the robotic arm. The ﬁrst simulations using recorded data from the LHC tunnel have shown great improvements in survey times, reliability and accuracy with respect to the real activation of installation material in the tunnel. Future proof of concept studies will review and validate these simulated results in the real environment using the TIM robot. The system can also be adapted to be used on ground based mobile robotic platforms for radiation surveys in other experiments at CERN where no monorail is installed. The orientation of the tangential plane on the POI will be taken into account in order to establish the optimal orientation of the radiation sensor. These systems will lead towards the creation of automatic, repeatable and precise radiation surveys across the full accelerator complex at CERN. Evaluation of the repeatability and accuracy of the positioning system as well as the resulting accuracy of the radiation measurements compared to non adaptive trajectories will be subject to future work.

References 1. Forkel-Wirth, D., Silari, M. (eds.): Radiation Protection Group Annual Report 2009 (2010). https://cds.cern.ch/record/2221663/files/AnnRep-RP-2009.pdf 2. Di Castro, M., et al.: i-TIM: a robotic system for safety, measurements, inspection and maintenance in harsh environments. In: 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), pp. 1–6 (2018). https://doi.org/ 10.1109/SSRR.2018.8468661 3. Di Castro, M., Ferre, M., Masi, A.: CERNTAURO: a modular architecture for robotic inspection and telemanipulation in harsh and semi-structured environments. IEEE Access 6, 37506–37522 (2018). https://doi.org/10.1109/ACCESS. 2018.2849572 4. Gandolfo, D., et al.: Path following for unmanned helicopter: an approach on energy autonomy improvement. Inf. Technol. Control. 45 (2016). https://doi.org/10.5755/ j01.itc.45.1.12413 5. Kim, H., Choi, Y.: Location estimation of autonomous driving robot and 3D tunnel mapping in underground mines using pattern matched LiDAR sequential images. Int. J. Min. Sci. Technol. 31(5), 779–788 (2021) 6. Neumann, T., et al.: Towards a mobile mapping robot for underground mines. In: Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium, Cape Town, South Africa, pp. 27–28 (2014) 7. Pachidis, T., et al.: Forward kinematic analysis of JACO2 robotic arm towards implementing a grapes harvesting robot. In: 2020 International Conference on Software, Telecommunications and Computer Networks (SoftCOM), pp. 1–6. IEEE. https://doi.org/10.23919/SoftCOM50211.2020.9238297

Robotic Radiation Survey

303

8. Potter, K.M.: The Large Hadron Collider (LHC) project of CERN (No. LHCProject-Report-36) (1996). https://cds.cern.ch/record/308243 9. Wu, X.W., et al.: Sinusoidal path planning for attitude maneuver of flexible spacecraft. Appl. Mech. Mater. 532, 187–190 (2014) 10. Zheng, Q., Gao, Z.: Motion control design optimization: problem and solutions. Int. J. Intell. Control Syst. 10(4), 269–276 (2005)

Reference-Free Learning Bipedal Motor Skills via Assistive Force Curricula Fan Shi(B) , Yuta Kojio, Tasuku Makabe, Tomoki Anzai, Kunio Kojima, Kei Okada, and Masayuki Inaba The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan [email protected] http://www.jsk.t.u-tokyo.ac.jp

Abstract. Reinforcement learning recently shows great progress on legged robots, while bipedal robots in high dimensions but narrow solution space are still challenging to learn. The typical methods introduce the reference joints motion to guide the learning process; however, obtaining a high-quality reference trajectory is nontrivial, and imitation suffers from the local minimum. For general reference-free scenarios, the bipedal robot is discouraged by the early termination and biased sample collection. Inspired by the assistive learning commonly shown in biped animals, we introduce the assistive force to aid the learning process without the requirement of reference trajectories. The learned assistant could be a curricula to lead motor skills learning and is eliminated in the end to shape the learned motion to be plausible. We analyze the assistive system and verify its effectiveness in multiple challenging bipedal skills (Videos and supplementary materials: https://fanshi14.github.io/me/isrr22.). Keywords: Bipedal locomotion · Reinforcement learning · Curriculum learning

1 Introduction Bipedal robots have the potential as a general solution in the human-centered environment. On the one hand, they could adapt to the environment with the similar reachability as humans [1, 7, 11]; on the other hand, they could help better understand the human body and intelligence [3, 17]. Recently, with the development of learning techniques in robotics and computer graphics, bipedal robots have shown great progress in several complex motor skills, such as achieving multiple dynamic gaits [24], robust locomotion on uneven terrain [14, 25], and imitating complex human motion behavior [19, 35]. The learning approach, especially the model-free deep reinforcement learning, shows the advanced performance in controlling bipedal robots in these tasks. Compared to the optimizationbased controller, the learning approach could largely reduce the online computational costs with the benefit of the neural-network-based controller [8]. Compared to the model-based controller, the model-free approach does not assume the environment model parameters (such as friction coefficients), which helps achieve robust performance against the environment change and unexpected scenarios [13, 15, 23]. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 304–320, 2023. https://doi.org/10.1007/978-3-031-25555-7_21

Learning Bipedal Skills via Assistance

305

Fig. 1. Proposed assistive learning framework for bipedal robots with 3 curriculum stages: (a). apply the assistive force to guide the skill learning; (b). gradually reduce the influence of assistance; (c). completely remove the assistance. We demonstrate its effectiveness for reference-free scenarios with several bipedal skills and humanoid platforms. High-resolution video: https:// fanshi14.github.io/me/isrr22

Most of the previous work in bipedal robots requires the prior reference motion to help guide the learning procedure. The reason is the biased sample collection during the bipedal learning process [19]. Compared to other multi-legged robots, such as quadrupedal robots, the bipedal robot has a more narrow solution space during locomotion but higher dimensions and non-linearities to control. Without the reference guidance, the robot would immediately fail in the initial learning stage for the most tasks, which makes it difficult to collect effective samples to learn, and the collected samples are limited to be partial without the whole-stage information of the task. However, obtaining feasible motion reference data is not easy. The motion reference is the trajectory of each joint because most of the controller directly controls in the joint level. To reduce the noisy and verbose motion, the high-quality reference data is preferred, which could be obtained from the specific human motion capture data [19, 35] or predeveloped robot controller [14, 26]. We see the disadvantages of these reference-based methods as follows: – imitation from human data occurs in dynamic mismatches between human data and robot configuration, which could lead to less optimal and even infeasible motion behavior [35] – imitation from a model-based controller requires an amount of work by firstly developing a feasible model-based controller and then imitating in the learning fashion [26] – robustness and generalization ability of the imitated controller would decline compared to learning from the scratch [24, 25] In this work, we develop the reference-free learning framework for bipedal robots. To tackle the immediate termination and biased-samples problem, we apply the assistive force to the robot body in the initial learning stage. The inspiration comes from the nature: (1). The human infant learns biped walking with parents’ assistance in the first stage; (2). The bird learns new bipedal skills with its wing’s assistance initially. The assistive force helps the bipedal agent explore the environment and collect effective

306

F. Shi et al.

Daily locomotion

Dynamic skills

Gymnastics

Biped animals

(b)

(c)

(d)

Fixed

(a)

Fig. 2. Multiple assistive learning behaviors appeared in nature, and achieved bipedal skills in this paper with the assistive force as a curricula.

samples with more safety support in the initial stage. To achieve independent motion behavior (infants) or save energy consumption (birds), the assistive force would be gradually reduced and eliminated in the following stage as shown in Fig. 1. Therefore, the assistive force could be utilized as the curriculum to guide the learning process. Motivated by these biped learning behaviors, we could develop a similar bio-inspired learning framework for the bipedal robot. However, there still exists several fundamental robotic problems for applying the assistive force, including: (1). How many forces to apply; (2). Where to apply; (3). How to calculate the force; (4). ranges of the force, etc. We will discuss these problems in the following sections of this article. We verify the effectiveness of the proposed framework on several challenging bipedal tasks and demonstrate the learned controller in the simulation and real robot. To our best knowledge, it is the first work to analyze the assistant format and demonstrate several challenging bipedal skills as shown in Fig. 2 under reference-free settings. We verify the learning process on 5 robots with different kinematics, including the asymmetrical configuration which has large dynamic mismatches with the usual human data. We hope the proposed bio-inspired method could open up new possibilities for general bipedal learning and benefit the multi-skills framework in future work.

2 Related Work 2.1

RL-Based Legged Controller

The legged controller based on deep reinforcement learning (RL) has recently achieved great progress. Quadrupedal robots show superior performance with the learned controller in both blind [8, 13] and perceptive locomotion [15]. Among them, curriculum learning [4] is widely leveraged to guide behavior learning. Driven by both animation and robotic communities, the bipedal robot with RL-based controller has obtained great attention meanwhile. Animation researchers focus on imitating the human motion and utilize reinforcement learning to robustify the imitated controller [19, 27, 35]. Most robotic researchers adopt a similar idea to lead the bipedal robot to learn with

Learning Bipedal Skills via Assistance

307

the pre-defined reference motion [14, 26, 31]. Few researchers attempt to develop the framework without the reference trajectory [24, 25], in which several walking gaits are successfully learned and demonstrated on the robot. 2.2 Assistive Control Previous work on bipedal assistive control is mainly discussed in computer graphics. Some researchers leverage the external assistance force to achieve the physically implausible motion on their computer agents [5, 10, 18, 30, 35]. They focus on generating high-quality motion digitally and are not serious about the external assistive force not physically existing in the real world. Some improve it by reducing the helper force as an optimization progress [16]. For learning-based methods, some researchers develop the PD-based assistive force and eliminate it in the end to learn a walking controller in the RL fashion [29, 34], in which [29] requires the reference motion to guide and [34] shows a running controller to track the target speed. Motivated by the multiple successful assistive learning behavior in nature, we believe more complex motor skills and several fundamental problems in assistive force deserve further investigation. 2.3 Learning with Recovery In case of the learning failure, researchers have recently developed the recovery policy coping with the original controller for manipulation [28] and legged locomotion [22, 33], in which the control policy and recovery policy are trained simultaneously. For the post-failure stage, the reset policy could be learned as well to automatically reset the environment and make the agent continue to learn [32]. On the contrary, some researchers develop the adversarial agent to robustify the controller through the recovery response [20, 36], in which the problem becomes a zero-sum game. Our framework is different from both sides. Instead of executing the recovery policy only close to or after the task failure, the assistance agent is expected to participate during the whole stage. Our goal of the assistive force is learned to be cooperative other than being adversarial to the robot controller.

3 Problem Formulation 3.1 Notation Our system is implemented in 3D settings. Only proprioception information could be obtained during the task, including the joints position and IMU sensor. The notation would be denoted as follows (Table 1):

308

F. Shi et al. Table 1. List of symbols utilized in the problem formulation Symbol

Description

N

Number of the controlled joint

θ = [θ1 , θ2 , ..., θN ] θ˙ ∈ RN

T

Angular velocity vector of the controlled joint

θ d ∈ RN

Target angle vector of the controlled joint

{W }, {CoG}

World frame, center-of-gravity frame

M

Total mass of the robot

p = [px , py , pz ]T

Position vector of the robot

v = [vx , vy , vz ]T q = [qw , qx , qy , qz ]

3.2

Angle vector of the controlled joints

Linear velocity vector of the robot T

Orientation the robot in quaternion form

w = [wx , wy , wz ]T

Angular velocity vector of the robot

FT

Assistive force vector

a, s

Action space, observation space

Problem Statement

The bipedal control problem could be formulated as the Markov decision process (MDP). MDP is a mathematical framework for modeling discrete-time stochastic control processes. It is defined by (S, A, R, P, γ), including states, actions, reward function, state transition probabilities, and discounter factor. The policy π is a distribution over action given states as π(at ∈ A|st ∈ S) and receive reward rt ∈ R through interaction with the environment, following the transition probability P (st+1 |st , at ). The objective of RL agent is to find the optimal policy π ∗ to maximize the totally discounted reward t t γ rt . We utilize the neural network to approximate the control policy and predict the probability distribution of actions. For bipedal robots, the actions would be related to the joints command of legs and assistive force. The assistive force only appears in the simulation during the initial learning stage. It is eliminated in the final learning stage. Therefore, the assistive force does not physically exist on the eventually learned controller.

4 Learning with Assistive Force Bipedal skills are difficult to learn without motion reference. Inspired by animal behaviors in nature, we propose the idea of learning with the assistive force curricula. In this section, we will discuss the detailed settings for bipedal robots to answer the questions like how many forces, where to apply the assistance, how to compute the assistive force, and how to develop the curricula for robots. 4.1

Assistance Placement

Bipedal animals and robots are both in a multi-link fashion. There are many choices to place the assistive force. In nature, the assistance from parents is mostly applied to the torso and arms of a baby. The assistance could be placed on the lower body as

Learning Bipedal Skills via Assistance

309

well, considering the artificial rehabilitation device such as the lower-limbs exoskeleton system. In computer agents, the assistance is applied to the torso, feet, or each link of the robot in the previous work. When applying the assistive force on each link, the agent usually has the reference joints trajectory to imitate and generate the PD-feedback assistance [10, 29]. For reference-free scenarios, computing the force values becomes difficult because it requires a large extra action space. When directly applying the assistive force on the feet, the agent could have unexpected joints displacement on the ankle joints because of the respectively large drag force from the assistance. Different from the goal of supporting implausible motion with the residual force in [35], our goal is to assist the learning process and eliminate it in the end, for which the joints displacement values may generate the misleading samples during learning. In addition, because feet usually have larger movement during a skill, nonlinearities concerning the CoG become more complex than applying to the torso. Another limitation is it could not evenly cover the wrench space. When applying on feet, the assistive torque in the frontal plane is much weaker because of the short moment arm. Applying the assistance to the hands could have the similar problem.

CoG CoG

Ground

(a)

(b)

Fig. 3. Placement analysis of assistive force. (a). relative position on robot body in the frontal plane; (b). analysis with ground contact in the sagittal plane, in which the robot is simplified as a single rigid body and the influence of assistive force is approximated as a 2D pivoting problem.

Therefore, we place the assistive force on the robot torso. The assistive force has less negative interference on the robot joint motion than end effectors. Because the torso takes up a large percentage of the total weight, the assistant’s moment arm varies in small magnitude during motion. We limit an individual assistive force in each axis to be no more than 20% of the robot weight: |FTi | ≤ 0.2 M g. If the assistive force is respectively too small, it could not have the effective influence comparing to the scenario without assistance. If the assistive force is too large, the robot could even “fly” and the learned motion would mainly rely on assistance, which could not help the robot master the expected skills by itself. To avoid the robot collecting too biased samples to cause a local minimum, early termination is commonly defined in the bipedal learning [19], for which one of the critical judgments are based on the rotation angle. If the bipedal robot is in a large roll/pitch angle, the subsequent samples are regarded to be biased and early termination would be immediately triggered to reset the simulation. With the proposed assistive force, the bipedal robot becomes more stable in the early stage. To avoid collecting the biased samples, we limit the maximum recovery ability as supporting a tilted robot

310

F. Shi et al.

within 30◦ rotation angle. We simplify the analysis as a static 2D pivoting problem in the sagittal plane as shown in Fig. 3(b), in which the robot is approximated as a single rigid body with the point contact. The force and momentum in the equilibrium state are as follows: F˜T + F˜g − M g˜ = 0 p˜T × F˜T + p˜g × F˜g − p˜{CoG} × M g˜ = 0

(1) (2)

where F˜T , F˜g , g˜ ∈ R2 denote the assistive force, ground contact force and gravity accleration in the sagittal plane, and p˜T , p˜g , p˜{CoG} ∈ R2 denote the moment arm of each force. When analyzed in the contact frame, p˜g = 0, |θi | ≤ 30◦ . Therefore, p˜T could be computed to satisfy the conditions mentioned above. 4.2

Assistance Quantity

The quantity of assistive forces varies in different settings: 1. applying the assistive force on each link, and the number of forces equals the links’ number. For high-dimensions bipedal robots, it is difficult to learn assistance forces in a large number. 2. applying to the end effectors, and two separate forces are on both hands or feet. 3. applying to the torso: in animation, a single force is placed on the pelvis to assist the forward walking behavior and maintain lateral balance [34]; in nature, two force sources are available in birds to generate the versatile recovery behavior during bipedal motion. To be flexible for multiple bipedal learning tasks while avoiding too large action space, we apply two sets of the 3D assistive force with a placement offset on the robot torso. Comparing to the single-force placement, two 3D forces could generate almost the full 6D wrench space (except the moment about the line connecting two forces) to assist the behavior learning. Our experiments in Sect. 5 show the current setting could be effective in multiple complex skills, such as the backflip and rotation jump, while the single-force placement fails because it could not assist in the full wrench space. If applying a single torque on the torso alternatively, learning would fail in these complex skills with a similar reason. Comparing to directly applying 6D assistive wrench, the assistive force is more intuitive to develop and could be generalized to the physical thruster-assisted robot [22] as well. Therefore, the relative position of two assistants could be defined as shown in Fig. 3(a). For the placement offset of two assistive forces, we further analyze its wrench space. The bipedal robot is more likely to generate motion in the sagittal and frontal plane, while the movement in the transverse plane is less frequent. To be feasible for learning multiple behaviors, we require the maximal torque generated by the assistive force to be equal in the sagittal and frontal plane and 1.5 times the maximal torque in the transverse plane concerning the CoG of the robot. Therefore, the robot motion during learning in the simulation is represented as: ˙ + g(θ) = J (θ)T Fg + τ + wstage 2 JT (θ)T FT (3) M (θ)θ¨ + C(θ, θ) i i i=0 where M , g, C denote the inertia matrix, gravity, Coriolis and centrifugal terms. τ ∈ RN +6 denotes the generalized actuated space, Fg , FTi denote the contact force and assistive force, and J (θ), JTi (θ) denote the related Jacobian matrix. wstage denotes the gate parameter which decides whether applying the assistive force on the robot.

Learning Bipedal Skills via Assistance

311

4.3 Assistance Embedment In this subsection, we discuss the manner to embed the assistive force into the learning framework with the following two problems: Force Computation: PD-Based vs. Learned Agent In the previous work, given the motion reference, the assistive force is calculated through a PD controller to track the target joints trajectory of multiple behaviors [10, 29]. In contrast, for reference-free scenarios, the PD-based assistance control could help for the forward walking task [34]; however, a simple PD-based force controller could not work well when learning the complex behavior, because it is difficult to define the mid-time target state to track. Therefore, to make our system more flexible for behavior learning, we choose to directly learn the assistive force. It would be learned together with the robot joints command through trials and errors in the simulation. The learned two sets of forces would be in 6 dimensions and applied to the torso as discussed above. Agent Format: Additional Policy vs. Dual Policy The original control policy is to output the target joints command of the robot. To embed the assistant agent into the original framework, there are two alternatives to learn the assistive force: – Additional policy: build one unified policy by extending the action space with 6 more dimensions to learn two sets of 3D assistive force – Dual policy: keep two separate policies and define the action space of the assistant policy to be the assistive force The advantage of the dual policy is to have the minimum influence on the original control policy owing to the separate framework, which is beneficial especially when the assistive state is in high dimensions; while the advantage of the additional policy is the robot controller being aware of the assistive force during training, which leads to better performance. Since the dimensions of the assistive force are much less than the bipedal actuated joints space, we choose the additional policy as shown in Fig. 4 to uniformly learn the robot joints command and assistive force values.

Joint commands

Assistive force

Action state ……

Hidden layers ……

Observation state

……

Fig. 4. Additional policy extends the action space with the additional assistance state and learns the joints command and assistive force simultaneously.

312

4.4

F. Shi et al.

Curriculum Stages

Assistive force could help the animation agent learn the complex behavior and even the physically implausible motion [10, 35]. Nevertheless, the goal in the robotic application is different because the assistive force could not exist on the real robots. We thereby leverage the assistance as a curricula to guide the bipedal robot in the curriculum learning fashion: – Stage 1: robot learning the behavior from scratch with the assistive force – Stage 2: robot continues to learn while enlarging the punishment reward on the assistive force to eliminate it gradually – Stage 3: robot disables the assistive force and achieves the behavior learning without any assistance The whole learning process could be summarized as Algorithm 1. Algorithm 1. Learning control policy with assistive force - initialize policy weights W - initialize state-action transition set S ∈ ∅ - set wstage = 1 in Eq. 3 to enable assistive force repeat -initialize robot configuration repeat - calculate action from current policy at = πW (at |xt ) - clip action and run simulation as Eq. 3 to integrate next state st+1 - collect reward as Eq. 9 to calculate total reward rt - append (st , at , rt , st+1 ) into transition set S until S reaches to batch size - update policy weight W as Sect. 4.5 - if Stage 1 and success rate over threshold, set Stage 2 and reduce kT in Eq. 7 to encourage smaller assistive force - if Stage 2 and assistive force lower than threshold, set Stage 3 and wstage = 0 in Eq. 3 to disable assistance until learning converges

4.5

Learning Settings

In this subsection, we briefly introduce the settings during policy learning, including the observation and action space, reward function, and learning algorithm. Observation and Action Space As discussed in the Sect. 4.3, the action space consists of both joints related state and two assistive forces vector. We leverage the joint position controller to send commands to the robot. Comparing to the joint torque or current controller, it requires less control

Learning Bipedal Skills via Assistance

313

frequency in the loop [6]. For the joints action state, we prefer the residual joints angle δθ d to be learned during training. The target joints angle is computed as follows: θd = θ + δθ d

(4)

in which the target joints angle equals the addition of the current joints angle and learned residual joints command. Compared with the method of directly learning the target joints angle, the proposed residual command based on the previous joints command could prevent the large command variation caused by noisy joints position data obtained from encoders. Therefore, the whole action space is as follows: a = [δθ d , FT ]

(5)

We only utilize the proprioception information during bipedal tasks. The main information in the observation state comes from the joints and IMU data. The observation space is as follows: ˙ θd , q, w, φ] s = [θ, θ,

(6)

in which the joints target angle is of the previous time point t − 1, joints velocity could be replaced by the previous joints angle for the robot without joints velocity input, and φ denotes the variable related to the phase. Reward Function The learned controller is expected to behave in the following aspects: – – – – – – –

2

tracking the target command rcmd = kcmd ∗ e−kecmd ||δs cmd || smooth joints command ra = ka ∗ ||θd − θdprev || ˙ low energy consumption re = ke ∗ ||θτT · θ|| −k ||ζ ||2 foot being parallel to the ground during contact rζf = kζf ∗ e eζf f low frequency in the contact switch rc = kfc ∗ ||fc − fcprev || low contact velocity in the tangent plane rcv = kcv ∗ (||vftan || + ||vftan ||) r l minor oscillation of the robot body rω = kω ∗ ||ω||

Besides the general reward terms mentioned above, there are a few specific taskrelated terms. During walking, there are the foot clearance reward to encourage a distinct foot step and avoid the dragging motion and periodic reward to shape the gait pattern. The detailed mathematical representation of each reward term could be referred in our supplementary material. We extend the legged reward terms rLegged with an assistance reward rAssist to gradually reduce the assistive force: rAssist = kT ∗ (||FT [0:3] || + ||FT [3:6] ||) r

Legged

=r

+r +r +r +r +r r = rLegged + rAssist

cmd

a

e

ζf

c

cv

(7) +r

ω

(8) (9)

where FT [0:3] , FT [3:6] denote the two sets of assistance force individually, and kT is a negative weight to punish the assistive force.

314

F. Shi et al.

Termination Judgment During learning, we advocate the early termination to avoid robot learning biased samples with the local minimum [19]. The termination would be triggered as follows: – – – –

robot in self-collision robot in collision with links except for the end effector CoG height less than a threshold value (robot tends to fall) robot state varying the target state over the threshold (task tends to fail)

Learning Algorithm The actor-critic Proximal Policy Optimization (PPO) method [21] is leveraged for optimizing the reinforcement learning problem. The policy and value function are represented as a multi-layer perceptron network with the Tanh activation function. The policy and value network have the same observation input and same hidden layers in architecture but with different output dimensions. All the learning process is conducted in the simulation.

5 Experiments To fully verify our proposed framework, we learn multiple humanoid whole-body behaviors on several robot platforms with different configuration settings. The robots include the KXR robot [2] with 5-DoF legs and 4-DoF arms, Atlas robot [1] with 6-DoF legs and 1-DoF chest, and Jaxon robot [11] with 6-DoF legs and 5-DoF arms to actuate. The learned behaviors include walking and several dynamic skills (Fig. 5). The learning procedure is conducted in the Raisim simulation equipped with a highquality contact solver [9]. Our program is implemented on a desktop with Intel Xeon W-3225 CPU (3.70 GHz 8 cores), Nvidia RTX 3090 GPU (24 GB), and 192 GB RAM.

(b)

Fixed

(a)

(c)

Fig. 5. Experiments of the learned controller on humanoid whole-body locomotion behavior: (a). walking controller is learned on the standard and asymmetrical KXR robots and demonstrated in multiple real-world scenarios; (b). walking controller is learned on the JAXON robot and achieves robust push recovery performance; (c). loco-manipulation on a ball. More quantitive analysis could be referred in our supplementary material: https://fanshi14.github.io/me/isrr22sup.pdf

Learning Bipedal Skills via Assistance

315

The simulation runs at 250 Hz and our neural controller operates at 50 Hz. We train each policy for 90,000 steps with 7,500 epochs, which takes about one day on our desktop. 5.1 Example 1: Locomotion Our proposed method is verified with the humanoid locomotion behavior. The robot is commanded to move following the given speed and direction. We verify the framework with different robot configurations. First, we train the control policy for a standard KXR robot with a symmetrical hardware design. We also learn the controller for the asymmetrical KXR robot with a fixed actuator on a knee or hip joint on one leg. The policy is successfully trained with the same reward setting and demonstrated on the real robot. In addition, we learn the controller for a life-size JAXON robot and compare its performance with the previous model-based controller in the simulation. Comparison 1: Learning Without Assistive Force Locomotion is the basic motor skill for a humanoid robot, but learning without reference motion is still challenging. We conduct the comparison experiment by keeping the same reward setting but removing the assistive force. Without the assistant, the robot failed to learn the behavior after long iteration steps. The main reason is the humanoid robot with arms has large action space but narrow solution space. The agent keeps immediate failure in the initial stage and obtains biased samples to learn. In contrast, the agent with the assistive force continues to explore the action space and collect valuable samples as the attached video. We compare the success rate in 5 different tasks. When without the assistive force, the robot only successfully learns the bipedal jumping task but fails to learn other whole-body behavior (Table 2). Table 2. Comparison of convergency on 5 tasks with same reward and training steps Conditions

Whole-body walk Bipedal jump Loco-manipulation Backflip Rotation jump

Without assistance Fail

Succeed

Fail

Fail

With assistance

Succeed

Succeed

Succeed Succeed

Succeed

Fail

Comparison 2: Performance Against Model-Based Controller We compare our learned controller with a model-based controller [12] developed with the capture point with the linear inverted pendulum model (LIPM). A KXR robot is commanded to walk on the multiple terrain with different difficulties. As Table 3 shows, the learned controller achieves better performance in the challenging terrain, which is because the model-based controller has several assumption on the terrain type and friction setting. For the push recovery reaction, we conduct the quantitive comparison on JAXON robot in the simulation. Our learned controller shows more robustness, especially in the lateral direction when disturbance force is against the support foot.

316

F. Shi et al.

Table 3. Comparison of robustness of the walking behavior with a model-based controller, content means the average survival steps before falling Conditions

Flat terrain Sliding terrain Robber shoes Rock terrain

Model-based controller [12] No failure 7.3 steps No failure 36.3 steps

Learned controller

5.2

2.8 steps

3.8 steps

No failure

47.4 steps

Example 2: Dynamic Skills

We also verify our learning framework with dynamic motor skills. We conduct continuous rotational jumping as shown in Fig. 6 and forward jumping as shown in Fig. 8 on the Atlas robot simulator. The assistive force is shaped to decrease during Stage 2 in jumping learning as the plot in Fig. 6(e). Robot masters the skill without the assistant in Stage 3. More quantitive analysis could be referred in our supplementary material. Another dynamic motor skill is backflipping. Inspired by Boston Dynamics’s video [1], the robot is expected to achieve a full back flip within 1 s. We do not provide any keyframe or reference joints trajectory during learning. The only information is the agent has 1 s to prepare and another 1 s to achieve the 360◦ rotation. With the proposed assistive curricula, our robot successfully learns this challenging skill in the simulation.

Ratio of assistive force in Stage 2 0.6

0.5

(a)

(b)

0.4

0.3

0.2 0

(c)

(d)

1

2

3

4

5

6

7

8

9 k iterations

(e)

Fig. 6. Experiments of continuous rotation jumping: (a)–(d). screenshots of jumping with 180◦ rotation; (e). trend of assistive force during Stage 2, in which the ratio is the sum of two sets of the assistive force with respect to the robot weight.

Comparison 3: Learning with Other Assistance Placement We compare the learning performance with the different placement of the assistive force. The assistance is expected to be applied to the torso of a bipedal robot. In the comparison experiments, we testify the placement of robot’s hands or feet in the rotation jumping task. The agent fails on Stage 1 as the supplementary video because of the joints displacement and non-uniform assistive wrench space discussed in Sect. 4.1.

Learning Bipedal Skills via Assistance

317

Comparison 4: Learning with Different Assistive Magnitude As discussed in Sect. 4.2, the magnitude of both assistive forces in each axis is less than 20% of the robot weight. Firstly, the tiny maximum assistive force could not be effective as well. In our experiment, the force upper bound is set as 2% of the robot weight, and the learning fails after long iteration steps. On the contrary, we testify the upper bound being 60% of the robot weight. In the rotation jumping task, assistive force lifts the robot because its sum is larger than robot gravity, and robot fails to learn the plausible behavior as the supplementary video. In addition, the robot could collect biased samples which could not effectively help the learning process with large assistance as Fig. 7.

Fig. 7. Robot with large assistive forces collects the biased meaningless samples during walking learning, in which the upper bound of each assistance is set as 40% of the robot weight.

Comparison 5: Learning with Single Assistance We compare the different quantity of the assistant. When applying a single assistive force on the robot’s pelvis as [34], the agent fails to learn the skill after long learning steps. The main reason is the single assistance force could not cover the 6D wrench space to generalize to different assistive tasks. We also testify it with the single 3D assistive torque, but the learning process fails for the similar reasons. Torque [Nm] 150

(a)

(b)

(c)

(d)

0

-150 0

(e)

10

Fig. 8. Experiments of bipedal jumping: (a)–(d). screenshots of jumping with 0.25 m/s; (e). joints torque plot during the movement.

318

F. Shi et al.

6 Conclusion and Discussion In this paper, we propose a reference-free bipedal learning framework with the assistive force as a curricula. The learning process is divided into three curriculum stages to shape the robot’s behavior. We analyze the placement, quantity, and format of the assistive force for a bipedal robot. Our experiments show that this approach successfully learns the control policy for multiple challenging skills with varied robot configurations. There are still several problems to be discussed in future work: 1. Multi-skills assistive learning: multi-tasks learning is still a challenging problem for legged robots. The assistive learning behavior in nature has shown great success in aiding multi-skills learning. The assistive force supports both single-skill motion and transitional motion between two skills to avoid the early termination and biased sample collection. We are interested in extending our framework with multi-skills learning. 2. Theoretical analysis based on game theory: the assistant and robot controller could be regarded as the cooperative agents in a non-zero-sum game. However, the assistance agent may “spoil” the robot controller and hinder effective learning if the assistant is not prepared properly. There is theoretical analysis for adversarial control agents in the zero-sum game [20, 36], while cooperative agents require further investigation. 3. Neuroscience comparison with natural intelligence: inspired by human beings and birds in nature, we propose the assistive learning framework with similar curriculum stages. We are interested in more neuroscience clues about the similarities and differences between the behavior of the living creature and our robot during learning, which helps better understand both natural and machine intelligence. Acknowledgment. We would like to thank Prof. Marco Hutter and Dr. Kaiqing Zhang for their insightful discussion.

References 1. Boston dynamics atlas (2022). https://www.bostondynamics.com/atlas. Accessed 01 May 2022 2. Kondo kxr robot (2022). https://kondo-robot.com/product-category/robot/kxrseries. Accessed 01 May 2022 3. Asano, Y., Okada, K., Inaba, M.: Design principles of a human mimetic humanoid: humanoid platform to study human intelligence and internal body system. Sci. Robot. 2(13), eaaq0899 (2017) 4. Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48 (2009) 5. Chapman, D., Daoust, T., Ormos, A., Lewis, J.: Weightshift: Accelerating animation at framestore with physics (2020) 6. Chen, S., Zhang, B., Mueller, M.W., Rai, A., Sreenath, K.: Learning torque control for quadrupedal locomotion. arXiv preprint arXiv:2203.05194 (2022) 7. Hirai, K., Hirose, M., Haikawa, Y., Takenaka, T.: The development of Honda humanoid robot. In: Proceedings. 1998 IEEE International Conference on Robotics and Automation (Cat. No. 98CH36146), vol. 2, pp. 1321–1326. IEEE (1998)

Learning Bipedal Skills via Assistance

319

8. Hwangbo, J., et al.: Learning agile and dynamic motor skills for legged robots. Sci. Robot. 4(26) (2019) 9. Hwangbo, J., Lee, J., Hutter, M.: Per-contact iteration method for solving contact dynamics. IEEE Robot. Autom. Lett. 3(2), 895–902 (2018) 10. Kim, N.H., Ling, H.Y., Xie, Z., van de Panne, M.: Flexible motion optimization with modulated assistive forces. Proc. ACM Comput. Graph. Interact. Tech. 4(3), 1–25 (2021) 11. Kojima, K., et al.: Development of life-sized high-power humanoid robot Jaxon for real-world use. In: 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pp. 838–843. IEEE (2015) 12. Kojio, Y., et al.: Unified balance control for biped robots including modification of footsteps with angular momentum and falling detection based on capturability. In: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 497–504. IEEE (2019) 13. Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., Hutter, M.: Learning quadrupedal locomotion over challenging terrain. Sci. Robot. 5(47) (2020). https://robotics.sciencemag.org/ content/5/47/eabc5986 14. Li, Z., et al.: Reinforcement learning for robust parameterized locomotion control of bipedal robots. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 2811–2817. IEEE (2021) 15. Miki, T., Lee, J., Hwangbo, J., Wellhausen, L., Koltun, V., Hutter, M.: Learning robust perceptive locomotion for quadrupedal robots in the wild. Sci. Robot. 7(62), eabk2822 (2022) 16. Mordatch, I., Todorov, E., Popovi´c, Z.: Discovery of complex behaviors through contactinvariant optimization. ACM Trans. Graph. (TOG) 31(4), 1–8 (2012) 17. Natale, L., Bartolozzi, C., Pucci, D., Wykowska, A., Metta, G.: iCub: the not-yet-finished story of building a robot child. Sci. Robot. 2(13), eaaq1026 (2017) 18. Panne, M.v.d., Lamouret, A.: Guided optimization for balanced locomotion. In: Terzopoulos, D., Thalmann, D. (eds) Computer Animation and Simulation ’95, pp. 165–177. Springer, Cham (1995).https://doi.org/10.1007/978-3-7091-9435-5 13 19. Peng, X.B., Abbeel, P., Levine, S., van de Panne, M.: Deepmimic: example-guided deep reinforcement learning of physics-based character skills. ACM Trans. Graph. (TOG) 37(4), 1–14 (2018) 20. Pinto, L., Davidson, J., Sukthankar, R., Gupta, A.: Robust adversarial reinforcement learning. In: International Conference on Machine Learning, pp. 2817–2826. PMLR (2017) 21. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017) 22. Shi, F., Anzai, T., Kojio, Y., Okada, K., Inaba, M.: Learning agile hybrid whole-body motor skills for thruster-aided humanoid robots. In: 2021 IEEE International Conference on Intelligent Robots and Systems (IROS) (2022) 23. Shi, F., et al.: Circus anymal: a quadruped learning dexterous manipulation with its limbs. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 2316–2323 (2021) 24. Siekmann, J., Godse, Y., Fern, A., Hurst, J.: Sim-to-real learning of all common bipedal gaits via periodic reward composition. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 7309–7315 (2021) 25. Siekmann, J., Green, K., Warila, J., Fern, A., Hurst, J.: Blind bipedal stair traversal via simto-real reinforcement learning. Robot. Sci. Syst. (RSS) (2021) 26. Siekmann, J., et al.: Learning memory-based control for human-scale bipedal locomotion. Robot. Sci. Syst. (RSS) (2020) 27. Starke, S., Zhao, Y., Komura, T., Zaman, K.: Local motion phases for learning multi-contact character movements. ACM Trans. Graph. (TOG) 39(4), 1–54 (2020)

320

F. Shi et al.

28. Thananjeyan, B., et al.: Recovery RL: safe reinforcement learning with learned recovery zones. IEEE Robot. Autom. Lett. 6(3), 4915–4922 (2021) 29. Tidd, B., Hudson, N., Cosgun, A.: Guided curriculum learning for walking over complex terrain. arXiv preprint arXiv:2010.03848 (2020) 30. Wrotek, P., Jenkins, O.C., McGuire, M.: Dynamo: dynamic, data-driven character control with adjustable balance. In: Proceedings of the 2006 ACM SIGGRAPH symposium on Videogames, pp. 61–70 (2006) 31. Xie, Z., Clary, P., Dao, J., Morais, P., Hurst, J., Panne, M.: Learning locomotion skills for cassie: iterative design and sim-to-real. In: Conference on Robot Learning, pp. 317–329. PMLR (2020) 32. Xu, K., Verma, S., Finn, C., Levine, S.: Continual learning of control primitives: skill discovery via reset-games. Adv. Neural. Inf. Process. Syst. 33, 4999–5010 (2020) 33. Yang, T.Y., Zhang, T., Luu, L., Ha, S., Tan, J., Yu, W.: Safe reinforcement learning for legged locomotion. arXiv preprint arXiv:2203.02638 (2022) 34. Yu, W., Turk, G., Liu, C.K.: Learning symmetric and low-energy locomotion. ACM Trans. Graph. (TOG) 37(4), 1–12 (2018) 35. Yuan, Y., Kitani, K.: Residual force control for agile human behavior imitation and extended motion synthesis. Adv. Neural. Inf. Process. Syst. 33, 21763–21774 (2020) 36. Zhang, K., Hu, B., Basar, T.: On the stability and convergence of robust adversarial reinforcement learning: a case study on linear quadratic systems. Adv. Neural. Inf. Process. Syst. 33, 22056–22068 (2020)

Ball-and-Socket Joint Pose Estimation Using Magnetic Field Tai Hoang1(B) , Alona Kharchenko2 , Simon Trendel2 , and Rafael Hostettler2 1

2

Technical University of Munich, Munich, Germany [email protected] Devanthro - The Robody Company, Munich, Germany

Abstract. Roboy 3.0 is an open-source tendon-driven humanoid robot that mimics the musculoskeletal system of the human body. Roboy 3.0 is being developed as a remote robotic body - or a robotic avatar - for humans to achieve remote physical presence. Artiﬁcial muscles and tendons allow it to closely resemble human morphology with 3-DoF neck, shoulders and wrists. Roboy 3.0’s 3-DoF joints are implemented as balland-socket joints. While industry provides a clear solution for 1-DoF joint pose sensing, it is not the case for the ball-and-socket joint type. In this paper we present a custom solution to estimate the pose of a ball-andsocket joint. We embed an array of magnets into the ball and an array of 3D magnetic sensors into the socket. We then, based on the changes in the magnetic ﬁeld as the joint rotates, are able to estimate the orientation of the joint. We evaluate the performance of two neural network approaches using the LSTM and Bayesian-ﬁlter like DVBF. Results show that in order to achieve the same mean square error (MSE) DVBFs require signiﬁcantly more time training and hyperparameter tuning compared to LSTMs, while DVBF cope with sensor noise better. Both methods are capable of real-time joint pose estimation 37 Hz with MSE of around 0.03 rad for all three degrees of freedom combined. The LSTM model is deployed and used for joint pose estimation of Roboy 3.0’s shoulder and neck joints. The software implementation and PCB designs are opensourced under https://github.com/Roboy/ball and socket estimator.

1

Introduction

Classical rigid robots, which often consist of a chain of rigid and heavy links, pose a safety risk in unstructured environments. Soft robots, on the other hand, made primarily of lightweight materials, have a great potential to operate safely in any environment, especially during dynamic interaction with humans. Many soft mechanical designs have been proposed in recent years. Musculoskeletal robot [5] or more speciﬁcally, Roboy 3.0, an open-source humanoid robot, is one of the few examples of the soft humanoid robot designs. It mimics the working principle of the human musculo-skeletal system and morphology. Instead of having an actuator directly in the joint like in classic rigid robots, Roboy 3.0 links are actuated by a set of artiﬁcial muscles and tendons - series elastic actuators. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 321–334, 2023. https://doi.org/10.1007/978-3-031-25555-7_22

322

T. Hoang et al.

Such design allows to passively store the energy in the muscles and makes the robot inherently compliant. These properties, combined with anthropomorphic morphology, turn to be beneﬁcial in full-body motion mapping during teleoperation as well collaborative object manipulation and other close physical interactions with humans. However, achieving reliable, stable and precise joint and endeﬀector-workspace control for a tendon-driven humanoid [9] is challenging due to diﬃculties in precise modelling of muscle co-actuation and tendon friction among other things. Thus, a high-accuracy robot state estimation is of crucial importance to enable the implementation of closed-loop control algorithms. There are two types of joint being used in the upper body of Roboy 3.0: 3-DoF and 1-DoF. The neck, shoulders, and wrists are 3-DoF joint and elbows are 1-DoF. Determining the joint position for 1-DoF joints is possible with high precision using oﬀthe-shelf sensors. However, the same solution is Fig. 1. Tendon-driven humanoid not directly applicable for the 3-DoF joint case. robot Roboy 3.0 We therefore, in this paper provide a solution on both hardware and software for this 3-DoF ball-socket joint (Fig. 1). Our idea is based on using the magnetic ﬁeld of strong neodymium magnets to determine the orientation of the joint. This has long been a well known approach due to various reasons. First, it is a contact-free measurement, and way more robust to the environment compared to other contact-free sensors like optical [3] and vision-based [2]. Second, the measurement value can be obtained at high rates with a high accuracy. The sensor devices are also not expensive and have a suﬃciently small size which makes it easy to integrate into our robot. However, obtaining the orientation directly from the magnetic value in closedform is challenging due to their non-linearity relation. We therefore shift the focus on the data driven approach, or more particularly, learning a neural network to do the transformation between the magnetic sensors output and the joint’s orientation. Recently, several researchers have successfully developed solutions based on this idea. Jungkuk [8] built a full pipeline to obtain the orientation of a 2-DoF object with only a single sensor, Meier [11] developed a system that can determine the human hand-gesture based on the magnetic ﬁeld. Magnetic ﬁeld can also be used to determine the position of an indoor robot, which is an active research area for indoor localization problems [12,14]. The magnetic sensor used in this paper is a three-dimensional Hall-eﬀect sensor. This particular type of sensor has a critical issue: it is very noisy and since our ultimate goal is using the inferred orientation to close the control

Ball-and-Socket Joint Pose Estimation Using Magnetic Field

323

loop, it is very important to mitigate this issue, otherwise our robot could not operate safely and can easily damage itself and the environment. In this paper, we proposed to use two types of neural network that can deal with this problem: a recurrent-based neural network (LSTM) [1] and Deep Variational Bayes Filter (DVBF) [6,7]. We hypothesize that both of the approaches can smoothen out the output signal. While the former does this based on the historical data, the latter approach is designated to do this more explicitly with the integrated Bayesian ﬁlter.

2

Hardware

(a) ball-in-socket joint

(b) printed circuit board

Fig. 2. Hardware design of Roboy 3.0 shoulder ball-and-socket joint

Figure 2 shows the hardware design of the ball-socket joint of Roboy 3.0. On the left image, the ball-in-socket includes two main components, the ball and the socket. Inside the 3D-printed socket is the PCB with four 3D-magnetic sensors TLE493d mounted around the circuit (Fig. 2b). The sensor data is automatically read out by the Terasic DE10-Nano FPGA dev kit using an I2C switch TCA9546A. The sensors measure the magnetic ﬁeld strength generated by three 10mm neodymium cube magnets mounted inside the ball. The magnets are distributed as depicted in Fig. 3. The magnetic ﬁeld was simulated using magpylib [13] and the result is shown in Fig. 4. The main goal of the design is to have

324

T. Hoang et al.

(a) Side view

(b) Top view

Fig. 3. Location of magnets inside the ball

as little symmetry in the magnetic ﬁeld as possible, otherwise the same magnetic ﬁeld would be sensed for diﬀerent joint positions and such a function is impossible to learn with standard feed forward neural networks. Many magnet arrangements were evaluated both in simulation and hardware. The three magnet arrangement was found to provide suﬃcient sensor signal strength and little symmetry for our target joint angle range.

3 3.1

Methodologies Problem Statement

In this research, our objective is obtaining the orientation of the ball-and-socket joint based on the output of the magnetic sensors. Formally speaking, let xi ∈ X be the magnetic output of the sensor i-th, y ∈ SO(3) is our actual joint angle. We want to learn a transformation function f such that: f ({x}ni=1 ) = y The problem seems to be easily solvable with a simple feedforward neural network. However, there are two problems: – Output of a feedforward neural network often lies in Euclidean space, i.e. R3 . Without further care, the output will be discontinuous whenever it approaches the singularity point. To solve this, we use a solution proposed by researchers from the computer graphics community [10]. In short, the output still lies in R3 , but then will be transformed to the 6D vector representation which is possible to convert to the Euler angle afterwards. The proposed transformation is diﬀerentiable; hence the end-to-end learning fashion remains.

Ball-and-Socket Joint Pose Estimation Using Magnetic Field

325

Fig. 4. Simulation of magnetic ﬁeld generated by three magnets

– The input source xi is noisy, and sometimes returns an outlier value. This leads to a undesirable consequence as the distorted output of the network will be propagated through the kinematic control stack, causing instabilities and unpredicted behavior. We therefore propose to treat the problem as the sequence model regression problem. Instead of relying on the only recent input, we take the advantage of having a time-series data, i.e. series of input and expect that the output will be smooth by those non-corrupted historical input afterwards. Two proposed models will be explained more in detail in the following subsections. 3.2

Recurrent Neural Network

A Recurrent Neural Network (RNN) takes a sequence of inputs (x1 , x2 , ...) and outputs a corresponding sequence (y1 , y2 , ...). The network weights are shared time-wise. However, this naive design leads to a critical issue which is often referred to as “vanishing and exploding gradient” since after a certain time, the gradient becomes too small or too large, and the network will eventually stop learning. This well-known problem can be solved by the Long-short term memory (LSTM) [1] gate-mechanism. The main idea of LSTM is, instead of keeping all the information from history, we let the gates decide which information to keep, which to forget. In formal, let Ct−1 be the cell state at the previous time step, Cˆt be the updated state, and Ct be the ﬁnal state, the update will look like this: Ct = Ct−1 ft + Cˆt it where ft and it are forget and input gate, respectively. Both are scalar, and if it is close to 0, Ct will completely ignore the information from the previous cell Ct−1 and vice versa. Then, the ﬁnal output can be obtained with this equation: yt = ot tanh(Ct )

326

3.3

T. Hoang et al.

Deep Variational Bayes Filters

A simple feedforward or recurrent (LSTM) model cannot provide a good estimate of the real-world dynamics and uncertainty. To mitigate this, we also try Fusion Deep Variational Bayes Filter (Fusion DVBF) [6,7] in this research. In general, Fusion DVBF is a neural network that can perform ﬁltering during inference and also has the possibility to fuse multiple sources of sensory input to obtain a more accurate state estimation (data fusion). In terms of probabilistic sequential model, the conditional likelihood distribution is written as p(x1:T | u1:T ) =

p(x1 |z1 )ρ(z1 )

T

p(xt |zt )p(zt |zt−1 , ut−1 )dz1:T

(1)

t=2

Maximizing Eq. 1 requires solving an integral, which is intractable as all the variables x, z, u lie in the continuous space. One way to solve it is doing the maximization on the evidence lower bound (ELBO) instead: log p(x1:T | u1:T ) ≥ LELBO (x1:T , u1:T ) = Eq [log p(x1:T | z1:T )] + KL(q(z1:T | x1:T , u1:T ) || p(z1:T )) The LELBO is computed by introducing a variational distribution q(z1:T | x1:T , u1:T ) to approximate the intractable p(z1:T | x1:T , u1:T ). The main components of Fusion DVBF distinguish it from classical machine learning are: 1. Amortized inference: the variational distribution q was approximated by a neural network qθ (z1:T | x1:T , u1:T ), which is also referred to as the recognition network. 2. The recognition network can be written in a form that is similar to the classical Bayesian ﬁlter [7] q(zt |z1:t−1 , x1:t , u1:t ) ∝ qmeas (zt |xt ) × qtrans (zt |zt−1 , ut )

(2)

These modiﬁcations form an end-to-end learning, in the style of variational auto-encoder [4]. Moreover, modelling the inference network like Eq. 2 allows us to put more sensory input knowledge via qmeas (zt |xt ) as in classical data fusion. To apply Fusion DVBF into our problem, we need two further modiﬁcations. First, the original model only returns the latent representation of the input space. However, we also need a corresponding output (joint angle) from the sensory input (magnetic output). Therefore, we introduce another “decoder” network that maps the latent state z to the output y, i.e. q(y | z). Second, although applying data fusion for multiple sensory input is straightforward, without any constraint, there is a high possibility that during training, the latent state results from each sensor will be far apart, which is non-intuitive for the aggregation afterwards. To solve this, we put a constraint that forces the latent states to stay close to each other.

Ball-and-Socket Joint Pose Estimation Using Magnetic Field

327

Output-Space Transformation. We introduce another random variable y that will be stacked together with the observable x. This is our desired output. Together, they form the log likelihood distribution log(x1:T , y1:T | u1:T ), with the corresponding ELBO

LELBO (x1:T , y1:T , u1:T ) = Eq [log p(x1:T | z1:T )] + Eq [log p(y1:T | z1:T )] + KL(q(z1:T | x1:T , u1:T ) || p(z1:T )) (3) where p(y | z) is another probabilistic neural network and acts as the decoder network like p(x | z). n-Sensors Fusion. We hypothesise that each sensor has enough information to infer the latent state. Therefore, we put them as an independent distribution, and together form the recognition distribution in a factorisation fashion: qmeas (·) =

n

qi (·) =

i=1

n

N (μi , Σi )

i=1

with n is the number of sensors and the way the aggregated Gaussian parameters (μmeas , σmeas ) computed is as follow:

σmeas

−1/2 n 1 = σ2 i=1 i

μmeas = σmeas

n

σi −2 × μi

i=1

Normally, due to the suboptimality, each qi (z|x, u) can have diﬀerent value and still result in the same μmeas and σmeas . This is actually bad as one of the advantages of sensor fusion is that, in the aggregation step, it will penalise the less accurate sensor based on its variance. Therefore, it makes sense if the latent state resulting from every sensor stays close to each other. Then, even though there is corruption in the sensory inputs, it will be strongly penalised and consequently, other non-corrupted inputs can help to cover this. The extra soft constraint we put is: 2 Lnew ELBO (x1:T , y1:T , u1:T ) = LELBO (x1:T , y1:T , u1:T ) + αμi (x)2

It simply forces the output of the encoder to stay close to zero, and since we do not want it to be too restrictive, we put a small weight α on the constraint.

328

4 4.1

T. Hoang et al.

Experiments Experiment Setup

Data Collection. To collect data, we use an open loop control and do random walks on the 3D space: we randomly sample a target joint and do the linear interpolation between that point and the current position, and move the arm towards that point. Data is recorded at every 20 ms along the movement, including the magnetic sensors output, the joint targets, and the joint pose. The joint pose is our ground truth obtained using two HTC Vive trackers by mounting one tracker on the base link (torso) and one the moving link (e.g. upper arm) and calculating orientation between the two, which yields the orientation of the desired 3-DoF joint. Data Analysis. The most basic assumption to be successful in the data driven approach is the correlation between inputs and outputs. If they are correlated, even weakly, there is hope that there exists a neural network that can approximate the actual function. Figure 5 and Fig. 6 show this information, and in both ﬁgures, based on the color which indicates the output values (Euler angle), ﬁrst, we do not see any clear discontinuities in the plots. This is important, otherwise learning would be very diﬃcult or even impossible if the data are highly disconnected. On the ﬁrst and second columns, the correlation looks really strong, we can see a lot of smooth curves, and together they even form a shape that looks like the entire workspace of the joint. Evaluation. To understand how well the proposed methods perform, we do the following setup on the left shoulder ball-and-socket joint of our robot: – For DVBF, our input is an 4D array of four magnetic sensor data, which will be treated individually by the architecture described in the previous section. DVBF acts as an online ﬁltering method; data goes in and out one after another like a stream. We also feed to the DVBF the joint targets, it acts as control input signal, and is of importance for the underlying state prediction phase. – Unlike DVBF, in the LSTM network case, we only take ﬁve recent sensor outputs (no control input), feed them to the network and predict the next one. We follow this approach since LSTM is known to be sensitive with the historical data, and we do not want it to be too dependent on the far unuseful data.

4.2

Experiment Results

First, we evaluate the training time of LSTM and DVBF. Based on Fig. 7, it is obvious that to get the same mean square error (MSE) loss, DVBF needs

Ball-and-Socket Joint Pose Estimation Using Magnetic Field

329

Fig. 5. Correlation between outputs from magnetic sensor and the joint position. Columns are the magnetic sensors, and rows are the ground truth Euler angles.

Fig. 6. Correlation between outputs from magnetic sensor (reduced by Principal Component Analysis) and the joint position.

signiﬁcantly longer time, roughly two and a half hours compared to just 10 min from LSTM. This is quite surprising as we do not expect such a huge gap, probably the new networks introduced in the previous section makes DVBF struggling to ﬁnd a good latent representation that satisﬁes all of the conditions. We then demonstrate the performance of both the networks on the unseen data. To evaluate real-time accuracy of the proposed methods we executed a 3D random walk joint pose trajectory with the following characteristics:

330

T. Hoang et al.

Fig. 7. Validation MSE

(a) Magnetic sensor 1

(b) Magnetic sensor 2

(c) Magnetic sensor 3

(d) Magnetic sensor 4

Fig. 8. Magnetic sensor encoder values from time step 9500 to 11000 of the test dataset. There are four sensors placed around the disk, with the interval 20 ms between two consecutive time steps

Ball-and-Socket Joint Pose Estimation Using Magnetic Field

(a) DVBF and ground truth

331

(b) LSTM and ground truth

Fig. 9. Error between our predictive model (DVBF and LSTM) and ground truth from time step 9000 to 13000 of the test dataset. Data is plotted on three Euler angles, with the interval 20 ms between two consecutive time steps

– trajectory duration: 404 s – actual reached joint limits (in radians): • rotation around X axis: [−1.418, 0.647] • rotation around Y axis: [−1.457, −0.0288] • rotation around Z axis: [−2.036, 0.061] – magnetic sensor data published 65 Hz per joint The experiment yielded real-time joint pose estimation with both methods 37 Hz with MSE of 0.0382 rad rad for DVBF and 0.0362 rad rad for LSTM. Further we analyze the quality of the predictions on a selected snapshot of our test data depicted on Fig. 9 and Fig. 10. Both LSTM and DVBF perform well on the period where there is no noise in the sensor data, the prediction output is indeed very close to the ground truth data. At around time step 10200, however, there is a sudden jump in the magnetic output (Fig. 8a), and as a result, this sudden jump also appears in the prediction outputs on both of the methods. However, ﬁltering-like DVBF is more advantageous in these cases, the learned state space model is able to recognize the corrupted sensors, and as a result, one can clearly see that the magnitude of the spike is much smaller compared to its corresponding output from LSTM model, though the jump is still recognizable from the Fig. 10e. One can argue that those spikes can be eliminated by an appropriate post processing. However, we experienced that, it is only true if the jump magnitude is obviously an outlier. Otherwise, ﬁnding a compromise between the accuracy in the next prediction step and the appearance of an outlier is not that straightforward. This is the main reason why we choose DVBF, since this task is similar to the online ﬁltering paradigm, ﬁnding the compromise between the model and the measurements.

332

T. Hoang et al.

(a) Euler X - DVBF

(b) Euler X - LSTM

(c) Euler Y - DVBF

(d) Euler Y - LSTM

(e) Euler Z - DVBF

(f) Euler Z - LSTM

Fig. 10. Comparison between LSTM and DVBF with ground truth data from time step 9000 to 13000 of the test dataset. Data is plotted on three Euler angles, with the interval 20 ms between two consecutive time steps

5

Conclusions

We presented in this paper a solution to estimate the orientation of the 3-DoF ball-and-socket joint. Magnetic ﬁeld based approach was chosen due to its easy integrability and robustness of the environment. Although the magnetic ﬁeld is

Ball-and-Socket Joint Pose Estimation Using Magnetic Field

333

stable, the three-dimensional Hall-eﬀect sensor we used to get the magnetic value is known to be very noisy. We demonstrated in the paper that two sequentialbased neural networks LSTM and DVBF can help mitigate this critical issue. The experimental results show that both LSTM and DVBF are able to successfully give a high accurate orientation of the joint. However, in the period when the magnetic sensor is corrupted, DVBF works better, the resulting output is smoother than LSTM method. Unfortunately, DVBF also has several clear disadvantages: the network structure is too sophisticated and tuning hyperparameters is not always a straightforward process despite potentially huge improvements in network’s performance. Also, the training time of DVBF is signiﬁcantly longer than LSTM. We therefore decided that, LSTM would be good solution for fast prototyping but then in the production phase, DVBF will be deployed to give a better estimation. Acknowledgements. We would like to thank Maximilian Karl for the helpful discussions on DVBFs model and Jurgen Lippl for the 3D model of the ball-socket-joint.

References 1. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 2. Garner, H., Klement, M., Lee, K.-M.: Design and analysis of an absolute noncontact orientation sensor for wrist motion control. In: Proceedings of the 2001 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (Cat. No. 01TH8556), vol. 1, pp. 69–74 (2001). https://doi.org/10.1109/AIM.2001. 936432 3. Lee, K.-M., Zhou, D.: A real-time optical sensor for simultaneous measurement of three-DOF motions. IEEE/ASME Trans. Mechatron. 9(3), 499–507 (2004). https://doi.org/10.1109/TMECH.2004.834642 4. Kingma, D.P., Welling, M.: Auto-encoding variational bayes (2014). arXiv:1312.6114 [stat.ML] 5. Richter, C., Jentzsch, S., Hostettler, R., et al.: Musculoskeletal robots: scalability in neural control. IEEE Robot. Autom. Mag. 23(4), 128–137 (2016). https://doi. org/10.1109/MRA.2016.2535081 6. Karl, M., Soelch, M., Bayer, J., van der Smagt, P.: Deep variational bayes ﬁlters: unsupervised learning of state space models from raw data. In: 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017, Conference Track Proceedings, Open-Review.net (2017). https://openreview. net/forum?id=HyTqHL5xg 7. Karl, M., Soelch, M., Becker-Ehmck, P., Benbouzid, D., van der Smagt, P., Bayer, J.: Unsupervised real-time control through variational empowerment (2017). arXiv:1710.05101 [stat.ML] 8. Kim, J., Son, H.: Two-DOF orientation measurement system for a magnet with single magnetic sensor and neural network. In: 2017 14th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), pp. 448–453 (2017). https://doi.org/10.1109/URAI.2017.7992773 9. Trendel, S., Chan, Y.P., Kharchenko, A., Hostetrler, R., Knoll, A., Lau, D.: CardsFlow: an end-to-end open-source physics environment for the design, simulation

334

10. 11.

12.

13. 14.

T. Hoang et al. and control of musculoskeletal robots. In: 2018 IEEERAS 18th International Conference on Humanoid Robots (Humanoids), pp. 245–250 (2018). https://doi.org/ 10.1109/HUMANOIDS.2018.8624940 Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks (2018). arXiv:1812.07035 Meier, P., Rohrmann, K., Sandner, M., Prochaska, M.: Application of magnetic ﬁeld sensors for hand gesture recognition with neural networks. In: 2019 IEEE 1st Global Conference on Life Sciences and Technologies (LifeTech), pp. 200–203 (2019). https://doi.org/10.1109/LifeTech.2019.8884006 Chiang, T.-H., Sun, Z.-H., Shiu, H.-R., Lin, K.C.-J., Tseng, Y.-C.: Magnetic ﬁeldbased localization in factories using neural network with robotic sampling. IEEE Sens. J. 20(21), 13 110–13 118 (2020). https://doi.org/10.1109/JSEN.2020.3003404 Ortner, M., Coliado Bandeira, L.G.: Magpylib: a free python package for magnetic ﬁeld computation. SoftwareX (2020). 2020.100466. https://doi.org/10.1016/j.softx Sasaki, A.-I.: Eﬀectiveness of artiﬁcial neural networks for solving inverse problems in magnetic ﬁeld-based localization. Sensors 22(6) (2022). https://doi.org/10. 3390/s22062240. https://www.mdpi.com/1424-8220/22/6/2240. ISSN 1424-8220

BulletArm: An Open-Source Robotic Manipulation Benchmark and Learning Framework Dian Wang(B) , Colin Kohler, Xupeng Zhu, Mingxi Jia, and Robert Platt Khoury College of Computer Sciences, Northeastern University, Boston, MA 02115, USA {wang.dian,kohler.c,zhu.xup,jia.ming,r.platt}@northeastern.edu

Abstract. We present BulletArm, a novel benchmark and learning-environment for robotic manipulation. BulletArm is designed around two key principles: reproducibility and extensibility. We aim to encourage more direct comparisons between robotic learning methods by providing a set of standardized benchmark tasks in simulation alongside a collection of baseline algorithms. The framework consists of 31 different manipulation tasks of varying difficulty, ranging from simple reaching and picking tasks to more realistic tasks such as bin packing and pallet stacking. In addition to the provided tasks, BulletArm has been built to facilitate easy expansion and provides a suite of tools to assist users when adding new tasks to the framework. Moreover, we introduce a set of five benchmarks and evaluate them using a series of state-of-the-art baseline algorithms. By including these algorithms as part of our framework, we hope to encourage users to benchmark their work on any new tasks against these baselines. Keywords: Benchmark · Simulation · Robotic learning · Reinforcement learning

1 Introduction Inspired by the recent successes of deep learning in the field of computer vision, there has been an explosion of work aimed at applying deep learning algorithms across a variety of disciplines. Deep reinforcement learning, for example, has been used to learn policies which achieve superhuman levels of performance across a variety of games [30, 36]. Robotics has seen a similar surge in recent years, especially in the area of robotic manipulation with reinforcement learning [12, 20, 51], imitation learning [50], and multi-task learning [10, 14]. However, there is a key difference between current robotics learning research and past work applying deep learning to other fields. There currently is no widely accepted standard for comparing learning-based robotic manipulation methods. In computer vision for example, the ImageNet benchmark [8] has been a crucial factor in the explosion of image classification algorithms we have seen in the recent past. D. Wang and C. Kohler—Equal Contribution. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 335–350, 2023. https://doi.org/10.1007/978-3-031-25555-7_23

336

D. Wang et al.

While there are benchmarks for policy learning in domains similar to robotic manipulation, such as the continuous control tasks in OpenAI Gym [4] and the DeepMind Control Suite [37], they are not applicable to more real-world tasks we are interested in robotics. Furthermore, different robotics labs work with drastically different systems in the real-world using different robots, sensors, etc. As a result, researchers often develop their own training and evaluation environments, making it extremely difficult to compare different approaches. For example, even simple tasks like block stacking can have a lot of variability between different works [28, 31, 43], including different physics simulators, different manipulators, different object sizes, etc. In this work, we introduce BulletArm, a novel framework for robotic manipulation learning based on two key components. First, we provide a flexible, open-source framework that supports many different manipulation tasks. Compared with prior works, we introduce tasks with various difficulties and require different manipulation skills. This includes long-term planning tasks like supervising Covid tests and contact-rich tasks requiring precise nudging or pushing behaviors. BulletArm currently consists of 31 unique tasks which the user can easily customize to mimic their real-world lab setups (e.g., workspace size, robotic arm type, etc). In addition, BulletArm was developed with a emphasis on extensability so new tasks can easily be created as needed. Second, we include five different benchmarks alongside a collection of standardized baselines for the user to quickly benchmark their work against. We include our implementations of these baselines in the hopes of new users applying them to their customization of existing tasks and whatever new tasks they create. Our contribution can be summarized as three-fold. First, we propose BulletArm, a benchmark and learning framework containing a set of 21 open-loop manipulation tasks and 10 close-loop manipulation tasks. We have developed this framework over the course of many prior works [3, 42–45, 54]. Second, we provide state-of-the-art baseline algorithms enabling other researchers to easily compare their work against our baselines once new tasks are inevitably added to the baseline. Third, BulletArm provides a extensive suite of tools to allow users to easily create new tasks as needed. Our code is available at https://github.com/ColinKohler/BulletArm.

2 Related Work Reinforcement Learning Environments. Standardized environments are vitally important when comparing different reinforcement learning algorithms. Many prior works have developed various video game environments, including PacMan [33], Super Mario [39], Doom [21], and StarCraft [41]. OpenAI Gym [4] provides a standard API for the communication between agents and environments, and a collection of various different environments including Atari games from the Arcade Learning Environment (ALE) [2] and some robotic tasks implemented using MuJoCo [38]. The DeepMind Control Suite (DMC) [37] provides a similar set of continuous control tasks. Although both OpenAI Gym and DMC have a small set of robotic environments, they are toy tasks which are not representative of the real-world tasks we are interested in robotics. Robotic Manipulation Environments. In robotic manipulation, there are many benchmarks for grasping in the context of supervised learning, e.g., the Cornell dataset [19],

BulletArm

337

the Jacquard dataset [9], and the GraspNet 1B dataset [11]. In the context of reinforcement learning, on the other hand, the majority of prior frameworks focus on single tasks, for example, door opening [40], furniture assembly [27], and in-hand dexterous manipulation [1]. Another strand of prior works propose frameworks containing a variety of different environments, such as robosuite [55], PyRoboLearn [7], and MetaWorld [48], but are often limited to short horizon tasks. Ravens [50] introduces a set of environments containing complex manipulation tasks but restricts the end-effector to a suction cup gripper. RLBench [18] provides a similar learning framework to ours with a number of key differences. First, RLBench is built around the PyRep [17] interface and is therefor built on-top of V-REP [34]. Furthermore, RLBench is more restrictive than BulletArm with limitations placed on the workspace scene, robot, and more. Robotic Manipulation Control. There are two commonly used end-effector control schemes: open-loop control and close-loop control. In open-loop control, the agent selects both the target pose of target pose of the end-effector and some action primitive to execute at that pose. Open-loop control generally has shorter time horizon, allowing the agent to solve complex tasks that require a long trajectory [43, 51, 52]. In close-loop control, the agent sets the displacement of the end-effector This allows the agent to more easily recover from failures which is vital when dealing with contact-rich tasks [12, 20, 44, 53]. BulletArm provides a collection of environments in both settings, allowing the users to select either one based on their research interests.

3 Architecture At the core of our learning framework is the PyBullet [6] simulator. PyBullet is a Python library for robotics simulation and machine learning with a focus on sim-to-real transfer. Built upon Bullet Physics SDK, PyBullet provides access to forward dynamics simulation, inverse dynamics computation, forward and inverse kinematics, collision detection, and more. In addition to physics simulation, there are also numerous tools for scene rendering and visualization. BulletArm builds upon PyBullet, providing a diverse set of tools tailored to robotic manipulation simulations. 3.1 Design Philosophy The design philosophy behind our framework focuses on four key principles: 1. Reproducibility: A key challenge when developing new learning algorithms is the difficulty in comparing them to previous work. In robotics, this problem is especially prevalent as different researchers have drastically different robotic setups. This can range from small differences, such as workspace size or degradation of objects, to large differences such as the robot used to preform the experiments. Moving to simulation allows for the standardization of these factors but can impact the performance of the trained algorithm in the real-world. We aim to encourage more direct comparisons between works by providing a flexible simulation environment and a number of baselines to compare against.

338

D. Wang et al.

Fig. 1. Example scripts using our framework. (Left) Creating and interacting with a environment running the Block Stacking task. (Right) Creating a new block structure construction task by subclassing the existing base domain.

2. Extensibility: Although we include a number of tasks, control types, and robots; there will always be a need for additional development in these areas. Using our framework, users can easily add new tasks, robots, and objects. We make the choice to not restrict tasks, allowing users more freedom create interesting domains. Figure 1 shows an example of creating a new task using our framework. 3. Performance: Deep learning methods are often time consuming, slow processes and the addition of a physics simulator can lead to long training times. We have spent a significant portion of time in ensuring that our framework will not bottleneck training by optimizing the simulations and allowing the user to run many environments in parallel. 4. Usability: A good open-source framework should be easy to use and understand. We provide extensive documentation detailing the key components of our framework and a set of tutorials demonstrating both how to use the environments and how to extend them. 3.2

Environment

Our simulation setup (Fig. 4) consists of a robot arm mounted on the floor of the simulation environment, a workspace in front of the robot where objects are generated, and a sensor. Typically, we use top-down sensors which generate heightmaps of the workspace. As we restrict the perception data to only the defined workspace, we choose to not add unnecessary elements to the setup such as a table for the arm to sit upon. Currently there are four different robot arms available in BulletArm (Fig. 3): KUKA IIWA, Frane Emika Panda, Universal Robots UR5 with either a simple parallel jaw gripper or the Robotiq 2F-85 gripper. Environment, Configuration, and Episode are three key terms within our framework. An environment is an instance of the PyBullet simulator in which the robot interacts with objects while trying to solve some task. This includes the initial environment state, the reward function, and termination requirements. A configuration contains additional specifications for the task such as the robotic arm, the size of the workspace, the physics mode, etc. (see Appendix B for an full list of parameters). Episodes are generated by taking actions (steps) within an environment until the episode ends. An

BulletArm

339

Fig. 2. The Deconstruction planner. Left to right: a deconstruction episode where the expert deconstructs the block structure in the left-most figure. Right to left: a construction episode is generated by reversing the deconstruction episode. This is inspired by [49] where the authors propose a method to learn kit assembly through disassembly by reversing disassembly transitions.

episode trajectory τ contains a series of observations o, actions a, and rewards r: τ = [(o0 , a0 , r0 ), ..., (oT , aT , rT )]. Users interface with the learning environment through the EnvironmentFactory and EnvironmentRunner classes. The EnvironmentFactory is the entry point and creates the Environment class specified by the Configuration passed as input. The EnvironmentFactory can create either a single environment or multiple environments meant to be run in parallel. In either case, an EnvironmentRunner instance is returned and provides the API which interacts with the environments. This API, Fig. 1, is modelled after the typical agent-environment RL setup popularized by OpenAI Gym [4]. The benchmark tasks we provide have a sparse reward function which returns +1 on successful task completion and 0 otherwise. While we find this reward function to be advantageous as it avoids problems due to reward shaping, we do not require that new tasks conform to this. When defining a new task, the reward function defaults to sparse but users can easily define their custom reward for a new task. We separate our tasks into two categories based on the action spaces: open-loop control and closed-loop control. These two control modes are commonly used in robotics manipulation research. 3.3 Expert Demonstrations Expert demonstrations are crucial to many robotic learning fields. Methods such as imitation and model-based learning, for example, learn directly from expert demonstrations. Additionally, we find that in the context of reinforcement learning, it is vital to seed learning with expert demonstrations due to the difficulties in exploring large stateaction spaces. We provide two types of planners to facilitate expert data generation: the Waypoint Planner and the Deconstruction Planner. The Waypoint Planner is a online planning method which moves the end-effector through a series of waypoints in the workspace. We define a waypoint as wt = (pt , at ) where pt is the desired pose of the end effector and at is the action primitive to execute at that pose. These waypoints can either be absolute positions in the workspace or positions relative to the objects in the workspace. In open-loop control, the planner returns the waypoint wt as the action executed at time t. In close-loop control, the planner will continuously return a small displacement from the current end-effector pose as

340

D. Wang et al.

(a) Kuka

(b) Panda

(c) UR5 Parallel (d) UR5 Robotiq

Fig. 3. Our work currently supports four different arms: Kuka, Panda, UR5 with parallel jaw gripper, and UR5 with Robotiq gripper.

Fig. 4. The environment containing a robot arm, a camera, and a workspace

Fig. 5. (a) The manipulation scene. (b) The state including a top-down heightmap I, an in-hand image H and the gripper state g.

the action at time t. This process is repeated until the waypoint has been reached. The Deconstruction Planner is a more limited planning method which can only be applied to pick-and-place tasks where the goal is to arrange objects in a specific manner. For example, we utilize this planner for the various block construction tasks examined in this work (Fig. 2). When using this planner, the workspace is initialized with the objects in their target configuration and objects are then removed one-by-one until the initial state of the task is reached. This deconstruction trajectory, is then reversed to produce an expert construction trajectory, τexpert = reverse(τdeconstruct ).

4 Environments The core of any good benchmark is its set of environments. In robotic manipulation, in particular, it is important to cover a broad range of task difficulty and diversity. To this end, we introduce tasks covering a variety of skills for both open-loop and closeloop control. Moreover, the configurable parameters of our environments enable the user to select different task variations (e.g., the user can select whether the objects in the workspace will be initialized with a random orientation). BulletArm currently provides a collection of 21 open-loop manipulation environments and a collection of 10 close-loop environments. These environments are limited to kinematic tasks where the robot has to directly manipulate a collection of objects in order to reach some desired configuration.

BulletArm

341

(a) Block Stacking (b) House Building 1 (c) House Building 2 (d) House Building 3

(e) House Building 4 (f) Improvise House (g) Improvise House Building 2 Building 3

(i) Bottle Arrangement

(j) Box Palletizing

(k) Covid Test

(h) Bin Packing

(l) Object Grasping

Fig. 6. The open-loop environments. The window on the top-left corner of each sub-figure shows the goal state of each task.

4.1 Open-Loop Environments In the open-loop environments, the agent controls the target pose of the end-effector, resulting in a shorter time horizon for complex tasks. The open-loop environment is comprised of a robot arm, a workspace, and a camera above the workspace providing the observation (Fig. 4). The action space is defined as the cross product of the gripper motion Ag = {PICK, PLACE} and the target pose of the gripper for that motion Ap , A = Ag × Ap . The state space is defined as s = (I, H, g) ∈ S (Fig. 5), where I is a top-down heightmap of the workspace; g ∈ {HOLDING, EMPTY} denotes the gripper state; and H is an in-hand image that shows the object currently being held. If the last action is PICK, then H is a crop of the heightmap in the previous time step centered at the pick position. If the last action is PLACE, H is set to a zero value image. BulletArm provides three different action spaces for Ap : Ap ∈ Axy , Axyθ , ASE(3) . The first option (x, y) ∈ Axy only controls the (x, y) components of the gripper pose, where the rotation of the gripper is fixed; and z is selected using a heuristic function that first reads the maximal height in the region around (x, y) and then either adds a positive offset for a PLACE action or a negative offset for PICK action. The second option (x, y, θ) ∈ Axyθ adds control of the rotation θ along the z-axis. The third option (x, y, z, θ, φ, ψ) ∈ ASE(3) controls the full 6 degree of freedom pose of the grip-

342

D. Wang et al.

(a) The Ramp Environment

(b) The Bump Environment

Fig. 7. The 6DoF environments. (a): In the Ramp Environment, the objects are initialized on two ramps, where the agent needs to control the out-of-plane orientations to pick up the objects. (b): Similarly, in the Bump Environment, the objects are initialized on a bumpy surface.

Fig. 8. (a) The close-loop environment containing a robot arm, two cameras, and a workspace. (b) The point cloud generated from the two cameras. (c) The orthographic projection generated from the point cloud which is used as the observation. The two squares at the center of the image represent the gripper. Alternatively, the image can be generated using a simulated orthographic camera located at the position of the end-effector.

per, including the rotation along the y-axis φ and the rotation along the x-axis ψ. Axy and Axyθ are suited for tasks that only require top-down manipulations, while ASE(3) is designed for solving complex tasks involving out-of-plane rotations. The definition of the state space and the action space in the open-loop environments also enables effortless sim2real transfer. One can reproduce the observation in Fig. 5 in the real-world using an overhead depth camera and transfer the learned policy [43, 45]. Figure 6 shows the 12 basic open-loop environments that can be solved using topdown actions. Those environments can be categorized into two collections, a set of block structure tasks (Figs. 6a–6g), and a set of more realistic tasks (Figs. 6h and 6l). The block structure tasks require the robot to build different goal structures using blocks. The more realistic tasks require the robot to finish some real-world problems, for example, arranging bottles or supervising Covid tests. We use the default sparse reward function for all open-loop environments, i.e., +1 reward for reaching the goal, and 0 otherwise. See Appendix A.1 for a detailed description of the tasks. 6DoF Extensions. The environments that we have introduced so far only require the robot to perform top-down manipulation. We extend those environments to 6 degrees

BulletArm

343

(a) Block Reaching

(b) Block Picking

(c) Block Pushing

(d) Block Pulling

(e) Block in Bowl

(f) Block Stacking

(g) House Building

(h) Corner Picking

(i) Drawer Opening (j) Object Grasping

Fig. 9. The close-loop environments. The window on the top-left corner of each sub-figure shows the goal state of the task.

of freedom by initializing them in either the ramp environment or the bump environment (Fig. 7). In both cases, the robot needs to control the out-of-plane orientations introduced by the ramp or bump in order to manipulate the objects. We provide seven ramp environments (for each of the block structure construction tasks), and two bump environments (House Building 4 and Box Palletizing). See Appendix C for details. 4.2 Close-Loop Environments The close-loop environments require the agent to control the delta pose of the endeffector, allowing the agent more control and enabling us to solve more contact-rich tasks. These environments have a similar setup to the open-loop domain but to avoid the occlusion caused by the arm, we instead use two side-view cameras pointing to the workspace (Fig. 8a). The heightmap I is generated by first acquiring a fused point cloud from the two cameras (Fig. 8b) and then performing an orthographic projection (Fig. 8c). This orthographic projection is centered with respect to the gripper. In practice, this process can be replaced by putting a simulated orthographic camera at the position of the gripper to speed up the simulation. The state space is defined as a tuple s = (I, g) ∈ S, where g ∈ {HOLDING, EMPTY} is the gripper state indicating if there is an object being held by gripper. The action space is defined as the cross product of the gripper open width Aλ and the delta motion of the gripper Aδ , A = Aλ × Aδ .

344

D. Wang et al.

Table 1. The five benchmarks in our work include three open-loop benchmarks and two closeloop benchmarks. ‘fixed orientation’ and ‘random orientation’ indicate whether the objects in the environments will be initialized with a fixed orientation or random orientation.

Benchmark

Environments

Action space

Open-Loop 2D benchmark Open-loop environments with fixed orientation Ag × Axy Open-Loop 3D benchmark Open-loop environments with random orientation Ag × Axyθ Open-Loop 6D benchmark Open-loop environments 6DoF extensions Ag × ASE(3) Close-Loop 3D benchmark Close-loop environments with fixed orientation Aλ × Axyz δ Close-Loop 4D benchmark Close-loop environments with random orientation Aλ × Axyzθ δ Table 2. The number of objects, optimal number of steps per episode, and max number of steps per episode in our Open-Loop 3D benchmark experiments Task

Block stacking

Number of objects

House building 1

House building 2

House building 3

House building 4

Improvise house building 2

Improvise Bin house packing building 3

Bottle arrangement

Box palletizing

Covid test

Object grasping

4

4

3

4

6

3

4

8

6

18

6

15

Optimal number of steps 6

6

4

6

10

4

6

16

12

36

18

1

Max number of steps

10

10

10

20

10

10

20

20

40

30

1

10

xyzθ Two different action spaces are available for Aδ : Aδ ∈ {Axyz }. In Axyz δ , Aδ δ , the robot controls the change of the x, y, z position of the gripper, where the top-down ori, the robot controls the change of the x, y, z position and the entation θ is fixed. In Axyzθ δ top-down orientation θ of the gripper. Figure 9 shows the 10 close-loop environments. We provide a default sparse reward function for all environments. See Appendix A.2 for a detailed description of the tasks.

5 Benchmark BulletArm provides a set of 5 benchmarks covering the various environments and action spaces (Table 1). In this section, we detail the Open-Loop 3D Benchmark and the CloseLoop 4D Benchmark. See Appendix D for the other three benchmarks. 5.1

Open-Loop 3D Benchmark

In the Open-Loop 3D Benchmark, the agent needs to solve the open-loop tasks shown in Fig. 6 using the Ag × Axyθ action space (see Sect. 4.1). We provide a set of baseline algorithms that explicitly control (x, y, θ) ∈ Axyθ and select the gripper motion using the following heuristic: a PICK action will be executed if g = EMPTY and a PLACE action will be executed if g = HOLDING. The baselines include: (1) DQN [30], (2) ADET [24], (3) DQfD [15], and (4) SDQfD [43]. The network architectures for these different methods can be used interchangeably. We provide the following network architectures: 1. CNN ASR [43]: A two-hierarchy architecture that selects (x, y) and θ sequentially.

BulletArm

345

Fig. 10. The Open-Loop 3D Benchmark results. The plots show the evaluation performance of the greedy policy in terms of the task success rate. The evaluation is performed every 500 training steps. Results are averaged over four runs. Shading denotes standard error.

2. Equivariant ASR (Equi ASR) [45]: Similar to ASR, but instead of using conventional CNNs, equivariant steerable CNNs [5, 46] are used to capture the rotation symmetry of the tasks. 3. FCN: a Fully Convolutional Network (FCN) [29] which outputs a n channel actionvalue map for each discrete rotation. 4. Equivariant FCN [45]: Similar to FCN, but instead of using conventional CNNs, equivariant steerable CNNs are used. 5. Rot FCN [51, 52]: A FCN with 1-channel input and output, the rotation is encoded by rotating the input and output for each θ. In this section, we show the performance of SDQfD (which is shown to be better than DQN, ADET, and DQfD [43]. See the performance of DQN, ADET and DQfD in Appendix E) equipped with CNN ASR, Equi ASR, FCN, and Rot FCN. We evaluate SDQfD in the 12 environments shown in Fig. 6. Table 2 shows the number of objects, the optimal number of steps per episode, and the max number of steps per episode in the open-loop benchmark experiments. Before the start of training, 200 (500 for Covid Test) episodes of expert data are populated in the replay buffer. Figure 10 shows the results. Equivariant ASR (blue) has the best performance across all environments, then Rot FCN (green) and CNN ASR (red), and finally FCN (purple). Notice that Equivariant ASR is the only method that is capable of solving the most challenging tasks (e.g., Improvise House Building 3 and Covid Test).

346

D. Wang et al.

Table 3. The number of objects, optimal number of steps per episode, and max number of steps per episode in our Close-Loop 4D Benchmark experiments. Task

Block reaching Block picking Block pushing Block pulling Block in bowl Block stacking House building Corner picking Drawer opening Object grasping

Number of objects

1

1

1

2

2

2

2

1

1

5

Optimal number of steps 3

7

7

7

11

12

12

14

9

7

Max number of steps

50

50

50

50

50

50

50

50

50

5.2

50

Close-Loop 4D Benchmark

The Close-Loop 4D Benchmark requires the agent to solve the close-loop tasks shown ⊂ R5 , where in Fig. 9 in the 5-dimensional action space of (λ, x, y, z, θ) ∈ Aλ × Axyzθ δ the agent controls the positional displacement of the gripper (x, y, z), the rotational displacement of the gripper along the z axis (θ), and the open width of the gripper (λ). We provide the following baseline algorithms: (1) SAC [13], (2) Equivariant SAC [44], (3) RAD [25] SAC: SAC with data augmentation, (4) DrQ [23] SAC: Similar to (3), but performs data augmentation when calculating the Q-target and the loss, and (5) FERM [53]: A Combination of SAC and contrastive learning [26] using data augmentation. Additionally, we also provide a variation of SAC, SACfD [44], that applies an auxiliary L2 loss towards the expert action to the actor network. SACfD can also be used in combination with (2), (3), and (4) to form Equivariant SACfD, RAD SACfD, DrQ SACfD, and FERM SACfD. In this section, we show the performance of SACfD, Equivariant SACfD (Equi SACfD), Equivariant SACfD using Prioritized Experience Replay (PER [35]) and data augmentation (Equi SACfD + PER + Aug), and FERM SACfD. (See Appendix F for the performance of RAD SACfD and DrQ SACfD.) We use a continuous action space where x, y, z ∈ [−0.05m, 0.05m], θ ∈ [− π4 , π4 ], λ ∈ [0, 1]. We evaluate the various methods in the 10 environments shown in Fig. 9. Table 3 shows the number of objects, the optimal number of steps for solving each task, and the maximal number of steps for each episode. In all tasks, we pre-load 100 episodes of expert demonstrations in the replay buffer. Figure 11 shows the performance of the baselines. Equivariant SACfD with PER and data augmentation (blue) has the best overall performance followed by standard Equivariant SACfD (red). The equivariant algorithms show a significant improvement when compared to the other algorithms which do not encode rotation symmetry, i.e. CNN SACfD and FERM SACfD.

BulletArm

347

Fig. 11. The Close-Loop 4D benchmark results. The plots show the evaluation performance of the greedy policy in terms of the task success rate. The evaluation is performed every 500 training steps. Results are averaged over four runs. Shading denotes standard error.

6 Conclusions In this paper, we present BulletArm, a novel benchmark and learning environment aimed at robotic manipulation. By providing a number of manipulation tasks alongside our baseline algorithms, we hope to encourage more direct comparisons between new methods. This type of standardization through direct comparison has been a key aspect of improving research in deep learning methods for both computer vision and reinforcement learning. We aim to maintain and improve this framework for the foreseeable future adding new features, tasks, and baseline algorithms over time. An area of particular interest for us is to extend the existing suite of tasks to include more dynamic environments where the robot is tasked with utilizing tools to accomplish various tasks. We hope that with the aid of the community, this repository will grow over time to contain both a large number of interesting tasks and state-of-the-art baseline algorithms. Acknowledgments. This work is supported in part by NSF 1724257, NSF 1724191, NSF 1763878, NSF 1750649, and NASA 80NSSC19K1474.

348

D. Wang et al.

References 1. Andrychowicz, O.M., et al.: Learning dexterous in-hand manipulation. Int. J. Robot. Res. 39(1), 3–20 (2020) 2. Bellemare, M.G., Naddaf, Y., Veness, J., Bowling, M.: The arcade learning environment: an evaluation platform for general agents. J. Artif. Intell. Res. 47, 253–279 (2013) 3. Biza, O., Wang, D., Platt, R., van de Meent, J.-W., Wong, L.L.: Action priors for large action spaces in robotics. In: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, pp. 205–213 (2021) 4. Brockman, G., et al.: OpenAI Gym. arXiv preprint arXiv:1606.01540 (2016) 5. Cohen, T.S., Welling, M.: Steerable CNNs. arXiv preprint arXiv:1612.08498 (2016) 6. Coumans, E., Bai, Y.: Pybullet, a python module for physics simulation for games, robotics and machine learning. GitHub repository (2016) 7. Delhaisse, B., Rozo, L.D., Caldwell, D.G.: PyRoboLearn: a python framework for robot learning practitioners. In: CoRL (2019) 8. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009) 9. Depierre, A., Dellandr´ea, E., Chen, L.: Jacquard: a large scale dataset for robotic grasp detection. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3511–3516. IEEE (2018) 10. Devin, C., Gupta, A., Darrell, T., Abbeel, P., Levine, S.: Learning modular neural network policies for multi-task and multi-robot transfer. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2169–2176. IEEE (2017) 11. Fang, H.-S., Wang, C., Gou, M., Lu, C.: GraspNet-1billion: a large-scale benchmark for general object grasping. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11444–11453 (2020) 12. Gu, S., Holly, E., Lillicrap, T., Levine, S.: Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates. In: 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3389–3396. IEEE (2017) 13. Haarnoja, T., et al.: Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905 (2018) 14. Hausman, K., Springenberg, J.T., Wang, Z., Heess, N., Riedmiller, M.: Learning an embedding space for transferable robot skills. In: International Conference on Learning Representations (2018) 15. Hester, T., et al.: Deep q-learning from demonstrations. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018) 16. Huber, P.J.: Robust estimation of a location parameter. Ann. Math. Stat. 35(1), 73–101 (1964) 17. James, S., Freese, M., Davison, A.J.: PyRep: bringing V-REP to deep robot learning. arXiv preprint arXiv:1906.11176 (2019) 18. James, S., Ma, Z., Arrojo, D.R., Davison, A.J.: RLBench: the robot learning benchmark & learning environment. IEEE Robot. Autom. Lett. 5(2), 3019–3026 (2020) 19. Jiang, Y., Moseson, S., Saxena, A.: Efficient grasping from RGBD images: learning using a new rectangle representation. In: 2011 IEEE International Conference on Robotics and Automation, pp. 3304–3311. IEEE (2011) 20. Kalashnikov, D., et al.: Scalable deep reinforcement learning for vision-based robotic manipulation. In: Conference on Robot Learning, pp. 651–673. PMLR (2018) 21. Kempka, M., Wydmuch, M., Runc, G., Toczek, J., Ja´skowski, W.: ViZDoom: a doom-based AI research platform for visual reinforcement learning. In: 2016 IEEE Conference on Computational Intelligence and Games (CIG), pp. 1–8. IEEE (2016)

BulletArm

349

22. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 23. Kostrikov, I., Yarats, D., Fergus, R.: Image augmentation is all you need: regularizing deep reinforcement learning from pixels. arXiv preprint arXiv:2004.13649 (2020) 24. Lakshminarayanan, A.S., Ozair, S., Bengio, Y.: Reinforcement learning with few expert demonstrations. In: NIPS Workshop on Deep Learning for Action and Interaction, vol. 2016 (2016) 25. Laskin, M., Lee, K., Stooke, A., Pinto, L., Abbeel, P., Srinivas, A.: Reinforcement learning with augmented data. In: Advances in Neural Information Processing Systems, vol. 33, pp. 19884–19895 (2020) 26. Laskin, M., Srinivas, A., Abbeel, P.: Curl: contrastive unsupervised representations for reinforcement learning. In: International Conference on Machine Learning, pp. 5639–5650. PMLR (2020) 27. Lee, Y., Hu, E., Yang, Z., Yin, A.C., Lim, J.J.: IKEA furniture assembly environment for long-horizon complex manipulation tasks. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 6343–6349 (2021) 28. Li, R., Jabri, A., Darrell, T., Agrawal, P.: Towards practical multi-object manipulation using relational reinforcement learning. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 4051–4058. IEEE (2020) 29. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015) 30. Mnih, V., et al.: Human-level control through deep reinforcement learning. Nature 518(7540), 529–533 (2015) 31. Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., Abbeel, P.: Overcoming exploration in reinforcement learning with demonstrations. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. IEEE (2018) 32. Paszke, A., et al.: Automatic differentiation in PyTorch. In: NIPS Autodiff Workshop (2017) 33. Rohlfshagen, P., Lucas, S.M.: Ms Pac-Man versus ghost team CEC 2011 competition. In: 2011 IEEE Congress of Evolutionary Computation (CEC), pp. 70–77. IEEE (2011) 34. Rohmer, E., Singh, S.P., Freese, M.: V-REP: a versatile and scalable robot simulation framework. In: 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 1321–1326. IEEE (2013) 35. Schaul, T., Quan, J., Antonoglou, I., Silver, D.: Prioritized experience replay. arXiv preprint arXiv:1511.05952 (2015) 36. Silver, D., et al.: Mastering the game of go with deep neural networks and tree search. Nature 529(7587), 484–489 (2016) 37. Tassa, Y., et al.: DeepMind control suite. arXiv preprint arXiv:1801.00690 (2018) 38. Todorov, E., Erez, T., Tassa, Y.: MuJoCo: a physics engine for model-based control. In: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE (2012) 39. Togelius, J., Karakovskiy, S., Baumgarten, R.: The 2009 Mario AI competition. In: IEEE Congress on Evolutionary Computation, pp. 1–8. IEEE (2010) 40. Urakami, Y., Hodgkinson, A., Carlin, C., Leu, R., Rigazio, L., Abbeel, P.: DoorGym: a scalable door opening environment and baseline agent. arXiv, abs/1908.01887 (2019) 41. Vinyals, O., et al.: Starcraft II: a new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782 (2017) 42. Wang, D., Jia, M., Zhu, X., Walters, R., Platt, R.: On-robot policy learning with O(2)equivariant SAC. arXiv preprint arXiv:2203.04923 (2022) 43. Wang, D., Kohler, C., Platt, R.: Policy learning in SE(3) action spaces. In: Conference on Robot Learning, pp. 1481–1497. PMLR (2021)

350

D. Wang et al.

44. Wang, D., Walters, R., Platt, R.: SO(2)-equivariant reinforcement learning. In: International Conference on Learning Representations (2022) 45. Wang, D., Walters, R., Zhu, X., Platt, R.: Equivariant Q learning in spatial action spaces. In: Conference on Robot Learning, pp. 1713–1723. PMLR (2022) 46. Weiler, M., Cesa, G.: General E(2)-equivariant steerable CNNs. arXiv preprint arXiv:1911.08251 (2019) 47. Wohlkinger, W., Aldoma Buchaca, A., Rusu, R., Vincze, M.: 3DNet: large-scale object class recognition from CAD models. In: IEEE International Conference on Robotics and Automation (ICRA) (2012) 48. Yu, T., Quillen, D., He, Z., Julian, R., Hausman, K., Finn, C., Levine, S.: Meta-world: a benchmark and evaluation for multi-task and meta reinforcement learning. In: Conference on Robot Learning, pp. 1094–1100. PMLR (2020) 49. Zakka, K., Zeng, A., Lee, J., Song, S.: Form2Fit: learning shape priors for generalizable assembly from disassembly. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9404–9410. IEEE (2020) 50. Zeng, A., et al.: Transporter networks: rearranging the visual world for robotic manipulation. In: Conference on Robot Learning, pp. 726–747. PMLR (2021) 51. Zeng, A., Song, S., Welker, S., Lee, J., Rodriguez, A., Funkhouser, T.: Learning synergies between pushing and grasping with self-supervised deep reinforcement learning. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4238– 4245. IEEE (2018) 52. Zeng, A., et al.: Robotic pick-and-place of novel objects in clutter with multi-affordance grasping and cross-domain image matching. In: 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 3750–3757. IEEE (2018) 53. Zhan, A., Zhao, P., Pinto, L., Abbeel, P., Laskin, M.: A framework for efficient robotic manipulation. arXiv preprint arXiv:2012.07975 (2020) 54. Zhu, X., Wang, D., Biza, O., Su, G., Walters, R., Platt, R.: Sample efficient grasp learning using equivariant models. In: Proceedings of Robotics: Science and Systems (RSS) (2022) 55. Zhu, Y., Wong, J., Mandlekar, A., Mart’in-Mart’in, R.: Robosuite: a modular simulation framework and benchmark for robot learning. arXiv, abs/2009.12293 (2020)

Optimization-Based Online Flow Fields Estimation for AUVs Navigation Hao Xu1,2(B) , Yupu Lu1,2 , and Jia Pan1,2 1

Department of Computer Science, The University of Hong Kong, Hong Kong, China {hxu,yplu,jpan}@cs.hku.hk 2 Centre for Garment Production Limited, Hong Kong, China Abstract. The motion of an autonomous underwater vehicle (AUV) is aﬀected by its surrounding water ﬂows, so an accurate estimation of the ﬂow ﬁeld could be used to assist the vehicle’s navigation. We propose an optimization-based approach to the problem of online ﬂow ﬁeld learning with limited amounts of data. To compensate for the shortage of online measurements, we identify two types of physically meaningful constraints from eddy geometry of the ﬂow ﬁeld and the property of ﬂuid incompressibility respectively. By parameterizing the ﬂow ﬁeld as a polynomial vector ﬁeld, the optimization problem could be solved eﬃciently via semi-deﬁnite programming (SDP). The eﬀectiveness of the proposed algorithm in terms of ﬂow ﬁeld estimation is experimentally validated on real-world ocean data by providing performance comparisons with a related method. Further, the proposed estimation algorithm is proved to be able to be combined with a motion planning method to allow an AUV to navigate eﬃciently in an underwater environment where the ﬂow ﬁeld is unknown beforehand. Keywords: Optimization · Online estimation Autonomous underwater vehicle · Navigation

1

· Flow ﬁeld ·

Introduction

Autonomous underwater vehicles (AUVs) play an important role in many underwater tasks, such as structure inspection [16] and environment monitoring [13]. When these tasks are performed underwater, the motion of an AUV is subject to its ambient water ﬂows. Therefore, in order to achieve reliable motions, various motion planning algorithms considering the eﬀect of oceanic ﬂows have been developed [11,20]. They assume that the ﬂow ﬁeld information is available from geophysical ocean models, such as Regional Ocean Modeling System (ROMS) [26]. However, ﬂow ﬁelds from these models are only accessible in the form of discrete grids with a coarse spatial resolution in the order of kilometers, which are commonly not accurate enough for improving an AUV’s navigation performance in many underwater tasks. In addition, since such ocean models are generally initialized with in-situ measurements including high frequency (HF) Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25555-7 24. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 351–367, 2023. https://doi.org/10.1007/978-3-031-25555-7_24

352

H. Xu et al.

radar and ship-based acoustic measurements, they are only available in speciﬁc geographical regions. Therefore, in this work, we consider the problem of learning a ﬂow ﬁeld based on online observations of an AUV. Recently, online ﬂow ﬁeld estimation has been studied as an optimization problem (e.g. [12,23]). Given a set of ﬂow velocity measurements along an AUV’s trajectory, they use a least-squares optimization [23] or Gaussian process (GP) regression [12] to estimate a ﬂow ﬁeld. These methods achieve up-to-date estimates by observing local ﬂow conditions, but ignoring detailed physical characteristics of the ﬂow ﬁelds results in low generalization of their models. Although a novel kernel considering ﬂuid incompressibility was introduced in [12] to enhance the physical ﬁdelity of the estimated ﬂow ﬁeld, only the ﬂow ﬁeld in the vicinity of the AUV’s trajectory can be well estimated. However, during an AUV’s navigation only limited online data is available for ﬂow ﬁeld learning. Considering most motion planning algorithms require full knowledge of the ﬂow ﬁeld for appropriate control actions, it is desired that the estimated ﬂow model can generalize well with a small number of observations. Unfortunately, none of the aforementioned approaches is able to achieve that, so they are diﬃcult to be integrated into a planning algorithm to assist an AUV’s navigation. In this work, we also explore the problem in an optimization framework, but in addition to online data and the incompressibility condition, we identify additional constraints from the eddy geometry detected within the ﬂow ﬁeld and incorporate them into the estimation process through a convex manner, thus these physical insights improve the generalization of the ﬂow model in a computationally eﬃcient way.

Fig. 1. Satellite-based ocean currents observation from Ocean Surface Topography of Space-NASA [2]: the ocean environment is full of eddies with various radii ranging from centimeters to hundreds of kilometers.

In ﬂuid dynamics, an eddy describes the swirling pattern of ﬂow motion. As demonstrated in Fig. 1, eddies with various radii are densely distributed in the ocean. They play a vital role in grasping an understanding of ocean dynamics [25]. Therefore, in the past few years, the detection and analysis of ocean eddies have been actively studied. Since ocean eddies have prominent signatures in sea surface height (SSH) ﬁeld [6], there are plenty of methods developed to detect eddies based on SSH data [17,27]. With the advance of satellite-based remote sensing technique,

Optimization-Based Online Flow Fields Estimation

353

a global and regular monitoring of SSH with a relatively high resolution has become possible [1]. Therefore, it is an ideal way to acquire the spatial feature of a ﬂow ﬁeld with the eddies detected from its SSH data. The main contribution of this work is to propose a novel computational framework for estimating a ﬂow ﬁeld given sparse measurements of ocean ﬂows collected by an AUV. To better extrapolate these online measurements, we identify two types of physically meaningful constraints for the optimization problem from: (1) the property of ﬂuid incompressibility to determine a general principle for the ﬂow motion; (2) the geometry of the eddy boundary to prescribe spatial features of the ﬂow ﬁeld. These constraints can be imposed via semi-deﬁnite programming (SDP) by choosing a polynomial-based representation of the ﬂow ﬁeld. We validate the eﬀectiveness of our approach through experiments using real ocean data. The performance comparison with the incompressible-GP [12] highlights the advantages of our method in extrapolating the training data. And the applicability of our estimation approach on AUV navigation in an unknown ﬂow ﬁeld is demonstrated in a replanning framework that interleaves estimation and planning.

2

Related Works

The problem of ocean ﬂow ﬁeld estimation has been actively studied in previous works. In oceanography, the most common method is to directly simulate a dynamical model of the ocean described by spatio-temporal partial diﬀerential equations (PDEs) (e.g. [15,21]). Although such a physics-based model leads to an accurate estimate of ocean currents, numerically solving PDEs to simulate ocean processes is commonly computationally intensive. Thus it is unsuitable to be applied to AUV platforms with limited computing resources and computational time. Hackbarth et al. [9] proposed a framework for ﬂow ﬁeld estimation that uses an ensemble Kalman ﬁlter [4] to assimilate ﬂow velocity measurements from multiple AUVs into the simulation of a PDE-based computational ﬂuid dynamics (CFD) model. However, for reduction in computational burden, the CFD model is simulated with a coarse spatial resolution relative to the scale of an AUV’s motion, thus it is insuﬃcient for improving the AUV’s navigation performance in many underwater tasks. For fast computation and high resolution, there has been a growing interest in employing data-driven models to approximate ﬂow ﬁelds. Petrich et al. [23] propose a ﬂow ﬁeld model consisting of a uniform ﬂow component and a singular ﬂow component, whose parameters are identiﬁed based on a least-squares optimization using ﬂow measurements taken by AUVs. Although their low-complexity ﬂow ﬁeld model estimates ﬂows with higher resolution and computationally faster than PDE-based geophysical ocean models, sacriﬁcing detailed physical insights leads to lower generalization of the model, especially when the amount of data is limited. In our work, we parameterize a ﬂow ﬁeld as a polynomial vector ﬁeld and identify its parameters by minimizing a least-squares objective function. However, to improve the physical ﬁdelity of the model, we constrain

354

H. Xu et al.

the minimization to preserve ﬂuid incompressibility and regional eddy pattern of the ﬂow ﬁeld. Therefore, the ﬂow ﬁeld model produced by our proposed algorithm is able to provide realistic ﬂow estimates in a larger region surrounding the measurements, thus could be used to assist an AUV’s navigation. The property of ﬂuid incompressibility is also utilized by [12] to enhance the generalization of the estimated ﬂow ﬁeld. They develop a specialized kernel enforcing the incompressibility of ocean ﬂows in a Gaussian process (GP) regression scheme for ﬂow ﬁeld estimation. In our method, the polynomial-based representation of the ﬂow ﬁeld provides us with a more straightforward way to express incompressibility through aﬃne constraints. And by exploiting the idea from sum of squares optimization [3], the problem of identifying a polynomial vector ﬁeld subject to constraints of incompressibility condition and eddy ﬂow pattern is able to be formulated as a semi-deﬁnite program, which could be solved in a computationally eﬃcient manner. To compensate for the shortage of online data, another type of prior knowledge for the ﬂow ﬁeld estimation problem is an ensemble forecast, which refers to a set of ﬂow ﬁelds produced by simulating a geophysical ocean model with various initial conditions. From an ensemble forecast, a popular method in meteorology for distilling a single estimate is an ensemble Kalman ﬁlter [4]. However, the ﬁlter is computationally intensive, thus may be unsuitable for AUV applications. In order to make more eﬀective use of ensemble forecast information, Kong et al. [28] use kernel methods and singular value decomposition to identify a compressed model that keeps spatial correlations found from the ensemble data, and then improve the model by integrating online measurements of AUVs. Although their algorithm achieves a fast convergence rate, the accuracy of the estimates is highly bounded to the performance of the ensemble forecast. In our algorithm, eddy geometry information is leveraged as a prior knowledge of spatial features of the ﬂow ﬁeld. Compared with the ensemble forecast data from PDE-based ocean models, the eddies detected from SSH data directly observed by satellites are more accessible.

3

Problem Formulation

An AUV is considered to navigate in a two-dimensional (2D) underwater environment W2 = x ∈ R2 | (x, y) . We assume that the AUV is subject to a 2D vector ﬁeld of ﬂow velocities, which is deﬁned as a continuous mapping f : W2 → R2 . During the navigation of the AUV, a discrete set of online meaˆ i ) | i = 1, ..., k} is collected periodically, where x ∈ W2 surements T = {(xi , u is the AUV’s position updated through a global positioning system (GPS), and the ﬂow velocity u ∈ R2 at the location x is able to be measured by the sensors [18] such as Doppler Velocity Log equipped on the AUV. Considering the sensor noise, a Gaussian error ε is introduced to the ﬂow measurement: ˆ = u + ε, u

(1)

where ε ∼ N 0, σ 2 I2 with a standard deviation σ ∈ R and an identity matrix I2 ∈ R2×2 . Our goal is to use these online trajectory measurements T to

Optimization-Based Online Flow Fields Estimation

355

approximate a ﬂow ﬁeld model fˆ (x) in an optimization framework. So in our work, the problem of ﬂow ﬁeld estimation could be deﬁned as Problem 1. Given online trajectory measurements T , ﬁnd a smooth vector ﬁeld fˆ (·) by minimizing the following least-squares objective function: 2 ˆ ˆ i . arg min (2) f (xi ) − u fˆ∈C ∞

4

(xi ,ˆ ui )∈T

2

Flow Field Estimation

As most AUV motion planning algorithms assume full distribution of the ﬂow ﬁeld for planning, it is desired that the learned ﬂow ﬁeld fˆ (x) could generalize well over the entire workspace W2 . However, the objective function (Eq. 2) only decides how the estimated vector ﬁeld should behave on the training data, which could easily lead to overﬁtting when the number of the training data is limited. Therefore, in this section, we mainly identify additional prior knowledge related to the ﬂow ﬁeld besides online ﬂow measurements, and incorporate the knowledge into the optimization framework as constraints to compensate for insuﬃciency of trajectory measurements. To further deﬁne the optimization problem (Problem 1), we ﬁrst parameterize the planar ﬂow ﬁeld fˆ (x) to be a 2-dimensional polynomial vector ﬁeld (Equation 3 with n = 2). Polynomial is a reasonable way to parameterize the ﬂow ﬁeld as it is expressive enough to represent spatial variation of the ﬂows. More importantly, with the choice of polynomials, the objective function in (Eq. 2) will be a convex quadratic function of coeﬃcients of the polynomial ﬂow ﬁeld fˆ (x), and the identiﬁed constraints are able to be enforced to the optimization problem in a convex manner. Definition 1 (Polynomial vector field). An n-dimensional polynomial vector ﬁeld over a compact set Wn ⊂ Rn is a function F : Wn → Rn of the form F ∈ {(p1 , ..., pn ) | pi : Wn → R, i = 1, ..., n} ,

(3)

where pi (·) is a polynomial with real coeﬃcients for i = 1, ..., n. And the polynomial vector ﬁeld is homogeneous if all entries of F are homogeneous polynomials of the same degree. 4.1

Eddy-Based Constraint

An eddy is a dominant feature in oceanic ﬂow ﬁelds. We are able to detect eddies within a ﬂow ﬁeld through an eddy detection algorithm. Speciﬁcally, it is assumed that the sea surface height (SSH) data of a 2D environment W2 is available from a satellite-based ocean monitoring system [1] for the AUV. Then through an SSH-based eddy detection algorithm [17], we are able to obtain

356

H. Xu et al.

Fig. 2. Eddies detected in the Mediterranean Sea based on Sea Level Anomalies data (heatmap) provided by Copernicus Marine Environment Monitoring Service (CMEMS) [1]. Eddies are classiﬁed according to their rotational direction as either anticyclonic (red curve) if they rotate clockwise or cyclonic (blue curve) otherwise.

polygonal eddy boundary information E = {Bi ⊂ W2 | i = 1, ..., m} within the environment, where Bi is a discrete set consisting of vertices of an eddy boundary polygon, and m is the total number of the detected eddies. Figure 2 provides an example of eddy detection based on SSH data, with the red curve representing the boundary of anticyclonic eddies and the blue curve denoting the boundary of cyclonic eddies.

Fig. 3. The ﬂow velocity ﬁeld of a sub-region in Fig. 2, where the direction of the black arrow represents the ﬂow direction, and the length of the arrow indicates the velocity magnitude. The detected polygonal eddies boundary (blue and red curve) is approximated by periodic quadratic B-spline (purple dashed curve).

An eddy describes the swirling motion of water ﬂows, so the ﬂow vectors along the eddy boundary clearly show a rotating feature (shown in Fig. 3). Thus the eddy boundary can be used to constrain the direction of the estimated ﬂow velocity along the boundary during the learning process. In particular, if the boundary of an eddy B ∈ E is able to be approximated using a closed tangentcontinuous (C 1 ) spline deﬁned piecewise by n polynomial functions as n B= xj (t) : R → R2 | t ∈ [aj , aj+1 ] , j=1

(4)

Optimization-Based Online Flow Fields Estimation

357

where [aj , aj+1 ] ⊂ R, ∀i = 1, ..., n are knot intervals. The direction of the spline tangent x˙ (t) should be consistent to the rotational direction of the eddy, then the eddy boundary B indicates the following constraints to the ﬂow direction along the boundary: (5) |θ (t)| θτ , ∀t ∈ [aj , aj+1 ] , ∀j = 1, ..., n,

where θ (t) = ∠ fˆ (xj (t)) , x˙ j (t) denotes the angle between the estimated ﬂow

vector and the tangent vector of the eddy boundary curve, θτ ∈ [0, π/2] is an angle threshold. Ideally, the ﬂow velocity is desired to be tangent to the boundary (i.e. θτ = 0) for imposing stricter constraints related to the ﬂow direction. However, diﬀerent studies have diﬀerent deﬁnitions of what constitutes an eddy in an SSH ﬁeld, and the eddy detected from SSH data generally leaves inconsistent features in the ﬂow velocity ﬁeld, especially given the noise in the SSH data. Therefore, θτ is set to be tunable to constrain the ﬂow direction along the eddy boundary.

Fig. 4. Illustration of the eddy-based constraint.

As shown in Fig. 4, to constrain the direction of the estimated ﬂow fˆ (xj (t)) to be within a given angle range [−θτ , θτ ] from the tangent vector x˙ j (t), the ﬂow vector should be enforced to lie in the common region of the two half-planes (the green region in Fig. 4), which are deﬁned using two normal vectors rθ1 (t) and rθ2 (t). The normal vector rθ (t) is obtained by rotating the tangent vector, where θ ∈ {θ1 = π/2 − θτ , θ2 = θτ − π/2} is the rotation angle dependent on the angle threshold θτ . Then for each piece of the boundary spline j = 1, ..., n and each rotation angle θ ∈ {θ1 , θ2 }, Eq. 5 is equivalent to qj (t) 0, ∀t ∈ [aj , aj+1 ] ,

(6)

where qj = fˆ (xj (t)) · rθ (t). The normal vector rθ (t) = Mθ x˙ j (t) is calculated by rotating the tangent vector x˙ j (t) with a 2D rotation matrix Mθ ∈ R2×2 . Based on the polynomial approximation of the ﬂow ﬁeld and the eddy boundary, fˆ (xj (t)) and rθ (t) are both in the form of polynomials, so the dot product of which leads qj : R → R to be a polynomial function, whose coeﬃcients depend aﬃnely on the coeﬃcients of the ﬂow ﬁeld fˆ.

358

H. Xu et al.

Theorem 1. (Putinar’s Positivstellensatz [24]). Let Λ = {x ∈ Rn | gi (x) ≥ 0, ∀i = 1, ..., m} be a basic closed semialgebraic set deﬁned by real polynomials gi , and assume Λ satisﬁes the Archimedean property. For any real polynomial q, if q > 0, ∀x ∈ Λ, then m q (x) = σ0 (x) + σi (x) gi (x) , i=1

where σ0 , ..., σm are sum of squares (SOS) polynomial deﬁned as σ = some polynomials hi .

i

h2i for

Imposing the nonnegativity constraint (Eq. 6) directly will make the optimization problem NP-hard [19]. Nonetheless, Theorem 1 provides us with an algebraic suﬃcient condition for the nonnegativity of the polynomial qj (t) on the basic closed semialgebraic set Λj := {t ∈ R | aj+1 ≥ t ≥ aj }, which could be written as (7) q (t) = σ0 (t) + (t − aj ) σ1 (t) + (aj+1 − t) σ2 (t) , where σ0 , σ1 and σ2 are SOS polynomials. Clearly, if we succeed in ﬁnding the SOS polynomials in Eq. 7, the nonnegativity constraint (Eq. 6) will be exactly satisﬁed. And the searching problem of SOS polynomial accounts for a semideﬁnite program (SDP) [3,22], which is able to be solved to an arbitrary accuracy in a polynomial time [29]. 4.2

Incompressibility-Based Constraint

Incompressibility intuitively indicates that for every point of the ﬂow ﬁeld, the mass of ﬂuid entering is equal to the mass of ﬂuid leaving. An incompressible 2D ﬂow ﬁeld is a common model in oceanography to characterize the horizontal stratiﬁcation of oceanic currents [14,30]. Therefore, in order to enhance the physical ﬁdelity of the estimated polynomial model, we assume the ﬂow ﬁeld is incompressible in this work. According to the potential ﬂow theory [5], if a ﬂow ﬁeld is incompressible, the divergence of the ﬂow velocity is zero at all points in the ﬁeld: ∇ · f (x) = 0,

(8)

where ∇· is the divergence operator. Equation 8 allows us to represent the ﬂow ﬁeld with a single function φ : R2 → R called stream function. So in this work the ﬂow ﬁeld f (x) should be in the form of

∂φ(x) ∂φ(x) ,− f (x) = ∂y ∂x

T .

(9)

In order to enforce the incompressibility constraint (Eq. 9) into the optimization process, we parameterize the stream function φ as a polynomial. Unlike

Optimization-Based Online Flow Fields Estimation

359

the previous work [12] introducing a kernel with related to ﬂuid incompressibility, the polynomial parameterization provides a more straightforward way to express the incompressible feature by imposing the following aﬃne constraints T to the candidate polynomial vector ﬁeld fˆ (x) = [p1 (x) , p2 (x)] : p1 (x) =

∂φ (x) ∂φ (x) , p2 (x) = − . ∂y ∂x

(10)

If fˆ (x) is parameterized to be a homogeneous polynomial vector ﬁeld with degree d, then the stream function φ (x) should be a scalar-valued polynomial of d + 1 degree, whose coeﬃcients are additional optimization variables to the optimization problem (Problem 1).

(a)

(b)

(c)

(d)

Fig. 5. Streamplot of ﬂow ﬁelds estimated with diﬀerent constraint conditions: (a) no constraints, (b) incompressibility-based constraint, (c) incompressibility-based constraint and eddy-based constraint; (d) Ground-truth ﬂow ﬁeld, a sub-region of the ﬂow ﬁeld in Fig. 3. The green line represents a trajectory of an AUV with red points denoting sensor measurements.

Combining the least-squares objective function (Eq. 2) with the eddy-based constraint (Eq. 7) and the incompressibility-based constraint (Eq. 10), we end up with an SDP formulation of the ﬂow ﬁeld estimation problem. In Fig. 5, the beneﬁts of enforcing these constraints are illustrated by presenting the ﬂow ﬁeld estimates under diﬀerent constraint conditions. Figure 5(a) demonstrates that without any constraint only a smooth polynomial vector ﬁeld that overﬁts the training data is obtained. Then by adding the incompressibility-based constraint, a regional ﬂow pattern (saddle point) near the training samples is successfully reconstructed, as presented in Fig. 5(b). Although the incompressibility-based constraint determines a general principle for the ﬂow motion, only the ﬂows in the

360

H. Xu et al.

local area of the trajectory can be well estimated. Therefore, to further prescribe a remote spatial pattern of the ﬂow ﬁeld, we additionally impose the eddy-based constraint. By combining these two types of constraints, the ﬂow ﬁeld could be well estimated even with a limited number of ﬂow velocity measurements, as shown in Fig. 5(c).

5

Experiment

In this section, experiments are performed with real ocean data to demonstrate that our proposed algorithm achieves better performance in terms of ﬂow ﬁeld estimation when compared with an existing method. In addition, our proposed estimation algorithm is proven to be useful in enabling energy-eﬃcient navigation in an unknown ﬂow ﬁeld when incorporated into a replanning framework. The ocean data including ﬂow ﬁelds and sea surface height (SSH) data used in the experiments is provided by Copernicus Marine Environment Monitoring Service (CMEMS) [1]. 5.1

Flow Field Estimation

This part presents the performance comparation on estimating ﬂow ﬁelds between our proposed SDP-based method and the incompressible-GP method in [12]. In the experiments, we simulate an AUV navigating with a straight path, during which the vehicle is assumed to periodically update its position and measure its ambient ﬂow velocity. The standard deviation σ of Gaussian noise in ﬂow measurements is set to be ten percent of the ﬂow magnitude. Both methods are executed to estimate the ﬂow ﬁeld using the same trajectory of sensor measurements, and the experimental results are shown in Fig. 6, where the green lines represent trajectories of the AUV and red points denote the positions of sensor measurements. For the incompressible-GP method, we use the following Radial Basis Function Kernel to derive the incompressible kernel: 1 T (11) k (x, x ) = σf2 exp − 2 (x − x ) (x − x ) 2l with the two parameters σf and l optimized by maximizing the marginal likelihood for the training data. In Fig. 6(d) and Fig. 6(e), the ﬂow ﬁeld estimates with diﬀerent number of measurements are shown, from which we can observe that the estimate improves as the number of measurements increases. However, the results also demonstrate that only the local region of measurements is well estimated, while the remote regions from the trajectory are generally not in good agreement with the ground-truth ﬂow ﬁeld in Fig. 6(c). As discussed in Sect. 4, although the incompressibility condition is enforced to determine a general principle for the ﬂow motion, ignoring spatial features of the ﬂow ﬁeld will lead to poor generalization of the estimated model, especially when the amount of data is limited.

Optimization-Based Online Flow Fields Estimation

361

Fig. 6. Comparison between the proposed SDP-based method and the incompressibleGP. (a), (b): streamplot of ﬂow ﬁelds estimated by our SDP-based method after 4 and 12 measurements respectively; (d), (e): streamplot of ﬂow ﬁelds estimated by the incompressible-GP using the same measurements as (a) and (b) respectively; (c): ground truth of the ﬂow ﬁeld; (f): RMSE from the two estimation algorithms with the increasing number of measurements.

For our SDP-based estimation algorithm, to compensate for shortage of online measurements, eddy geometry is used to provide the spatial feature of the ﬂow ﬁeld. So in this experiment we use an eddy detection algorithm [17] taking SSH data as input to detect the eddies within the ﬂow ﬁeld. Then the detected eddy boundary is approximated using quadratic B-spline to provide the eddy-based constraint, of which the angle threshold θτ in (Eq. 5) is set to be 11◦ . Following our proposed approach, the candidate ﬂow ﬁeld is parameterized as a homogeneous polynomial vector ﬁeld with the degree d = 5 in our experiment. Then the formulated SDP is solved by the Conic Operator Splitting Method (COSMO) solver [8], and the results are shown in Fig. 6(a) and (b). Compared with ﬂow ﬁelds estimated by incompressible-GP using the same data (Fig. 6(d) and (e)), our proposed method achieves better performance on extrapolating the training samples. It should be observed that in Fig. 6(a) after only four measurements, the estimated ﬂow ﬁeld matches the ground truth with minor disparity. The combination of incompressibility-based constraint and eddy-based constraint leads to successful reconstruction of the remote eddy pattern, thus a better extrapolation for the ﬂow is achieved, which proves that these identiﬁed constraints are useful in assisting the task of ﬂow ﬁeld learning. To further analyze the convergence of the two estimation algorithms, we test the methods in 10 ﬂow ﬁelds with diﬀerent gyre patterns, and compute root mean squared error (RMSE) between the estimated and the true ﬂow ﬁeld with increasing number of sensor measurements. The RMSE is computed across the entire ﬂow ﬁeld. From Fig. 6(f), it could be observed that for both algorithms the RMSE decreases with the accumulation of data. However, our proposed SDP-based

362

H. Xu et al.

method achieves a faster convergence rate than the incompressible-GP method, and a lower value of RMSE with any amount of training samples. Therefore, Fig. 6(f) clearly indicates that our proposed method achieves better performance in terms of estimating a ﬂow ﬁeld over the incompressible-GP method.

Fig. 7. Estimate the ﬂow ﬁeld (Fig. 6(c)) using various value of θτ based on the same four measurements. (a): RMSE with the increasing value of θτ , in the yellow region (θτ ∈ [10, 50]) the algorithm achieves good performance with no sensitivity to the value of θτ ; (b–f): streamplot of ﬂow ﬁelds estimated by diﬀerent values of θτ respectively, where (b), (c) show the unsatisfying results with high RMSEs and (d–f) present the superior results with low RMSEs.

In our method, the eddy-based constraint is achieved by enforcing a upper bound θτ to the angle between the ﬂow vector and the eddy boundary tangent. Thus in our optimization framework the angle threshold θτ is a key parameter for the model to grasp the spatial feature of the ﬂow ﬁeld. We further investigate the inﬂuence of θτ on the ﬂow ﬁeld estimates. In Fig. 7(a), it could be observed that our algorithm achieves the excellent performance within a large angular region ([10◦ , 50◦ ] in Fig. 7(a)). Figure 7(d–f) present three cases within this region, where estimations successfully capture the ﬂow ﬁeld, with the eddy inside the constraint region. Comparatively, Fig. 7(b–c) present cases if we constrain the angle too tight (θτ = 5◦ ) or too loose (θτ = 60◦ ). The former leads to a non-existent ﬂow feature (saddle point) to satisfy the ﬂuid incompressibility, and the later cannot make sure that the majority of the eddy stays within the constraint region. Nonetheless, only over tight and over loose bound may degrade the performance of our algorithm. These results reveal that our algorithm is robust enough to alleviate parameter-tuning burden. 5.2

AUV Navigation

To validate the estimated ﬂow ﬁeld could be used to improve an AUV’s navigation performance in an unknown ﬂow ﬁeld, in this subsection we develop a

Optimization-Based Online Flow Fields Estimation

363

replanning framework that closes the loop between ﬂow ﬁeld estimation and motion planning. Then simulation experiments are performed to demonstrate the eﬀectiveness of our proposed ﬂow ﬁeld estimation method in AUVs’ navigation applications.

Algorithm 1: Replanning framework in an unknown ﬂow ﬁeld 1 2 3 4 5 6 7 8 9

Input: a start position xinit , a goal position xgoal ˆ init ← MeasureFlow (xinit ); u ˆ init )}; T ← {(xinit , u repeat fˆ ← EstimateFlowField (T ); η ← Planner(fˆ, xgoal ); xcurr ← MoveToNextWaypoint(η); ˆ curr ← MeasureFlow (xcurr ); u ˆ curr ); T .add (xcurr , u until the AUV reaches the goal;

The replanning framework is detailed in Algorithm 1. During navigation, when the AUV reaches a new waypoint, it measures the ﬂow velocity at the location, then uses the function EstimateFlowField () to estimate the ﬂow ﬁeld based on the accumulated data T . Then with this new estimation of the ﬂow ﬁeld, the vehicle replans a new trajectory η using the function Planner () and moves to the next waypoint according to the new trajectory. The ﬂow ﬁeld estimation and trajectory planning are executed alternately until the vehicle reaches the goal. An underwater vehicle is simulated to navigate from a start position to a goal position, subject to the eﬀect of the ﬂow ﬁeld in Fig. 6(c). Assuming the ﬂow ﬁeld is unknown beforehand, the AUV navigates under the replanning framework, in which the function EstimateFlowField () could be either our proposed SDP-base method or the incompressible-GP method. For comparison, the trajectory planned with perfect ﬂow ﬁeld information will also be provided. In the experiment, we use an A*-based planner [10] to plan for trajectories that minimize the total energy consumption of the AUV. The energy consumption of an AUV trajectory η : [0, T ] → R2 is given by E (η) =

T

Ph + Pp (t) dt.

(12)

0

Ph is the vehicle hotel load deﬁned as the average power of the computing and sensor system, and Ph (t) = αv 3 (t) denotes the propulsion power, where v (t) is the vehicle speed relative to the water, α is a parameter depending on the vehicle’s drag coeﬃcient and the eﬃciency of the propulsion system. Energy eﬃciency is critical in underwater environments, especially for AUVs with constrained energy budget deployed for long-range autonomous missions. In the experiments, we use the parameter of Autosub [7], which is an underwater glider

364

H. Xu et al.

widely used in taking oceanographic proﬁles, and the parameters are set as Ph = 0.5 W and α = 276.08. Table 1. Average Energy Cost (MJ) Optimal SDP (Ours) incompressible-GP Energy Cost 0.409

0.492

0.739

Fig. 8. Comparison between the replanning process based on the incompressible-GP (a–c) and our SDP method (d–f) respectively. The trajectories that have been tracked by the AUV is shown is green, where the red dots denote the ﬂow measurements. And yellow curves are the newly planned trajectories based on the current ﬂow ﬁeld estimates (streamplot).

The cost function (Eq. 12) encourages the AUV to go with the ﬂow to reduce relative motion between the ﬂow and the vehicle. So an accurate ﬂow estimate will lead the planner to generate an energy-eﬃcient path. Figure 9 shows three example paths resulting from various ﬂow ﬁeld estimation methods, in which “Optimal” refers to the expected optimal path planned based on the true ﬂow ﬁeld. And Fig. 9. Simulated paths planned based their corresponding replanning processes on diﬀerent ﬂow ﬁeld estimates and the are presented in Fig. 8, from which we true ﬂow ﬁeld respectively. could observe that the path generated from our SDP method has smaller disparity with the path “Optimal”. In contrast, the replanning process of incompressible-GP shown in Fig. 8 (a–c) indicates

Optimization-Based Online Flow Fields Estimation

365

that lacking the ability of predicting the global structure of the ﬂow ﬁeld may cause the newly planned trajectory to deviate from the optimal trajectory. Further, we compute the path cost to numerically compare the planning results. Table 1 provides the average energy cost of the trajectories for 50 experiments with random start-goal combinations, which demonstrates that our proposed method is useful in planning energy-eﬃcient trajectories for an AUV navigating in unknown ﬂow ﬁelds.

6

Conclusions

This work develops an optimization framework for estimating a ﬂow ﬁeld from a small number of online data. In order to improve the generalization of the model and avoid overﬁtting, eddy information and ﬂuid incompressibility are exploited in our algorithm in the form of convex constraints. The experiments with real-world oceanic data demonstrate the applicability of our approach to ﬂow ﬁeld learning, and further show that the method could be used to assist AUVs’ navigation when combined with a motion planning approach. In future work, we would like to explore the application of our method in an active sampling framework, which aims to improve the eﬃciency of modeling the ﬂow ﬁeld by selecting sampling locations. Acknowledgements. This work was supported by HKSAR Research Grants Council (RGC) General Research Fund (GRF) HKU 11202119, 11207818, and the Innovation and Technology Commission of the HKSAR Government under the InnoHK initiative.

References 1. Copernicus marine environment monitoring service. https://marine.copernicus.eu 2. Ocean surface topography of Space-NASA. https://sealevel.jpl.nasa.gov 3. Ahmadi, A.A., Khadir, B.E.: Learning dynamical systems with side information. In: Proceedings of the 2nd Conference on Learning for Dynamics and Control, vol. 120, pp. 718–727 (2020) 4. Evensen, G.: The ensemble Kalman ﬁlter: theoretical formulation and practical implementation. Ocean Dyn. 53(4), 343–367 (2003) 5. Fox, R.W., McDonald, A.T., Mitchell, J.W.: Fox and McDonald’s introduction to ﬂuid mechanics. John Wiley and Sons (2020) 6. Fu, L.L., Chelton, D.B., Le Traon, P.Y., Morrow, R.: Eddy dynamics from satellite altimetry. Oceanography 23(4), 14–25 (2010) 7. Furlong, M.E., Paxton, D., Stevenson, P., Pebody, M., McPhail, S.D., Perrett, J.: Autosub long range: a long range deep diving auv for ocean monitoring. In: 2012 IEEE/OES Autonomous Underwater Vehicles (AUV), pp. 1–7. IEEE (2012) 8. Garstka, M., Cannon, M., Goulart, P.: COSMO: A conic operator splitting method for large convex problems. In: European Control Conference (2019) 9. Hackbarth, A., Kreuzer, E., Schr¨ oder, T.: CFD in the loop: ensemble Kalman ﬁltering with underwater mobile sensor networks. In: International Conference on Oﬀshore Mechanics and Arctic Engineering, vol. 45400, p. V002T08A063. American Society of Mechanical Engineers (2014)

366

H. Xu et al.

10. Kularatne, D., Bhattacharya, S., Hsieh, M.A.: Time and energy optimal path planning in general ﬂows. In: Robotics: Science and Systems, pp. 1–10 (2016) 11. Kularatne, D., Bhattacharya, S., Hsieh, M.A.: Optimal path planning in timevarying ﬂows using adaptive discretization. IEEE Robot. Autom. Lett. 3(1), 458– 465 (2017) 12. Lee, K.M.B., Yoo, C., Hollings, B., Anstee, S., Huang, S., Fitch, R.: Online estimation of ocean current from sparse GPS data for underwater vehicles. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 3443–3449 (2019) 13. Ma, K.C., Liu, L., Sukhatme, G.S.: An information-driven and disturbance-aware planning method for long-term ocean monitoring. In: 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 2102–2108. IEEE (2016) 14. Ma, T., Wang, S.: Geometric theory of incompressible ﬂows with applications to ﬂuid dynamics, no. 119. American Mathematical Soc. (2005) 15. Madec, G., et al.: NEMO ocean engine (2017). https://hdl.handle.net/2122/13309 16. Mallios, A., Ridao, P., Ribas, D., Carreras, M., Camilli, R.: Toward autonomous exploration in conﬁned underwater environments. J. Field Robot. 33(7), 994–1012 (2016) 17. Mason, E., Pascual, A., McWilliams, J.C.: A new sea surface height-based code for oceanic mesoscale eddy tracking. J. Atmos. Oceanic Tech. 31(5), 1181–1188 (2014) 18. Merckelbach, L., Briggs, R., Smeed, D., Griﬃths, G.: Current measurements from autonomous underwater gliders. In: 2008 IEEE/OES 9th Working Conference on Current Measurement Technology, pp. 61–67 (2008) 19. Murty, K.G., Kabadi, S.N.: Some NP-complete problems in quadratic and nonlinear programming. Math. Program. 39(2), 117–129 (1987). https://doi.org/10. 1007/BF02592948 20. Nutalapati, M.K., Joshi, S., Rajawat, K.: Online utility-optimal trajectory design for time-varying ocean environments. In: 2019 International Conference on Robotics and Automation (ICRA), pp. 6853–6859. IEEE (2019) 21. Oke, P.R., et al.: Evaluation of a near-global eddy-resolving ocean model. Geoscientiﬁc Model Develop. 6(3), 591–615 (2013) 22. Parrilo, P.A.: Structured semideﬁnite programs and semialgebraic geometry methods in robustness and optimization. California Institute of Technology (2000) 23. Petrich, J., Woolsey, C.A., Stilwell, D.J.: Planar ﬂow model identiﬁcation for improved navigation of small AUVs. Ocean Eng. 36(1), 119–131 (2009) 24. Putinar, M.: Positive polynomials on compact semi-algebraic sets. Indiana Univ. Math. J. 42(3), 969–984 (1993). https://www.jstor.org/stable/24897130 25. Robinson, A.R.: Overview and summary of eddy science. In: Robinson, A.R. (ed.) Eddies in Marine Science. Topics in Atmospheric and Oceanographic Sciences, pp. 3–15. Springer, Heidelberg (1983). https://doi.org/10.1007/978-3-642-69003-7 1 26. Shchepetkin, A.F., McWilliams, J.C.: The regional oceanic modeling system (ROMS): a split-explicit, free-surface, topography-following-coordinate oceanic model. Ocean Model. 9(4), 347–404 (2005) 27. Sun, X., Zhang, M., Dong, J., Lguensat, R., Yang, Y., Lu, X.: A deep framework for eddy detection and tracking from satellite sea surface height data. IEEE Trans. Geosci. Remote Sens. 59(9), 7224–7234 (2020)

Optimization-Based Online Flow Fields Estimation

367

28. To, K., Kong, F.H., Lee, K.M.B., Yoo, C., Anstee, S., Fitch, R.: Estimation of spatially-correlated ocean currents from ensemble forecasts and online measurements. arXiv preprint arXiv:2103.04036 (2021) 29. Vandenberghe, L., Boyd, S.: Semideﬁnite programming. SIAM review 38(1), 49–95 (1996) 30. Zika, J.D., England, M.H., Sijp, W.P.: The ocean circulation in thermohaline coordinates. J. Phys. Oceanogr. 42(5), 708–724 (2012)

Flying Hydraulically Amplified Electrostatic Gripper System for Aerial Object Manipulation Dario Tscholl(B) , Stephan-Daniel Gravert, Aurel X. Appius, and Robert K. Katzschmann Soft Robotics Lab, ETH Zurich, Tannenstrasse 3, 8092 Zurich, CH, Switzerland {dtscholl,rkk}@ethz.ch https://srl.ethz.ch/ Abstract. Rapid and versatile object manipulation in air is an open challenge. An energy-eﬃcient and adaptive soft gripper combined with an agile aerial vehicle could revolutionize aerial robotic manipulation in areas such as warehousing. This paper presents a bio-inspired gripper powered by hydraulically ampliﬁed electrostatic actuators mounted to a quadcopter that can interact safely and naturally with its environment. Our gripping concept is motivated by an eagle’s foot. Our custom multi-actuator concept is inspired by a scorpion tail design (consisting of a base electrode with pouches stacked adjacently) and spider-inspired joints (classic pouch motors with a ﬂexible hinge layer). A hybrid of these two designs realizes a higher force output under moderate deﬂections of up to 25◦ compared to single-hinge concepts. In addition, sandwiching the hinge layer improves the robustness of the gripper. For the ﬁrst time, we show that soft manipulation in air is possible using electrostatic actuation. This study demonstrates the potential of untethered hydraulically ampliﬁed actuators in aerial robotic manipulation. Our proof of concept opens up the use of hydraulic electrostatic actuators in mobile aerial systems. (https://youtube.com/watch?v=7PmZ8C0Ji08) Keywords: Soft robotics · Hydraulically ampliﬁed electrostatic actuators · Soft gripper · Quadcopter · System integration · Bio-inspired · Aerial manipulation

1

Introduction

Soft robotics is a rapidly growing research ﬁeld that combines compliant and rigid materials to achieve motions similar to living organisms. Working with soft materials enables engineers to build technologies that closer mimic the design principles found in nature. Furthermore, soft materials make systems more capable of complying with many diﬀerent tasks while making them more user-friendly Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25555-7 25. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 368–383, 2023. https://doi.org/10.1007/978-3-031-25555-7_25

Flying Electrostatic Gripper

369

by allowing continuous deformation. This concept is particularly interesting for applications that require the interaction of a system with its environment. A continually growing and substantial market for robotic automation is found in warehousing. With aerial delivery drones capturing headlines, the pace of adoption of drones in warehouses has shown great acceleration. Warehousing constitutes 30% of the cost of logistics in the US. The rise of e-commerce, greater customer service demands of retail stores and a shortage of skilled labor intensify competition for eﬃcient warehouse operations [1]. A few companies like Soft Robotics Inc. already incorporate soft robotics in their automated production facilities. Soft robotics enables manipulating traditionally hard-to-grasp objects while being compliant with many diﬀerent sized and shaped bodies. Combining the advantage of soft robotic systems with aerial vehicles would make for a revolutionary approach to warehouse manipulation. We choose to use hydraulically ampliﬁed electrostatic actuators, also referred to as HASEL actuators, because they enable muscle-like performance while being lightweight and providing fast actuation speeds. Especially in ﬁelds where payload mass is critical, like in aeronautics, those properties become decisive. Since electrostatic actuators have been used almost exclusively in a laboratory environment, this paper aims to demonstrate the potential of HASELs as part of a more complex system to eventually tackle real industry needs. 1.1

Objectives

The main objective of this paper was to design and manufacture a soft robotic gripper for object manipulation using HASEL actuators. Our work aimed to demonstrate the potential of soft actuators on aerial vehicles and took a ﬁrst step toward developing entirely soft aerial systems. As a ﬂying platform, we used the RAPTOR quadcopter developed in parallel and explained in more detail in Sect. 2.3. The goal of RAPTOR is to dynamically pick and place objects in a swooping motion, which is very similar to an eagle’s way of hunting. When eagles hunt their prey, they mainly actuate the very front of their toes, known as talons, and barely displace their tarsus. Figure 1 illustrates a preliminary concept of how the said characteristic could be mirrored in a HASEL gripper design. The real strength of an eagle’s grip stems from his upper leg. However, as shown in [3], a bio-inspired, tendon-driven design would not be feasible for drone integration due to its large size. Consequently, a design inspired by [4] was sought after. 1.2

Related Work

Soft Robotics on Aerial Vehicles. The idea of implementing a gripper on an aerial vehicle [5] or speciﬁcally on a drone [6] is, in principle, not a novelty. Pioneering designs were rather large and made using classic rigid robotics. Furthermore, the capacity was minimal, only allowing to lift objects weighing up to 58 g. With time, the force output of newly proposed drone grippers increased substantially. Some designs even exceeded 30.6 N in terms of force output [7]. In recent times, soft robotics has been incorporated more frequently in aerial

370

D. Tscholl et al.

Fig. 1. Bio-inspired design where the left-hand image shows an eagle’s foot as taken from [2] and the right-hand illustration represents an early concept based on an eagle’s gripping mechanism. The main actuation for the talon is highlighted in both images in green. The actuator sits right behind the talon.

vehicles. The most common types feature tendon-driven grippers that focus on aggressive and dynamic grasping [8,9]. The downside of tendon-driven actuators is that they still require rigid motors and thus lack the potential for developing a fully compliant system in the future. Fully compliant systems are of special interest when interacting with humans and encountering, for example, localization problems. Soft actuators and a soft structure would substantially decrease the risk of harm in case of any anomaly. In soft robotics, ﬂuidic actuators are predominantly used because of their high force output [10]. Drone grippers of this type do exist and even demonstrate impressive perching capabilities [11]. However, ﬂuidic actuators are driven by pressurized gas or liquid [12] requiring a network of channels, tubes, and pumps. Those things add weight and generate losses in speed and eﬃciency due to ﬂuid transportation. That is why ﬂuidic actuators on aerial vehicles struggle with a trade-oﬀ between high actuation strain/pressure (determining the force and actuation frequency) and payload capacity. Hydraulically Amplified Self-Healing Electrostatic Actuator. Over the last few years, a new actuator type called the hydraulically ampliﬁed self-healing electrostatic actuator has shown great potential in many diﬀerent applications [13]. HASELs have a high power density and are eﬃcient when driven by a recuperating high voltage power supply. Unlike dielectric elastomer actuators (DEAs) [14], the dielectric in HASELs is a liquid that enables the actuator to selfheal itself immediately after an electric breakdown. Even multiple breakdowns won’t inﬂuence the functionality, assuming the polymer ﬁlm doesn’t take any damage. Besides that, HASEL actuators have shown high force outputs [15], fast actuation speeds and low power consumption [3], all while being lightweight [16]. HASEL actuators are often also referred to as artiﬁcial muscles because of their force/strain characteristics. Certain HASEL types even exceed muscle-like performance in speciﬁc areas, showing up to 24% of strain [17].

Flying Electrostatic Gripper

371

HASEL Gripper Concepts. The most straightforward approach for designing a HASEL gripper is to stack single Donut HASELs and have at least two stacks facing opposite to each other [12]. Changing the pouch from a dimple to a quadrant design allows for additional improvement in the overall strain performance when dealing with larger stacks [18]. Over time, new HASEL actuator concepts, such as planar HASELs, surfaced, leading to novel, more dynamic gripper designs. Unlike in the Donut HASELs, the electrodes of planar HASEL grippers are located at the end. A backbone plate or strain limiting layer is added to conﬁne the extension of the dielectric shell, leading to a unidirectional bending motion [18,19]. Applying electrodes on the sides instead of the ends results in a so-called High Strain-Peano-HASEL, a variation of a classic Peano-HASEL but exhibiting almost 40% more strain at low forces [17]. The High Strain-PeanoHASEL allows for a much denser actuator design along its length, but the large width of the actuation unit is not suitable for a compact gripper design. Current research focuses on improving the classic pouch motors [20] to achieve high speciﬁc torque outputs in combination with a lightweight design. Making use of bio-mimetic elements, certain HASEL’s have the ability to exceed a torque output of a servo motor with similar dimensions [4]. Showing an increase in strength by a factor of ﬁve compared to previous HASEL designs, this advancement poses a big step towards a potential use in more complex systems. Placing the actuators afar from the joints and inducing the force via tendons yields another gripper concept found in the ﬁeld of prosthetics. In that context, the Peano-HASELs can be seen as artiﬁcial muscles with the tendons acting as ligaments [3]. Despite the higher force output of Peano-HASELs [16], outweighing DC motors in terms of speed and energy consumption, the pinching force is considerably smaller. Moreover, the size requirement of the whole system prevents its use in aeronautical applications.

2

Methods

This section elaborates on the physics and challenges of working with hydraulically ampliﬁed electrostatic actuators. The manufacturing process of our actuators is described and we brieﬂy introduce the quadcopter used to ﬂy the gripper. 2.1

HASEL Challenges

The main challenge when working with HASELs is ﬁnding a sweet spot in the force/strain relation of the actuator under a certain operating voltage V. Because the force is proportional to the voltage squared as derived from the analytical model of a Peano-HASEL presented in [15], F =

w cos(α) 0 r V 2 4t 1 − cos(α)

(1)

372

D. Tscholl et al.

with α being half the angle spanned by the ﬁlled pouch under actuation and ε0 , the permittivity of free space, it is generally desired to operate at the highest voltage possible without causing any electrical breakdowns. On the other hand, when developing an aerial system that interacts with its environment, it’s typically better to stay within a lower voltage region while still ensuring suﬃcient strength and strain. Although the voltage has the biggest inﬂuence on the HASEL’s performance, as seen in Eq. 1, the pouch geometry and the shell material also play a signiﬁcant role [15]. In particular, parameters like the width w, the thickness t or the dielectric permittivity of the material εr are of importance. For strain improvements, the initial pouch length also becomes relevant. Because Maxwell pressure is independent of the electrode area, the actuation force and strain can be scaled by adjusting the ratio of electrode area to the total surface of the elastomeric shell. That means an actuator with larger electrodes displaces more liquid dielectric, generating a larger strain but a smaller force because the resulting hydraulic pressure acts over a smaller area. Conversely, an actuator with smaller electrodes displaces less liquid dielectric, generating less strain but more force because the resulting hydraulic pressure acts across a larger area [12]. 2.2

Actuator Manufacturing

The stencil and sealing pattern of the actuator were designed in CAD. Once all the components were manufactured, the polyester ﬁlm was sealed. It turned out that using Mylar 850, a co-extruded, one-side heat sealable polyester ﬁlm, worked best. Other materials like Hostaphan were also used throughout the many iterations, though resulting in signiﬁcantly weaker sealing lines while also requiring higher temperatures. For the sealing process, we used a Prusa 3D printer with a custom conﬁguration to provide the optimal sealing temperature at an adequate speed. After that, we reinforced the holes of the polymer ﬁlm with structural tape and added the electrodes using an airbrush. The described process is depicted in Fig. 2. The ﬁnal steps included ﬁlling the pouches using the Envirotemp FR3 Fluid as dielectric and attaching the leads to the electrodes with a two-component epoxy adhesive. We decided to forego a commonly used electrode channel to make the actuator more compact. Regarding the structural part of the gripper, we chose to 3D print all of the parts using PLA. Furthermore, we came up with a sandwich conﬁguration to make the ﬁnger more sturdy compared to previous work [4]. We used standard packing tape for the hinge layer that connects all the bottom pieces and guides the structure during rotation. To join the top with the bottom plates, we decided on simple tape to omit the need for screws to save weight and space. Lastly, we utilized the same adhesive for mounting the HASEL actuator to the backplate structure.

Flying Electrostatic Gripper

373

Fig. 2. Manufacturing process of a HASEL actuator. A) Sealing two Mylar sheets to obtain the desired actuator geometry. B) Sealed polymer ﬁlm with reinforced mounting holes. The holes primarily served the ﬁxation during the airbrushing process. C) Actuator is clamped between 3D printed stencil, ready for airbrushing.

2.3

RAPTOR Quadcopter

RAPTOR is an acronym that stands for Rapid Aerial Pick-and-Transfer of Objects by Robots. The system consists of a conventional quadcopter frame where a gripper can be attached to the bottom. [21] utilized a soft gripper to pick objects with diﬀerent geometries in a rapid swooping motion. The system architecture, seen in Fig. 3 allows for an eﬀortless integration of various gripper types.

Fig. 3. The quadcopter uses a motion capture system to determine its position in space. A ground computer sends both position and gripper commands to the onboard computer. From there, the commands are forwarded to the ﬂight controller and the gripper using a serial connection.

374

3

D. Tscholl et al.

Inverse vs. Classical Gripper Approach

When in ﬂight, birds of prey extend their feet behind, carrying them closely under their tail. Only when approaching prey do eagles bring their feet to the front and open up their talons. Observing this behavior sparked the idea of designing a prestressed system that would open upon actuation but remains closed by default. That concept is essentially the inverse process of a traditional gripper that contracts during operation. Because of that, we decided to call this concept ”inverse gripper”. We initially wanted to implement the prestretching in our prototype by using a spring. Since there are countless springs on the market and we were only interested in the prototype’s functionality, we decided to use a stretchable cord with a diameter of 1 mm. The cord performed very similarly to our traditional one-hinge actuator designs when loaded. Figure 4 shows the inverse gripper and a side-by-side comparison to a conventional ﬁnger design. Because of the increased bearing surface through prestressing the talon and the facilitated pulling due to the initial displacement, the inverse gripper showed a 14.2% greater maximum force output compared to a traditional design with the same actuator geometry. However, we did not pursue the inverse gripper concept beyond prototyping because of overall simplicity. Besides requiring a more elaborate design and fabrication, mounting the actuator outside the gripper increases the risk of it taking damage during ﬂight. More so in the early stages when drone and gripper controls are not yet optimized.

Fig. 4. Side-by-side comparison of a traditional and inverse gripper. A) Reverse gripper with a pouch length of 40 mm and width of 6 mm. In contrast to a classic gripper, the actuator is mounted on the outside of the toe. On the inside, a 1 mm thick cord was used for prestressing the system. B) & C) Visual comparison of the reverse gripper with a traditional model. The black tape is used to insulate the gripper to avoid arching at high voltages.

4

Actuator Types and Configurations

Instead of building a complete gripper with every iteration, which would have been time-consuming, we followed a bottom-up approach. That means we ﬁrst

Flying Electrostatic Gripper

375

focused on one hinge, then determined the optimal conﬁguration of one toe and ﬁnally built the entire system. The baseline for our joint optimization was a HASEL actuator with a pouch geometry of 40 × 10x10 mm, denoting the electrode length, electrode width, and the pouch width, respectively. Those values originated from a preliminary design study we conducted to ﬁnd proper dimensions as a starting-point. Since we aimed to design a system for an aerial vehicle, we intended to save as much weight and space as possible while not signiﬁcantly compromising the deﬂection or force output. Consequently, the length and width of the pouch played a signiﬁcant role. So, we divided the single-hinge optimization into two test cases: (a) Pouch Width Test (PWT) and (b) Pouch Length Test (PLT). 4.1

Pouch Width Test (PWT)

In the PWT series, the primary attention went towards the maximum deﬂection the actuator can achieve. As shown in [15], increasing the deﬂection of a PeanoHASEL is not a linear process. Instead, the free strain reaches a pinnacle before declining again. Therefore, we investigated how that observation translates from a linear to an angular HASEL actuator. We deﬁned pouch widths ranging from 6 mm to 14 mm for our experiment. If we take the relationship between the width of the electrode (Le ) and the width of the whole pouch (Lp ), we end up with the following range for our test case: 0.417 ≤ fe,P W T ≤ 0.625

and fe =

Le Lp

(2)

While the measured torque kept rising by increasing the pouch width, the actuator reached a maximum deﬂection of nearly 50◦ at a 12 mm pouch width. That equals an electrode coverage of fe,P W T 12 = 0.455. After that, the largest possible deﬂection of the talon decreased, as shown in Fig. 5. 4.2

Pouch Length Test (PLT)

Conversely, in the PLT series, the achievable deﬂection should not change by varying the electrode length. What does change is the force output. Tapering the electrodes when stacking multiple joints or actuator types would make for a more realistic toe design. Hence, we investigated how narrowing the electrode length impacted the gripping force. As predicted by Eq. 1, the torque and pouch length share a linear relation. Comparing the torque values from an actuator with a 40 mm to one with a 30 mm electrode length yields a loss of roughly 40%, which notably impairs the functionality of the gripper. As a result, tapering the electrodes when stacking multiple hinges or actuation types was not an option. 4.3

Scorpion Concept Test

An early example of a bio-mimicking planar HASEL is the scorpion tail concept. Hence, when mentioning ”Scorpion” throughout this paper, we are referring

376

D. Tscholl et al.

Fig. 5. Performance comparison of diﬀerent actuators from the pouch width test (PWT) series. A) Illustration of a PWT actuator. B) Exceeded torque from the individual actuator geometries at diﬀerent deﬂection angles. Second-order curve ﬁts approximate the measurement response.

to the design of having a base electrode with pouches stacked adjacently. The Scorpion is exciting because large deﬂections can be achieved while saving a lot of space due to the absence of electrodes between the pouches. For our Scorpion concept, we chose to further develop the initial design [18] by combining it with the assets of the spider-inspired joints [4]. We selected a base electrode with a height of 20 mm to avoid large pre-deﬂections. Further, we used two adjacent pouches, each with a width of 10 mm. The preliminary design study revealed that the dielectric struggles to prevail through the channel connecting the two pouches if its width is below 4 mm. One way to avoid that problem is to use a dielectric with better ﬂowability, like silicone oil, rather than the Envirotemp FR3. Narrowing the channel between the two pouches would possibly increase the performance by having less liquid blocking the hinge and larger cylinder areas. However, even with that beneﬁt, the silicone oil would likely not exceed the force output of a Scorpion containing the Envirotemp FR3. Hence, we selected a channel width of 5 mm and FR3 dielectric for our Scorpion prototype. Undergoing several actuation tests, the Scorpion demonstrated a deﬂection of more than 50◦ . The force output was satisfying with 0.2 N under 20◦ deﬂection. A side-by-side comparison of the performance between the Scorpion and the single-hinge concept with similar dimensions (fe,Scorpion = fe,P W T 10 ) demonstrated that the single-hinge actuator is superior until a deﬂection of around 19◦ . After that point, the two-pouch design delivered notably higher forces and contracted almost 30% more, as depicted in Fig. 6. 4.4

Complete Finger Configuration

After we determined the optimal pouch geometry for our actuator and compared the two main actuation types, we looked into diﬀerent ways of combining every-

Flying Electrostatic Gripper

377

Fig. 6. Performance of a two-pouch Scorpion actuator vs. a 10 mm pouch width (PWT) actuator. A) The PWT actuator performs better until reaching a deﬂection of around 19◦ . Subsequently, the scorpion demonstrates a higher force output combined with a deﬂection increase of about 30%. B) Shows the force measurement setup and process of a PWT actuator in idle state V0 = 0 kV and during actuation V1 = 10 kV.

thing to a complete toe conﬁguration. We found that the two gripper designs described below and illustrated in Fig. 7 both oﬀer great performance while being as compact as possible. I. In Fig. 6, the classic single-hinge design showed a force output of 0.2 N under a deﬂection of 20◦ . Connecting three of those in series should, in principle, yield a deﬂection of at least 80◦ while still generating a force of roughly 0.1 N, assuming a linear stacking behavior. Because of its similarity to a triple-pouch Peano-HASEL design, we abbreviated this concept to “Triple”. II. Because the Scorpion design demonstrated an exemplary deﬂection behavior, we fused it with a classical single-hinge design to further improve the deﬂection and increase the force output at the tip of the talon. We chose to name the said concept “Hybrid”. Switching the electrode material from BP-B311 to Electrodag 502, a carbon paint commonly utilized in similar publications, led to an overall force increase of up to 26.5%. After that, we compared the performance between the Triple and the Hybrid concept, as depicted in Fig. 8. It turned out that for slight deﬂections up to 26◦ , the Hybrid displays a higher force output. Subsequently, the Triple holds an advantage over the Hybrid until they reach a contraction of about 56◦ , where they meet again at a force output of around 0.025 N. Putting the two concepts into perspective with the classical single-hinge and Scorpion actuator gave rise to Fig. 9. It can be observed that the advantage of the Hybrid at lower deﬂections stems from the additional single-hinge in the front. Since the Scorpion part of the Hybrid takes care of the main deﬂection, the front actuator

378

D. Tscholl et al.

Fig. 7. Side-by-side comparison between the Hybrid and Triple concept. The illustrations and the picture in the middle visualize the idea and prototype of the chosen conﬁgurations, respectively. The dimensions for the Hybrid were 20/10 mm electrode width, 10/10/12 mm pouch width, and for the Triple, 10/10/10 mm electrode width and 12/12/12 mm pouch width. All values are in order, from base to tip.

Fig. 8. Performance comparison between the Hybrid and Triple conﬁguration. The electrodes of both actuators were made using Electrodag 502. A) Raw data comparison between the two proposed toe designs, indicating a clear advantage of the Hybrid. B) Evolution of the actuator’s exhibited force at diﬀerent angles. The Hybrid performs better under small deﬂections, while the Triple seems to have the upper hand at medium to large contractions.

only displaces minimally, nearly exerting its full force-potential. Looking at the performance of a single-hinge and comparing it with the output from the Triple implies a linear stacking behavior. Thus, explaining the consistency throughout actuation by sharing the deﬂection more or less equally among all three pouches.

5

Drone Integration

After identifying the Hybrid as the best gripper conﬁguration for our case, we designed the electronics compartment. As seen in Fig. 10, the system is composed

Flying Electrostatic Gripper

379

Fig. 9. Side-by-side comparison between the single-hinge, Scorpion, Hybrid and Triple design. The graph provides a comparative overview of the actuators’ performance relative to each other.

of a micro-controller (Arduino Mega), a step-down converter (to supply the HVPS and Arduino with 5 V) and a 3S 11.1 V / 500 mA h LiPo battery. The high voltage power supply (HVPS) [18] is hidden inside the box to provide the actuator with the necessary voltages of up to 10 kV. The electronic components are covered by self-designed snap-ﬁt cases for quick and easy access while oﬀering protection and cleaner cable management. The connection terminals distribute the high voltage and ground path through the wires leading to the gripper. The gripper is mounted in the middle by inserting its counterpart into the connector. 5.1

Flight Experiment

Finally, we combined our gripper system with the RAPTOR quadcopter. For safety reasons, we separated the drone’s power supply from the one of the gripper in case of any failure like an electrical breakdown. A successful aerial grasp, as in Fig. 11, requires time-critical actuation, which was achieved by using the existing RAPTOR system architecture described in Sect. 3. By installing a serial connection between the onboard computer and the gripper’s microcontroller, commands could be sent at the right time. Once the serial actuation signal is detected, the microcontroller transmits a pulse width modulation (PWM) to the buck converter’s MOSFET on the HVPS, activating the optocoupler. The high voltage charge then ﬂows to the H-bridge (used for high voltage switching) connected to the gripper and actuates it. The mechanical connection between

380

D. Tscholl et al.

Fig. 10. Untethered power supply and control system of the gripper. The gripper is operated by an Arduino Mega. A 3S LiPo battery with a step-down module powers the Arduino and the high voltage power supply that provides the gripper with the necessary voltage.

Fig. 11. Full system with Triple actuator concept demonstrating aerial object manipulation. The drone picks and places an object weighing 76 g with a gripper operating voltage of 8 kV.

the drone and the gripper was accomplished by inserting two carbon rods that passed through small connectors on both subsystems. As a test object, we used a stuﬀed animal a bit larger than the size of a ﬁst weighing 76 g. We had to

Flying Electrostatic Gripper

381

Fig. 12. Full gripper assembly in the eagle conﬁguration with the Hybrid actuator concept. The system is shown from a side and top view, respectively.

ﬁxate the plush with a small stick to prevent it from being blown away by the downwash caused by the quadcopter propellers. It had to be taken into account that the actuation time to fully contract the toes took notably longer using the untethered HVPS than the TREK 20/20C-HS used for static ground tests. That is because the bandwidth and the actuation speed depend on how much charge per unit time can be supplied to the actuator’s electrode. Hence, when aiming towards grasping objects in passing, a more eﬃcient HVPS should be considered. Once again leading to a trade-oﬀ between size and performance.

6

Conclusion and Future Work

We proposed a bio-inspired HASEL gripper which we integrated as part of a bigger system for the ﬁrst time. We investigated various pouch geometries and conﬁgurations leading us to a satisfying trade-oﬀ between force output and deﬂection. We showed that by combining diﬀerent actuator types into a hybrid design, the actuator could achieve a higher force output under small deﬂections compared to a more traditional Triple concept. Figure 12 shows the eagle-inspired gripper. As opposed to our original idea, the bio-characteristics were less predominant in the ﬁnal design. Through testing, it became clear that narrowing down the area of rotation to mimic an eagle’s talon actuation will not yield a suﬃcient force output. Since our system did not require perching capabilities, we departed from the hallux (back toe) in later versions. Doing so yielded an equally spaced arrangement of the four toes, evident in Fig. 11, which enabled us to better employ their individual force contributions and improve redundancy. Stationary experiments and in-ﬂight tests conﬁrmed the gripper’s performance of being able to pick and place objects weighing up to 76 g without the use of

382

D. Tscholl et al.

additional material to increase friction. Compared to previous HASEL grippers, our hinges proved to be signiﬁcantly more sturdy and durable, even enduring small crashes. In addition, we did not experience any electrical breakdowns during several in-ﬂight tests, even at operating voltages of up to 10 kV. With a working proof of concept at hand, weight and space savings could be tackled in the future. That includes redesigning the HVPS and selecting a diﬀerent-sized microcontroller. Furthermore, a step-down converter that is powerful enough to keep the supplied voltage constant during operation would improve the gripping strength, especially when traveling long distances. Another approach could be redesigning the gripper’s structure using soft materials, similar to [22], to create an entirely soft system. Finally, improving the speciﬁc force and torque output of HASELs remains an open challenge that will signiﬁcantly impact the scalability of future actuators.

References 1. Companik, E., Michael, G., Farris, M.T.: Feasibility of warehouse drone adoption and implementation. J. Transport. Manage. 28, 5 (2018) 2. Avian Report: All about bald eagle talons. Accessed 12 Jun 2022 3. Yoder, Z., et al.: Design of a high-speed prosthetic ﬁnger driven by peano-hasel actuators. Front. Robot. AI 7, 181 (2020) 4. Kellaris, N., et al.: Spider-inspired electrohydraulic actuators for fast, soft-actuated joints. Advanced Sci. 8(14), 2100916 (2021) 5. Pounds, P.E.I., Bersak, D.R., Dollar, A.M.: Grasping from the air: hovering capture and load stability. In: 2011 IEEE International Conference on Robotics and Automation, pp. 2491–2498 (2011) 6. Kruse, L., Bradley, J.: A hybrid, actively compliant manipulator/gripper for aerial manipulation with a multicopter. In: 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), pp. 1–8 (2018) 7. Setty, K., van Niekerk, T., Stopforth, R.: Generic gripper for an unmanned aerial vehicle. Enhancing design through the 4th Industrial Revolution Thinking. Procedia CIRP 91, 486–488 (2020) 8. Fishman, J., Ubellacker, S., Hughes, N., Carlone, L.: Dynamic grasping with a soft drone: from theory to practice. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4214–4221 (2021) 9. Roderick, W.R.T., Cutkosky, M.R., Lentink, D.: Bird-inspired dynamic grasping and perching in arboreal environments. Sci. Robot. 6(61), eabj7562 (2021) 10. Trimmer, B.: A practical approach to soft actuation. Soft Robot (SoRo) 4(1), 1–2 (2017) 11. Miron, G., B´edard, B., Plante, J.-S.: Sleeved bending actuators for soft grippers: a durable solution for high force-to-weight applications. Actuators 7(3), 40 (2018) 12. Acome, E., et al.: Hydraulically ampliﬁed self-healing electrostatic actuators with muscle-like performance. Science 359(6371), 61–65 (2018) 13. Rothemund, P., Kellaris, N., Mitchell, S.K., Acome, E., Keplinger, C.: HASEL artiﬁcial muscles for a new generation of lifelike robots-recent progress and future opportunities. Adv. Mater. 33(19), 2003375 (2021) 14. Pelrine, R., Kornbluh, R., Pei, Q., Joseph, J.: High-speed electrically actuated elastomers with strain greater than 100%. Science (New York, N.Y.) 287(5454), 836-839 (2000)

Flying Electrostatic Gripper

383

15. Kellaris, N., Venkata, V.G., Rothemund, P., Keplinger, C.: An analytical model for the design of Peano-HASEL actuators with drastically improved performance. Extreme Mech. Lett. 29, 100449 (2019) 16. Kellaris, N., Venkata, V.G., Smith, G.M., Mitchell, S.K., Keplinger, C.: PeanoHASEL actuators: muscle-mimetic, electrohydraulic transducers that linearly contract on activation. Science Robot. 3(14), eaar3276 (2018) 17. Wang, X., Mitchell, S.K., Rumley, E.H., Rothemund, P., Keplinger, C.: High-strain Peano-HASEL actuators. Adv. Func. Mater. 30(7), 1908821 (2020) 18. Mitchell, S.K., et al.: An easy-to-implement toolkit to create versatile and highperformance HASEL actuators for untethered soft robots. Advanced Sci. 6(14), 1900178 (2019) 19. Park, T., Kim, K., Oh, S.-R., Cha, Y.: Electrohydraulic actuator for a soft gripper. Soft Rob. 7(1), 68–75 (2020). PMID: 31549923 20. Niiyama, R., Sun, X., Sung, C., An, B., Rus, D., Kim, S.: Pouch motors: printable soft actuators integrated with computational design. Soft Rob. 2(2), 59–70 (2015) 21. Appius, A., et al.: Raptor: rapid aerial pickup and transport of objects by robots. ArXiv, vol. abs/2203.03018 (2022) 22. Gravert, S.-D., et al.: Planar modeling and sim-to-real of a tethered multimaterial soft swimmer driven by Peano-HASELs (2022)

Control

Hybrid Quadratic Programming - Pullback Bundle Dynamical Systems Control Bernardo Fichera(B) and Aude Billard École polytechnique fédérale de Lausanne, Lausanne, Switzerland [email protected] Abstract. Dynamical System (DS)-based closed-loop control is a simple and eﬀective way to generate reactive motion policies that well generalize to the robotic workspace, while retaining stability guarantees. Lately the formalism has been expanded in order to handle arbitrary geometry curved spaces, namely manifolds, beyond the standard ﬂat Euclidean space. Despite the many diﬀerent ways proposed to handle DS on manifolds, it is still unclear how to apply such structures on real robotic systems. In this preliminary study, we propose a way to combine modern optimal control techniques with a geometry-based formulation of DS. The advantage of such approach is two fold. First, it yields a torque-based control for compliant and adaptive motions; second, it generates dynamical systems consistent with the controlled system’s dynamics. The salient point of the approach is that the complexity of designing a proper constrained-based optimal control problem, to ensure that dynamics move on a manifold while avoiding obstacles or self-collisions, is “outsourced” to the geometric DS. Constraints are implicitly embedded into the structure of the space in which the DS evolves. The optimal control, on the other hand, provides a torque-based control interface, and ensures dynamical consistency of the generated output. The whole can be achieved with minimal computational overhead since most of the computational complexity is delegated to the closed-form geometric DS.

Keywords: Dynamical system programming

1

· Diﬀerential geometry · Quadratic

Introduction

Reactive planning and control in the face of perturbation or sudden changes in the environment are key requirements in robotics. Dynamical System (DS) - based closed loop control has emerged as an eﬀective technique to generate reactive policies. In a DS-based control, a policy is represented as a vector ﬁeld f : Rn → Rn mapping a state-space variable of the robotic system x ∈ Rn to an action in terms of velocity x˙ ∈ Rn , that the controlled system has to follow in order to achieve a certain task; i.e. x˙ = f (x). The usage of DS in robotic control is advantageous, since it allows it to embed in a single closed-form law all the possible trajectories to accomplish a task. This yields instantaneous adaptation c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 387–394, 2023. https://doi.org/10.1007/978-3-031-25555-7_26

388

B. Fichera and A. Billard

and recalculation of trajectories, providing robustness in face of both spatial and temporal perturbations. Learning from Demonstration (LfD), a data-driven approach to learn DS bootstrapping from few observations of demonstrated tasks, has been in the last decades the main ﬁeld of development of non-linear stable DS for closed-loop control. Starting from the Stable Estimator of Dynamical Systems (SEDS), the proposed approaches were gradually capable of learning more and more complex DS, retaining stability guarantees, either reﬁning the stability constraints within an optimization problem, [8,12], or adopting advanced diﬀeomorphism techniques, [9]. However all these approaches assume that the DS evolves along an Euclidean metric space. Data encountered in robotics are characterized by varied geometries: For instance, joint angles lie on a d-dimensional torus (T d ) in revolving articulations, 3-dimensional Special Euclidean groups (SE(3)) for rigid body motions, orientations represented as unit quaternions (S 3 ) and diﬀerent forms of symmetric positive deﬁnite matrices such as inertia, stiﬀness or manipulability ellipsoids. Further, constraints may restrict the robot to a free sub-manifold which is diﬃcult to represent explicitly. For example, an obstacle in the workspace can produce an additional topological hole that geodesics or “default” unforced trajectories should avoid. Often such constraints are handled by constrained optimization. However, constrained optimization scales poorly with the number and complexity of non convex constraints, making real-time Model Predictive Control (MPC) impractical. Sampling-based model predictive control, [1,14], despite avoiding to solve explicitly an optimal control problem, heavily relies on the number and the length of trajectories, hence compromising accuracy for computation performances. An alternative approach is to use Riemannian Motion Policies, a geometrical control framework designed to account for potentially non-Euclidean conﬁguration and task spaces [10]. This approach not only has the advantage of handling arbitrary space geometry but can also account for the presence of the obstacles by locally deforming the space. However, correctly designing metrics to account for such deformation directly in the conﬁguration manifold can be diﬃcult. The same applies for scenarios where multiple goals have to be taken into consideration. This motivated the usage of a tree structure where each node represents a sub-task connected to a primary task, [4]. This approach “split” the overarching goal in a series of subtasks, each described by a particular DS on a certain manifold. The overall DS representing the primary task is recovered via the so-called DS pullback. One of the main limitations of the this approach is the lack of geometric consistency. This translates in a control law that yields diﬀerent output depending on the particular representation of a certain manifold structure. To address this concern, [3] introduced the Pullback Bundle Dynamical Systems (PBDS) framework for combining multiple geometric task behaviors into a single robot motion policy, while maintaining geometric consistency and stability. This approach provides a principle diﬀerential geometry description of

Hybrid QP-Pullback Bundle DS

389

the pullback operation for DS. It oﬀers an easier and more intuitive way of designing geometrical entities, such as metrics on sub-task manifolds, and correctly captures the manifold structure due the geometric consistency property. Furthermore such approach proposes a clean and eﬀective way to decouple the sub-task policy from the task priority metric design. Geometric formalization of such approach can be derived by considering more general Finsler geometries, [11,13]. Yet one of the main limitations is the “absence” of the controlled system. While the overall PBDS is potentially capable of producing desired acceleration in conﬁguration space, the designed DS law is agnostic of the robot model and generate infeasible dynamics. Inverse dynamics (ID) torque control approaches based on Quadratic Programming (QP), [6], have gained increasing popularity by providing compliant motions and robustness to external perturbations. Similarly to classical optimal control approaches, ID with QP achieves accurate feasible trajectories by imposing constraints to satisfy model’s dynamics. By considering acceleration and eﬀort as optimization variables, the optimal control problem can be formulated in a quadratic form. This oﬀers high performance with guarantee of convergence to global optimum. In this paper, we propose a hybrid Quadratic Programming Pullback Bundle Dynamical System yielding the following advantages: 1) torque-based control for compliant and adaptive motions; 2) model based approach for dynamical consistent motion. We propose a framework where much of the complexity of a classical QP problem can be avoided and “outsourced” to the geometrical DS framework. In this setting, constraints deriving from sub-manifolds motions or obstacle avoidance can be omitted into the formulation of the QP control problem because already embedded in the geometrical DS. On the other hand the limitations of geometrical DS such as the lack of velocity, acceleration, and control limits as well as dynamical constraints can be handled eﬀectively by the QP control.

2

Background

Our notation follows [5]. We employ the Einstein summation convention in which repeated indices are implicitly summed over. Given a set, M, and a Hausdorﬀ and second-countable topology, O, a topological space (M, O) is called a ddimensional manifold if ∀p ∈ M : ∃U ∈ O : ∃x : U → x(U) ⊆ Rd , with x and the chart x−1 continuous. (U, x) is a chart of the manifold (M, O). x is called 1 d d (p), . . . , x (p) into the R Euclidean map; itmaps p ∈ M to the point x(p) = x space. x1 (p), . . . , xd (p) are known as the coordinate maps or local coordinates. We refer to a point in Rd using the bold vector notation x, dropping the explicit dependence on p ∈ M. xi is the i-th local coordinate of x ∈ Rd . We denote with M a diﬀerentiable Riemannian manifold, that is a manifold endowed with a (0, 2) tensor ﬁeld, g, with positive signature. We refer to g as a Riemannian metric.

390

B. Fichera and A. Billard

Let M be a Riemannian manifold. In local coordinates a second-order DS on M is expressed as ¨ x

G−1

˙ x

∇φ ˙ x k k i j ak k x ¨ + Γij x˙ x˙ = − g ∂a φ − Dm x˙ m , Ξ

D

(1)

k where Γijk are the Christoﬀel symbols, φ a potential function on M and Dm the components of the dissipative matrix; see [2]. Equation 1 can be expressed in the following vectorial form:

˙ ¨ = f (x, x) ˙ = −G−1 ∇φ − Dx˙ − Ξ x. x

(2)

Let Q and X be, respectively, a m-dimensional and a n-dimensional Riemannian manifolds, parametrized with respective local coordinates q and x. We endow X with the Riemannian h; we use the capital notation H to refer to the metric in matrix notation. The pullback of a DS onto X to Q reads as b

˙ − J˙q˙ − ΞJ q˙ . ¨ = −H −1 (∇φ + DJ q) Jq

(3)

If the Jacobian, J, is injective everywhere, the solution of Eq. 3 can be recovered via least mean square as an analytical solution to the optimization problem 2 q − b . minq¨ J¨

3

Method

The control structure proposed is shown in Fig. 1. Given as input both the con˙ and the “primary” task, (q, q), ˙ space state, the geometric DS ﬁguration, (q, q), (“PBDS” block) generates both conﬁguration and task space desired acceleration. This information is processed by a QP controller (“QP” block) that, taking

M ODEL

P BDS

¨ x ¨ q

QP

τ

SY ST EM

¨ q, q

FK

Fig. 1. Block diagram of the control structure. (PBDS) the geometric DS generating the desired accelerations; (QP) Quadratic Programming controller; (SYSTEM) the actual controlled robotic system; (MODEL) dynamical model of the controlled robotic system; (FK) the forward kinematics used to provide the task space to the geometric DS.

Hybrid QP-Pullback Bundle DS

391

advantage of the dynamical model of the controlled system (“MODEL” block), yields the correct conﬁguration space torques necessary to achieve the desired trajectory. The equations of motion and the constraint equations for an articulated robot system can be described as ˙ = Sτ M (q)¨ q + h(q, q)

(4)

˙ is the sum of gravitational, centrifugal where M (q) is the inertia matrix, h(q, q) and Coriolis forces, S is a selection matrix and τ is a vector of joint torques. T ˙ the equations of motion are linear in [¨ Given a state, (q, q), q τ] . The Quadratic Programming problem has the following structure min z

1 T z Wz + wT z s.t. CE z + cE = 0, 2

CI z + cI ≥ 0.

(5)

The unknown, z, and constraints, CE , cE , CI and cI , are problem speciﬁc. In T our setting we have z = [¨ q τ ξ] as unknown variable. ξ is a slack variable to relax the hard constraint imposed on the inverse dynamics as it will be clariﬁed later. In order to deﬁne the cost functional we adopt the following matrices ⎡ ⎤ Q 0 0

T qd Q 0 0 . (6) W = ⎣ 0 R 0⎦ , wT = −¨ 0 0 I Given the matrices W and w the cost in Eq. 5 is quadratic form that tries to ¨−q ¨ d , between the current system acceleration and the minimize the distance, q desired acceleration provided by the geometric DS, while trying to minimize the eﬀort, τ , as well. We solve the inverse dynamics problem imposing it as a relaxed constraint with some slack variable ξ. The equality constraints matrices are ˙ h(q, q) M −S 0 , cE = CE = (7) J 0 I ¨ d − J˙q˙ x We use CI and cI to fullﬁl velocity, acceleration and torque constraints. No other “environment” related constrained is required since this part will be taken care from the geometric DS. The desired joint acceleration is produced by the “PBDS” block, see [3] for details. Equation 3 can be extended to an arbitrary number of task submanifolds, Fig. 2a. Let fi : Q → Xi be i-th continuous map between the base space Q and the i-th target space Xi . On each of the target spaces Xi takes place a second order DS, of the type given in Eq. 2. The DS parameters to be deﬁned are: the mapping fi , the metric of the target space H, the potential energy φ and dissipative coeﬃcients a weighted least mean square prob D. Considering ¨ − bi 2Wi , where Wi ∈ Rsi ×si is task-weighting lem of the type minx¨ i Ji x matrix with si = dimNi , we can derive an analytical solution for the second order dynamical systems on the base space Q as ¨= q

i

JiT Wi Ji

−1 i

JiT Wi bi

.

(8)

392

B. Fichera and A. Billard

Fig. 2. Tree diagram of the manifolds structure.

This process can be easily extended to arbitrary long “tree” of connected spaces where each target space may represent a base space for the following layer of spaces. The tree structure of the P BDS block is shown in Fig. 2b. It is clear that only on leaf nodes T( ·)( ·) can be user-deﬁned a second order DS. On the intermediate nodes, Xi or Ni , as well as for the primary node, Q, the ﬂowing DS is automatically determined via the pullback operation. We highlight that, even when combined with the QP problem in Eq. 5, the geometrical DS in Eq. 8 retains all the stability properties. The QP in Eq. 5 solves (iteratively) an acceleration tracking problem generating the joint level torques that try to minimize the diﬀerence between the current joint acceleration and the desired acceleration produced by the geometrical DS. Further analysis and theoretical study should be dedicated to the convergence properties. Nevertheless when facing completely controllable/observable and co-located control problems, e.g. robotics arms or articulated systems, we believe that convergence does not represent a main issue, [6,7]. Provided that the geometrical DS generates joint limits respecting trajectories we expect, at least for low frequency motion, good convergence properties.

4

Preliminary Results

We tested the presented idea in scenario of constraint motion onto a sub-manifold in presence of obstacles. In Fig. 3a it is depicted the simulated environment. The end-eﬀector of a KUKA IWAA 14 has to follow a second order DS evolving on the sphere while avoiding obstacles along the path. Figure 3b illustrated the structure of the manifolds tree designed in order to accomplish the desired task. At the top of the tree we have the conﬁguration manifold, Q, of controlled robotic system. It follows the Special Euclidean group, SE(3). The map between Q and SE(3) is the forward kinematics of the robot. Next we chose as sub-manifold for the constraint motion the 2-dimensional sphere, S 2 . The map between SE(3) and S 2 performs a retraction onto the sphere for any point x ∈ R3 and x S 2 , while it project rotation matrices R ∈ SO(3) such that the orientation of the end-eﬀector remains perpendicular to the sphere. For the sphere we deﬁne to two sub-manifolds tasks: one for the

Hybrid QP-Pullback Bundle DS

393

Fig. 3. Simulated Environment: (a) the end-eﬀector KUKA IWAA 14 follows a second order DS on the sphere while avoiding obstacles located on the sub-manifold; (b) manifolds tree to generate desired accelerations. Table 1. Parameters of the Sphere Bundle. Task manifold Mapping

Metric

P=R

S 2 → R : p → dist(p, pgoal ) 1

D = S2

S 2 → S 2 : p → p

gS

Oi = R

S 2 → R : p → dist(p, pobs )

Potential Dissipation Weight metric ||p||2

−dp˙ I

1

0

0

1

exp(a/(bxb )) 0

0

1

2

Fig. 4. Pullback second order DS on S 2 (a) 3 sampled trajectories; the color gradient represents the potential function used to generate the elastic force towards the desired attractor. (b) x, y, z position and velocity signals. (c) End-eﬀector trajectory tracking of the geometric DS.

potential energy, P, and one for the dissipative forces, D. In addition arbitrary number of sub-manifolds tasks, Oi , can be added in order to account for the presence of obstacles. Table 1 the parameters for each sub-manifolds. Figure 4 shows the pullback DS at level of S 2 . On the left, 3 trajectories are sampled starting from diﬀerent initial velocities. The convergence of the DS towards the attractor is shown on the center. Figure 4c shows the results about the quality of the end-eﬀector trajectory tracking. In continuous line the it is reported the second order DS at level of S 2 ; the dashed shows the end-eﬀector motion. For x and z axis the results shows an acceptable tracking error. y axes on the other hand reports poor results due to the presence of joint limits constraints.

5

Conclusion

We introduced a structure to eﬀectively perform closed-loop control into potentially non-Euclidean settings included but not limited to obstacle avoidance

394

B. Fichera and A. Billard

scenarios. Although very simple in its strategy, we believe that the introduce approach represents an eﬀective way of taking into consideration within the control loop of potential dynamical constraints imposed by the robotic system at hand. The introduction of an iterative optimization process might be seen as a contradiction in DS framework that makes of the close-form approach a guarantee of reactivity. Nevertheless the Quadratic Programming nature of the low level optimization imposes a negligible sacriﬁce in terms of performance and it adds the capability of generating feasible trajectories.

References 1. Bhardwaj, M., et al.: STORM: an integrated framework for fast joint-space modelpredictive control for reactive manipulation. In: 5th Annual Conference on Robot Learning (2021) 2. Bullo, F., Lewis, A.D.: Geometric Control of Mechanical Systems: Modeling, Analysis, and Design for Simple Mechanical Control Systems. Texts in Applied Mathematics, Springer-Verlag, New York (2005). https://doi.org/10.1007/978-1-48997276-7 3. Bylard, A., Bonalli, R., Pavone, M.: Composable geometric motion policies using multi-task pullback bundle dynamical systems. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 7464–7470 (2021) 4. Cheng, C.-A., et al.: RMPflow : a computational graph for automatic motion policy generation. In: Morales, M., Tapia, L., Sánchez-Ante, G., Hutchinson, S. (eds.) WAFR 2018. SPAR, vol. 14, pp. 441–457. Springer, Cham (2020). https://doi.org/ 10.1007/978-3-030-44051-0_26 5. Manfredo do Carmo: Riemannian Geometry. Theory and Applications Mathematics, Birkhäuser, Boston (1992) 6. Feng, S., Whitman, E., Xinjilefu, X., Atkeson, C.G.: Optimization based full body control for the atlas robot. In: 2014 IEEE-RAS International Conference on Humanoid Robots, pp. 120–127 (2014) 7. Feng, S., Xinjilefu, X., Huang, W., Atkeson, C.G.: 3D walking based on online optimization. In: 2013 13th IEEE-RAS International Conference on Humanoid Robots (Humanoids), pp. 21–27 (2013) 8. Figueroa, N., Billard, A.: A physically-consistent Bayesian non-parametric mixture model for dynamical system learning. In: Conference on Robot Learning (CoRL), pp. 927–946 (2018) 9. Rana, M.A., Li, A., Fox, D., Boots, B., Ramos, F., Ratliﬀ, N.: Euclideanizing ﬂows: diﬀeomorphic reduction for learning stable dynamical systems. In: Learning for Dynamics and Control, pp. 630–639. PMLR (2020) 10. Ratliﬀ, N.D., et al.: Riemannian motion policies. arXiv:1801.02854 [cs] (2018) 11. Ratliﬀ, N.D., et al.: Generalized nonlinear and ﬁnsler geometry for robotics. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), pp. 10206–10212 (2021) 12. Ravichandar, H.C., Salehi, I., Dani, A.P.: Learning partially contracting dynamical systems from demonstrations. In: Conference on Robot Learning, pp. 369–378 (2017) 13. Van Wyk, K., et al.: Geometric fabrics: generalizing classical mechanics to capture the physics of behavior. IEEE Robot. Autom. Lett. 7(2), 3202–3209 (2022) 14. Williams, G., Aldrich, A., Theodorou, E.A.: Model predictive path integral control: from theory to parallel computation. J. Guid. Control. Dyn. 40(2), 344–357 (2017)

Riemannian Geometry as a Unifying Theory for Robot Motion Learning and Control No´emie Jaquier(B) and Tamim Asfour Karlsruhe Institute of Technology, Institute for Anthropomatics and Robotics, 76131 Karlsruhe, Germany {noemie.jaquier,asfour}@kit.edu Abstract. Riemannian geometry is a mathematical ﬁeld which has been the cornerstone of revolutionary scientiﬁc discoveries such as the theory of general relativity. Despite early uses in robot design and recent applications for exploiting data with speciﬁc geometries, it mostly remains overlooked in robotics. With this blue sky paper, we argue that Riemannian geometry provides the most suitable tools to analyze and generate well-coordinated, energy-eﬃcient motions of robots with many degrees of freedom. Via preliminary solutions and novel research directions, we discuss how Riemannian geometry may be leveraged to design and combine physically-meaningful synergies for robotics, and how this theory also opens the door to coupling motion synergies with perceptual inputs. Keywords: Motion generation

1

· Riemannian geometry · Synergies

Motivation

The last years have seen the emergence of various, complex robotics systems with increased number of degrees-of-freedom (DoF). These include humanoid and quadrupedal robots, exoskeletons and robotic hands, among others. Despite recent progress, these robots remain to be actively deployed in our everyday life. Among the challenges that remain unsolved is the generation of well-coordinated, energy-eﬃcient, and reliable robot motions under any circumstances. This problem is further exacerbated by high number of DoF. As opposed to robots, humans are naturally able to generate skillful motions. For instance, they eﬃciently plan optimal trajectories while coping with the redundancy of their high-dimensional conﬁguration space. Moreover, these dexterous motions usually adapt to the perceived environment. Following insights from neurosciences, roboticists have designed biologically-inspired solutions to cope with the redundancy of robotic systems. In particular, the notion of muscle or movement synergies—coherent co-activation of motor signals [7]—has been widely used to generate motions of highly-redundant robots such as robotic Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-25555-7 27. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 395–403, 2023. https://doi.org/10.1007/978-3-031-25555-7_27

396

N. Jaquier and T. Asfour

hands [9,22] and humanoid robots [10,11]. Synergies oﬀer an elegant and powerful alternative to classical control schemes, as a wide range of motions is simply generated by combining few well-selected canonical synergies. However, synergies have often been extracted using linear methods, e.g., principal component analysis (PCA), which disregard the nonlinear nature of human and robot conﬁguration spaces. Latent variable models (LVMs) based on Gaussian processes [21] or auto-encoders [25] have recently been used to cope with this nonlinearity. However, the resulting synergy spaces are physically meaningless and hard to interpret, thus limiting the applicability of these methods. Moreover, although synergies have been recently adapted to object properties in robot grasping [25], linking synergies with perceptual inputs still remains an open problem. Instead, the nonlinear nature of mechanical systems such as humans and robots is suitably described by Riemannian geometry. Indeed, the conﬁguration space of any multi-linked mechanical system can be identiﬁed with a Riemannian manifold, i.e., a smooth curved geometric space incorporating the structural and inertial characteristics of the system [4]. On this basis, Riemannian computational models of human motion planning were recently proposed [3,14,16]. These models aim at uncovering the optimization principles underlying human motion generation, and thus the key mechanisms coping with the redundancy of the human body. They suggest that human motions are planned as geodesics, i.e., minimum-muscular-eﬀort (or shortest) paths, in the conﬁguration space manifold. As shown by Neilson et al. [16], each geodesic constitutes a coherent coordination of joint angles with minimum energy demand, and thus corresponds to a geodesic synergy. In that sense, human point-to-point motions follow a single geodesic synergy [3], while complex movements may result from sequences [14] or compositions [16] of these. Despite the soundness of Riemannian geometry to describe conﬁguration spaces and the attractive properties of geodesic synergies, it remains overlooked by the robotics community. In this paper, we contend that Riemannian geometry oﬀers the most suitable mathematical tools to (i) extract interpretable nonlinear movement synergies for robotics, (ii) combine them to generate dexterous robot motions, and (iii) intertwine them with various perceptual inputs. Inspired by insights from human motion analysis [3,14,16], we propose preliminary solutions that exploit Riemannian geometry for the design of meaningful synergies which account for the structural and inertial properties of high-DoF robots. We then describe our envisioned line of research, as well as the important challenges to be tackled. Finally, we discuss how geodesic synergies have the potential to contribute to the deployment of robots in our everyday life not only by reducing their energy requirements, but also by generating natural and interpretable motions of humanoid and wearable robots, which adapt to perceptual inputs. It is worth noticing that geometric methods have been successfully applied to robotics from early on, e.g., for robot design [18], or for formulating robot kinematics and dynamics models [23]. Additionally, Riemannian methods recently gained interest in robot learning and control to handle data with particular geometries [12,13], capture relevant demonstration patterns [2], or combine multiple simple policies [20]. These works exploit Riemannian geometry at data

Riemannian Geometry as a Unifying Theory for Robot Motion

397

level to learn or deﬁne task-speciﬁc policies, and thus are complementary to the ideas presented hereafter. In contrast, this paper focuses on the potential of human-inspired, physically-meaningful geometric representations of low-level robot actions (i.e., joint coordinations), and their coupling to perception.

2

Geodesic Synergies: Basics and Proof of Concept

In this section, we ﬁrst brieﬂy introduce the mathematical tools underlying the notion and use of geodesic synergies. For in-depth introductions to Riemannian geometry and geometry of mechanical systems, we refer the interested reader to, e.g., [5,15], and to [4], respectively. We then provide examples of geodesic synergies and preliminary solutions to use them for robot motion generation. 2.1

Riemannian Geometry of Mechanical Systems

Riemannian Manifolds. The conﬁguration space Q of a multi-linked mechanical system can be viewed as a Riemannian manifold, a mathematical space which inherently considers the characteristics of the system. A n-dimensional manifold M is a topological space which is locally Euclidean. In other words, each point in M has a neighborhood which is homeomorphic to an open subset of the ndimensional Euclidean space Rn . A tangent space Tx M is associated to each point x ∈ M and is formed by the diﬀerentials at x to all curves on M passing through x. The disjoint union of all tangent spaces Tx M forms the tangent bundle TM. A Riemannian manifold is a smooth manifold equipped with a Riemannian metric, i.e., a smoothly-varying inner product acting on TM. Given a choice of local coordinates, the Riemannian metric is represented as a symmetric positive-deﬁnite matrix G(x), called a metric tensor, which depends smoothly on x ∈ M. The Riemannian metric leads to local, nonlinear expressions of inner products and thus of norms and angles. Speciﬁcally, the Riemannian inner product between two velocity vectors u, v ∈ Tx M is given as u, vx = u, G(x)v. The Configuration Space Manifold. Points on the conﬁguration space manifold correspond to diﬀerent joint conﬁgurations q ∈ Q. The manifold Q can be endowed with the so-called kinetic-energy metric [4]. Speciﬁcally, the metric tensor G(q) is equal to the mass-inertia matrix of the system at the conﬁguration q ∈ Q. In this sense, the mass-inertia matrix, i.e., the Riemannian metric, curves the space so that the conﬁguration manifold accounts for the nonlinear inertial properties of the system. Figure 1a illustrates the eﬀect of the Riemannian metric, represented by ellipsoids G(q) at diﬀerent joint conﬁgurations q ∈ Q, on the conﬁguration space of a 2-DoF planar robot. Intuitively, the kinetic energy ˙ G(q)q ˙ at q is high for velocities q˙ ∈ Tq M following the ellipsoid major k = 12 q, axis, and low along the minor axis. This implies that, in the absence of external force, trajectories of robot joints are not straight lines as in Euclidean space, but instead follow geodesics, i.e., generalization of straight lines to manifolds. Geodesics. Similarly to straight lines in Euclidean space, geodesics are minimum-energy and minimum-length, constant-velocity curves on Riemannian

398

N. Jaquier and T. Asfour

(a)

Riemannian

(b)

Geodesic motions.

(c)

Euclidean motions.

Fig. 1. Illustration of the Riemannian conﬁguration space of a 2-DoFs planar robot. (a) The Riemannian metric G(q) ( ), equal to the robot mass-inertia matrix, curves the space. The energy is reﬂected by its volume ∝ det(G). Minimum-energy trajectories between two joint conﬁgurations correspond to geodesics ( ), which diﬀer from Euclidean paths ( ). (b)-(c) Robot conﬁgurations along the trajectories depicted in (a). The motions reﬂect the diﬀerences between geodesic and Euclidean paths (see 1, 3, 4).

manifolds. They solve the following system of second-order ordinary diﬀerential equations (ODE), obtained by applying the Euler-Lagrange equations to the kinetic energy gij (q)¨ qj + Γijk q˙j q˙k = 0, (1) j jk ˙ represents the inﬂuence of Coriolis forces, gij where jk Γijk q˙j q˙k = ci (q, q) ∂gjk ∂gik ij denotes the (i, j)th entry of G, and Γijk = 12 ∂g are the Christof∂qk + ∂qj − ∂qi fel symbols of the ﬁrst kind. In other words, geodesic trajectories are obtained ¨ solution of (1) at each conﬁguration q(t) by applying the joint acceleration q(t) ˙ with velocity q(t) along the trajectory. Note that (1) corresponds to the stan˙ q˙ + τg (q) = τ in the absence of gravity dard equation of motion G(q)q¨ + C(q, q) (τg (q) = 0) and external forces (τ = 0). Thus, a geodesic corresponds to a passive trajectory of the system. Geodesic trajectories of a 2-DoF planar robot are displayed in Fig. 1. As geodesics naturally result in coherent co-activations of joints, they can straightforwardly be used to deﬁne synergies, as explained next. 2.2

Robot Motion Generation with Geodesic Synergies

As previously mentioned, geodesics are the solution to the system of ODEs (1) and are therefore completely determined by their initial conditions. In other words, one simply needs to deﬁne the initial joint conﬁguration q(0) ∈ Q and

Riemannian Geometry as a Unifying Theory for Robot Motion

399

Fig. 2. Illustrations of geodesic synergies for the right arm of the humanoid robot ARMAR-6 [1]. (a)–(b) Arm motion (from transparent to opaque) and hand trajectory ) resulting from individual synergies. (c)–(f ) Arm motion and hand trajectory ( , ( ) obtained from diﬀerent combinations of the two geodesic synergies (a) and (b).

˙ velocity q(0) ∈ Tq (0) Q and solve the corresponding initial value problem (1) to obtain a minimum-energy trajectory in the robot conﬁguration manifold. The resulting joint coordination is called a geodesic synergy [16]. Similarly to motion synthesis with classical synergies, novel behaviors can be obtained by combining several known geodesic synergies. However, their Riemannian nature must be taken into account, so that the combinations still result in meaningful, minimum-energy joint trajectories. Let us ﬁrst consider two geodesic synergies with the same initial position q0 and diﬀerent initial (1) (2) velocities q˙0 , q˙0 ∈ Tq 0 Q. In this case, novel motions can be obtained by solving (1) with the initial velocity deﬁned as a weighted sum of the velocities of (1) (2) the individual synergies, i.e., q˙0 = w1 q˙0 + w2 q˙0 . However, geodesic synergies may not share a common initial position and it may be desirable that the robot starts its motion at a diﬀerent initial conﬁguration. In these cases, an additional step must be taken before summing the initial velocities. Indeed, as explained in Sect. 2.1, in Riemannian geometry, each velocity vector q˙ ∈ Tq Q is linked to a speciﬁc position q as it lies on its tangent space Tq Q. Therefore, in order to be (1) (2) combined, the velocities q˙0 ∈ Tq (1) Q, q˙0 ∈ Tq (2) Q must ﬁrst be transported 0 0 onto the tangent space of the initial conﬁguration q0 . This is achieved via parallel transport, a well-known operation in Riemannian geometry. Figure 2 presents a proof of concept illustrating the generation of novel motions (Fig. 2c–2f) via the combination of two geodesic synergies (Fig. 2a–2b). It is important to notice that combinations of geodesic synergies are themselves geodesics. Moreover, the presented approach applies to any number of synergies. In particular, combinations of N geodesic synergies with metric-orthogonal initial velocities (i.e., whose

400

N. Jaquier and T. Asfour

Riemannian inner product is 0) can generate any motion of a N -DoFs robot. A video of the geodesic synergies of Fig. 2 including comparisons with traditional PCA-based linear synergies is available at https://youtu.be/XblzcKRRITE.

3

Vision and Challenges

As illustrated previously, geodesic synergies oﬀer a compelling solution to generate energy-eﬃcient, well-coordinated robot motions. Indeed, they inherit the beneﬁts of the classical synergies, as a wide range of motions can simply be generated by combining few well-selected synergies. Moreover, they are speciﬁcally designed to account for the nonlinearities arising from the intrinsic dynamics of robotics systems. In this sense, geodesic synergies are physically meaningful, interpretable, and contribute to the explainability of the generated robot behavior. This contrasts with the PCA-based Euclidean synergies commonly used in robotics. Moreover, geodesic synergies directly encode dynamic robot motions, as opposed to postural and kinematic synergies, which disregard the robot dynamics. Furthermore, geodesic synergies result in trajectories with minimum-energy demand for the robot. This becomes especially relevant for energy-constrained systems such as robotic prosthesis, exoskeletons, humanoid robots, and mobile robots in general. Finally, according to the recent studies [3,14,16] covered in Sect. 1, robot motions produced by geodesic synergies follow the same principle as in human motion planning. Therefore, they may lead to more natural motions of wearable robots that comply with human motion generation mechanisms and thus may improve their assistive abilities. However, as discussed next, several challenges are yet to be tackled for leveraging geodesic synergies in robotics. Selection of Geodesic Synergies. Among the ﬁrst challenges to solve for a successful application to robot motion generation is the extraction and selection of relevant geodesic synergies. To do so, we envision two diﬀerent approaches. First, geodesic synergies may be learned from humans. Namely, we contend that geodesic synergies may be extracted from human motions using Riemannian dimensionality reduction methods such as principal geodesic analysis (PGA) [8,24], the Riemannian equivalent of PCA. Similarly to PCA-based approaches [9–11,22], geodesic synergies may be given by the ﬁrst n principal geodesics accounting for a given proportion of the information contained in the motion. Notice that such an analysis may also oﬀer novel perspectives on understanding the mechanisms of human motion generation. An important challenge would then be to transfer these biological geodesic synergies to robots, while accounting for the diﬀerences between the human and the robot conﬁguration manifold. Although this may be relatively straightforward for robots with anthropomorphic design, this generally remains an open problem. A second approach to extract synergies would be to directly design them for the robot at hand. This may be achieved, e.g., by deﬁning a orthonormal basis in the robot conﬁguration manifold and select the most relevant directions (under some criteria) as geodesic synergies for the robot’s movements.

Riemannian Geometry as a Unifying Theory for Robot Motion

401

Interacting with the Environment. Another challenge to be tackled in the context of geodesic synergies is to handle external inﬂuences arising from interactions with the environment or external perturbations. This is important as robot motions should not only remain optimal under all circumstances, but also be reliably safe for the user. We hypothesize that two research directions may be followed and coupled to tackle this challenge. First, the motion of the robot may be adapted to external conditions by artiﬁcially shaping the Riemannian metric, and which adapt the the geodesic synergies accordingly. To do so, we may take inspiration from methods developed around Riemannian motion policies [6,20]. This ﬁrst approach may typically be used for the robot to cope with an additional load, e.g., an manipulated object. Second, we advocate for the design of Riemannian optimal controllers based on geodesic synergies, which would allow the robot to react appropriately to external events. Geodesic Synergies for Action-Perception Couplings. To achieve motions adapted to a wide variety of tasks, the activation of synergies should be designed according to the current task and environment state. In other words, synergies, i.e., low-level robot actions, should be coupled to the perception. We hypothesize that geodesic synergies may be key components of novel, interpretable perception-action couplings in technical cognitive systems thanks to their Riemannian nature. Namely, taking inspiration from neurosciences [17,19], we suggest that robot perception spaces may be understood as geometric spaces and eventually endowed with Riemannian metrics, thus being identiﬁed as Riemannian manifolds. Through the lens of Riemannian geometry, perception inputs may then be intertwined with geodesic synergies via geometric mappings between their respective manifolds. Although important challenges such as the design of Riemannian perception spaces, and learning of mappings between the perception manifold and the conﬁguration space manifold awaits along the way, we believe that introducing such geometric representations in robot perceptionaction loops may pave the way towards the generation of eﬃcient and explainable robot motions. This may complement and improve upon recent end-to-end deep learning methods lacking explainability and adaptability. Overall, we believe that Riemannian geometry may be the theory reconciling perception and action systems for robust robot motion generation. Viewing the robot action and perception spaces through the lens of geometry may lead to a uniﬁed Riemannian framework for robot motion learning, control, and adaptation. Such a framework would allow robots to seamlessly associate perception inputs with compatible and adaptable minimum-energy joint coordinations. We believe that this is key to provide robots with robust motion generation mechanisms that build on innovative and explainable perception-action loops. In this sense, Riemannian geometry may prove to be a game changer for deploying robots in our everyday life, as it was for Einstein’s theory of general relativity. Acknowledgements. This work was supported by the Carl Zeiss Foundation through the JuBot project. The authors also thank Leonel Rozo for his useful feedback on this paper and Andre Meixner for his help in rendering Fig. 2.

402

N. Jaquier and T. Asfour

References 1. Asfour, T., et al.: ARMAR-6: a high-performance humanoid for human-robot collaboration in real world scenarios. IEEE Robot. Autom. Mag. 26(4), 108–121 (2019) 2. Beik-Mohammadi, H., Hauberg, S., Arvanitidis, G., Neumann, G., Rozo, L.: Learning Riemannian manifolds for geodesic motion skills. In: Robotics: Science and Systems (R:SS) (2021) 3. Biess, A., Liebermann, D.G., Flash, T.: A computational model for redundant human three-dimensional pointing movements: integration of independent spatial and temporal motor plans simpliﬁes movement dynamics. J. Neurosci. 27(48), 13045–13064 (2007) 4. Bullo, F., Lewis, A.D.: Geometric Control of Mechanical Systems. TAM, vol. 49. Springer, New York (2005). https://doi.org/10.1007/978-1-4899-7276-7 5. do Carmo, M.: Riemannian Geometry. Birkh¨ auser, Basel (1992) 6. Cheng, C.A., et al.: RMPﬂow: a geometric framework for generation of multi-task motion policies. Trans. Autom. Sci. Eng. 18(3), 968–987 (2021) 7. d’Avella, A., Saltiel, P., Bizzi, E.: Combinations of muscle synergies in the construction of a natural motor behavior. Nat. Neurosci. 6(3), 300–308 (2003) 8. Fletcher, P.T., Lu, C., Pizer, S.M., Joshi, S.: Principal geodesic analysis for the study of nonlinear statistics of shape. IEEE Trans. Med. Imaging 23(8), 995–1005 (2004) 9. Gabiccini, M., Bicchi, A., Prattichizzo, D., Malvezzi, M.: On the role of hand synergies in the optimal choice of grasping forces. Auton. Robots 31, 235–252 (2011) 10. Gu, X., Ballard, D.H.: Motor synergies for coordinated movements in humanoids. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3462–3467 (2006) 11. Hauser, H., Neumann, G., Ijspeert, A.J., Maass, W.: Biologically inspired kinematic synergies provide a new paradigm for balance control of humanoid robots. In: IEEE/RAS International Conference on Humanoid Robots (Humanoids), pp. 73– 80 (2007) 12. Jaquier, N., Rozo, L., Caldwell, D.G., Calinon, S.: Geometry-aware manipulability learning, tracking and transfer. Int. J. Robot. Res. 20(2–3), 624–650 (2021) 13. Jaquier N. Borovitskiy, V., Smolensky, A., Terenin, A., Asfour, T., Rozo, L.: Geometry-aware Bayesian optimization in robotics using Riemannian Mat´ern kernels. In: Conference on Robot Learning (CoRL) (2021) 14. Klein, H., Jaquier, N., Meixner, A., Asfour, T.: A Riemannian take on human motion analysis and retargeting. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2022) 15. Lee, J.M.: Introduction to Riemannian Manifolds. Springer, Cham (2018) 16. Neilson, P.D., Neilson, M.D., Bye, R.T.: A Riemannian geometry theory of human movement: the geodesic synergy hypothesis. Hum. Mov. Sci. 44, 42–72 (2015) 17. Neilson, P.D., Neilson, M.D., Bye, R.T.: A Riemannian geometry theory of synergy selection for visually-guided movement. Vision 5(2), 26 (2021) 18. Park, F.C.: Optimal robot design and diﬀerential geometry. J. Mech. Des. 117(B), 87–92 (1995) 19. Pellionisz, A., Llin´ as, R.: Tensor network theory of the metaorganization of functional geometries in the central nervous system. Neuroscience 16(2), 245–273 (1985)

Riemannian Geometry as a Unifying Theory for Robot Motion

403

20. Ratliﬀ, N.D., Issac, J., Kappler, D.: Riemannian motion policies (2018). arXiv preprint 1801.02854. https://arxiv.org/abs/1801.02854 21. Romero, J., Feix, T., Ek, C.K., Kjellstr¨ om, H., Kragic, D.: Extracting postural synergies for robotic grasping. IEEE Trans. Robot. 29, 1342–1352 (2013) 22. Santello, M., Flanders, M., Soechting, J.F.: Postural hand synergies for tool use. J. Neurosci. 18, 10105–10115 (1998) 23. Selig, J.M.: Geometric Fundamentals of Robotics. Monographs in Computer Science. Springer, Cham (2005) 24. Sommer, S., Lauze, F., Nielsen, M.: Optimization over geodesics for exact principal geodesic analysis. Adv. Comput. Math. 40(2), 283–313 (2014) 25. Starke, J., Eichmann, C., Ottenhaus, S., Asfour, T.: Human-inspired representation of object-speciﬁc grasps for anthropomorphic hands. Int. J. Humanoid Rob. 17(2), 2050008 (2020)

Ensured Continuous Surveillance Despite Sensor Transition Using Control Barrier Functions Luis Guerrero-Bonilla1(B) , Carlos Nieto-Granda2 , and Magnus Egerstedt3 1

3

School of Engineering and Sciences, Tecnologico de Monterrey, Monterrey, NL, Mexico [email protected] 2 U.S. Army Combat Capabilities Development Command Army Research Laboratory (ARL), Adelphi, MD, USA [email protected] Samueli School of Engineering, University of California, Irvine, Irvine, CA, USA [email protected] Abstract. A control strategy to ensure the continuous surveillance of an intruder that moves from an open region to a cluttered region is presented. Long-range sensors can provide the intruder’s position while in the open region. The eﬀectiveness of these sensors can be compromised in cluttered regions due to obstructions such as buildings and vegetation. The proposed controller guarantees that the intruder is within the sensing range of the on-board sensor of a surveillance mobile robot as the intruder crosses into the cluttered region, ensuring the continuous tracking of the intruder. The controller considers the bounds on the speed of the surveillance robot and its on-board sensor. The eﬀectiveness of the proposed solution is showcased in simulations.

Keywords: Control barrier functions Tracking · Sensor · Mobile robots

1

· Surveillance · Defense ·

Introduction

Detection and tracking of objects are widely used in robotics, as applications such as traﬃc control, surveillance, and behavioral recognition, require information about the motion of objects in the environment. In controlled environments such as laboratories and warehouses, motion capture systems have been used to obtain and keep track of the pose and estimate the velocities of objects of interest [9,13,16]. In outdoor scenarios, the use of GPS technology can provide position information, while Radar and LiDAR can be used when GPS is not available (see [8] and the references therein). There is a vast literature on the use of video cameras together with object tracking and classiﬁcation techniques [6,11,12], as well as computer vision algorithms to ﬁnd the pose and command the motion of a robot, such as visual odometry, pose estimation and visual servo control [2–4,7,10,15]. c The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 A. Billard et al. (Eds.): ISRR 2022, SPAR 27, pp. 404–419, 2023. https://doi.org/10.1007/978-3-031-25555-7_28

Ensured Continuous Surveillance

405

Consider the problem of keeping track of an intruder that moves from an open region without obstacles or obstructions, into a cluttered region such as a city or a forest. In the open region, long-range sensors such as Radar or LiDAR technology can be used to track the intruder. These sensors that can operate in the open region will be refereed to in this paper as “global”. In the cluttered region, obstructions such as buildings and vegetation might hinder the performance of the global sensors. Within this cluttered region, a defense and surveillance mobile robot might be able to navigate through buildings and other obstacles across smaller obstacle free sub-regions, and could be deployed to keep track of the intruder using an on-board sensor such as a camera. An example is shown in Fig. 1. This on-board sensor, which is attached to the surveillance robot and has limited or short range, will be referred to in this paper as a “local” sensor. In order to eﬀectively use global and local sensors to keep track of the intruder in their corresponding sub-regions, the intruder must be ensured to be within the range of the local sensor as it crosses into the target tracking region overseen by the surveillance robot within the cluttered region, where the global sensor is no longer reliable. This way, continuity in the localization of the intruder is ensured, and the local sensor can take over the duty of tracking the intruder. In this paper, a control strategy for the motion of a surveillance robot and its local sensor is presented. The control strategy ensures the intruder is within the range of the local sensor as it crosses from the open region into the region overseen by the surveillance robot within the cluttered region. It uses the position of the intruder provided by the global sensor to move the surveillance robot and its on-board sensor and ensure a continuous transition to the use of the local onboard sensor. To avoid unnecessary use of energy, the controller does not track the intruder at all times when it is in the open region, but only actuates the sensor when required to ensure the continuous tracking as the intruder moves into the cluttered region.

Fig. 1. a) While the intruder, depicted as a red circle, is outside of the city region, a global sensor is used to obtain its position. b) Inside the city region, buildings and other structures might obstruct the global sensor. Surveillance robots within this region can use their local sensors to track the intruder. c) Seen from above, the target tracking region assigned to the surveillance robot is marked with a solid red line. Within this reduced region, the robot can move freely. Note that this region could be selected in multiple ways and other regions can be chosen, such as the ones marked with dotted red lines.

406

2

L. Guerrero-Bonilla et al.

Problem Formulation

Let xP (t) denote the position of a point P on R3 , and let the world frame, henceforth written as W -frame, be described by the orthogonal triad of constant ˆ W , yˆW , zˆW }. A super-index denotes the frame in which a position unit vectors {x vector is written, so that the position of a point P in the W -frame is denoted as xW P (t). Let xA (t) denote the position of the intruder at time t. It is assumed that xA (t) is continuously diﬀerentiable and can be measured at every instant t, and that the velocity x˙ A (t) is bounded by the constant vA > 0 such that x˙ A (t) ≤ vA , but the functions of time xA (t) and x˙ A (t) are unknown. It is T W W W also assumed that, in the W -frame where xW A (t) = xA (t) , yA (t) , zA (t) , W (t) ≥ zmin > 0, so that the intruder is assumed to always be at a distance zA ˆ W yˆW -plane. These assumptions amount to greater or equal than zmin over the x having an intruder with bounded acceleration, having sensors that can provide the position of the intruder accurately, and ensuring that the physical body of the surveillance robot does not obstruct the sensor. Let xD (t) be the position and θ (t) the orientation with respect to the world’s ˆ W -axis of the surveillance robot. The surveillance robot is modeled with twox ˆ W yˆW -plane, given by wheeled diﬀerential drive dynamics constrained to the x ⎤ ⎡ ⎡ ⎤ ux (t) x˙ D (t) ⎢ y˙ D (t) ⎥ ⎢ ⎥ uy (t) x˙ D (t) ⎥ ⎢ ⎥ (1) =⎢ ⎣ z˙D (t) ⎦ = ⎣ ⎦ 0 θ˙ (t) − sin(θ(t))u (t)+cos(θ(t))u (t) x y θ˙ (t) d where d is the oﬀset distance from the point at the center of the line segment joining the center of the wheels to the origin of the robot frame, the D-frame, ˆ D (t) , yˆD (t) , zˆD (t)} and is which has an orthogonal triad of unit vectors {x centered at xD (t), as shown in Fig. 2. The velocity of the surveillance robot is assumed to be bounded by the constant vD > 0 so that x˙ D (t) ≤ vD with 0 < vD ≤ vA . It is also assumed that the range of the global sensor is large but ﬁnite, and that the region where the intruder is of interest for the defender to take action is such that xA (t) − xD (t) ≤ C, for some real large constant C > 0. The surveillance robot is equipped with an on-board sensor to keep track of the position of the intruder. This sensor has a base attached to the robot whose coordinate frame, the β-frame, is composed of the orthogonal triad of ˆ β (t) , yˆβ (t) , zˆβ (t)}. The β-frame shares the same origin as the unit vectors {x D-frame and can rotate around the axis zˆβ (t) = zˆD (t), making an angle β (t) ˆ D (t) axes. It is shown in Sect. 5 that the addition of this ˆ β (t) and x between the x rotation capability lowers the required maximum speed of the sensor’s hardware and simpliﬁes the controller design. The sensor has a tilt motion, for which the ˆ φ (t) , yˆφ (t) , zˆφ (t)} and origin φ-frame with orthogonal triad of unit vectors {x at xD (t) is deﬁned. This tilt motion corresponds to a rotation along the axis ˆ φ (t) and x ˆ β (t) axes. Finally, yˆφ (t) = yˆβ (t), making an angle φ (t) between the x there is an additional pan motion for which the η-frame with orthogonal triad

Ensured Continuous Surveillance

407

Fig. 2. W and D frames. The zˆW and zˆD (t) axes point outside of the page. The dynamics of the surveillance robot correspond to Eq. (1).

ˆ η (t) , yˆη (t) , zˆη (t)} and origin at xD (t) is deﬁned; the motion of unit vectors {x corresponds to a rotation along the axis zˆη (t) = zˆφ (t), making an angle η (t) ˆ φ (t) axes. The sensor is oriented in the direction of ˆ η (t) and x between the x η ˆ (t). Figure 3a) shows the frames and angles. x

Fig. 3. a) Orientation of the sensor on the surveillance robot. The sensor is oriented ˆ η (t). b) A representation of the target tracking region, the defense in the direction of x robot, the axes representing the relative orientation of the intruder with respect to the defender, the intruder, and the intruder’s projection onto the plane.

The angular motion of the sensor is assumed to have ﬁrst order dynamics described by ⎤ ⎡ ⎤ ⎡ β˙ (t) uβ (t) ⎣φ˙ (t)⎦ = ⎣uφ (t)⎦ , (2) uη (t) η˙ (t) with angular speeds bounded by real positive constants φ˙ max and η˙ max such that |φ˙ (t) | ≤ φ˙ max and |η˙ (t) | ≤ η˙ max , which accommodate for the hardware’s speed limits.

408

L. Guerrero-Bonilla et al.

Reference frames describing the orientation of the intruder with respect to the β-frame of the surveillance robot are deﬁned as follows. The φAD -frame ˆ φAD (t) , yˆφAD (t) , zˆφAD (t)} and origin with orthogonal triad of unit vectors {x at xD (t) is deﬁned through a rotation along the yˆφAD (t) = yˆβ (t) axis by an angle φAD (t). Similarly, the η AD -frame with orthogonal triad of unit vectors ˆ ηAD (t) , yˆηAD (t) , zˆηAD (t)} and origin at xD (t) is deﬁned through a rotation {x along the zˆηAD (t) = zˆφAD (t) axis by an angle η AD (t). The intruder lies in the ˆ ηAD (t). In Sect. 5, it is shown that φAD (t) and ηAD (t) can be direction of x obtained as a function of only xD (t), xA (t), θ (t) and β (t). Figure 3b) shows the axes and angles for the relative orientation of the intruder. The obstacle free region of the surveillance robot within the cluttered region ˆ W yˆW -plane where local sensors are used is deﬁned as the target tracking on the x region T ⊂ R2 , and is described by the closed and bounded set of points of a rectangular region with ﬁxed vertices at xV i for i ∈ {1, 2, 3, 4} numbered in counter-clockwise order. To avoid using modulo arithmetic and for simplicity of the notation, let xV 5 = xV 1 . Each of the rectangle’s sides is a boundary of T . The vector starting at one vertex of the ith boundary and ending in an adjacent one is given by bi = xV i+1 − xV i for i ∈ {1, 2, 3, 4} with xV 5 = xV 1 . The length Li of the ith side of the rectangle is given by Li = bi , with L1 = L3 and L2 = L4 . In [5], a general formulation with rotated target regions is given; for simplicity, in this paper we will consider rectangular target tracking regions aligned with the W -frame, so that the unit vectors joining consecutive vertices ˆ2 = yˆW , b ˆ3 = −x ˆ4 = −yˆW . Denoting the ˆ1 = x ˆW , b ˆ W , and b are given by b ˆ ˆ ˆi+2 and b ˆ⊥i = b ˆi+1 for ˆ perpendicular vector to bi as b⊥i , we have bi = −b ˆ ˆ ˆ ˆ i ∈ {1, 2, 3, 4}, with b5 = b1 and b6 = b2 . Figure 3b) shows an example of a target tracking region T , which can be expressed as ˆi+1 ≥ 0, ∀i ∈ {1, 2, 3, 4}}. T = {xP : (xP − xV i ) · b

(3)

In this paper, we focus on the simple case of one target tracking region for one surveillance robot, and assume that the position of the intruder can be obtained by a global sensor whenever its projection onto the plane is outside of the target tracking region. The addition of more surveillance robots with their corresponding target tracking regions and the interaction among them in a more complex cluttered region will be studied in a follow-up paper. 2.1

Control Objectives

It is assumed that a global sensor can provide the position of the intruder when ˆ W yˆW -plane, denoted as p (xA (t)), is not the projection of its position onto the x in the target tracking region, that is, p (xA (t)) ∈ T c where T c is the complement of T . When p (xA (t)) ∈ T , the surveillance robot relies on its local sensor to acquire the position of the intruder. The intruder is said to be within the sensing range of the sensor if the orientation errors φ (t) − φAD (t) and η (t) − ηAD (t) are bounded by real positive constants Δφ and Δη such that |φ (t) − φAD (t) | ≤ Δφ and |η (t) − ηAD (t) | ≤ Δη. In order to ensure that the position of the

Ensured Continuous Surveillance

409

intruder can be acquired without discontinuities, the intruder must be within the sensing range by the time it reaches the boundary of the target tracking region. Furthermore, it is also desired for the distance between the surveillance robot and the projection of the intruder to be bounded once the intruder enters the target tracking region, such that p (xA (t)) − xD (t) ≤ R for some positive constant R. The control objectives are formalized as follows: Given a target tracking region T on the plane according to (3); the continuously diﬀerentiable position xA (t) of the intruder with initial position xA (0) ∈ T c , velocity bounded by x˙ A (t) ≤ vA , vA > 0, and zA (t) ≥ zmin > 0; the position xD (t) and orientation θ (t) of the surveillance robot with dynamics given by (1), initial position xD (0) ∈ T , and velocity bounded by x˙ D (t) ≤ vD , vD > 0; and a sensor with dynamics given by (2) and angular speeds bounded by |φ˙ (t) | ≤ φ˙ max and |η˙ (t) | ≤ η˙ max where φ˙ max , and η˙ max are real constants; determine the real positive constant R and design the control law u (t) = T T [ ux (t),uy (t),uβ (t),uφ (t),uη (t) ] bounded by [ ux (t),uy (t) ] ≤ vD , |uφ (t) | ≤ φ˙ max and |uη (t) | ≤ η˙ max such that the following conditions are satisﬁed: Control Objective 1. p (xA (t)) − xD (t) ≤ R whenever p (xA (t)) ∈ T . Control Objective 2. The orientation errors φ (t)−φAD (t) and η (t)−ηAD (t) should satisfy |φ (t) − φAD (t) | ≤ Δφ and |η (t) − ηAD (t) | ≤ Δη whenever p (xA (t)) ∈ T . The control actions ux (t) and uy (t) can be selected based on the results from [5], and will be brieﬂy described in Sect. 3. The main contribution of this paper is on the selection of the control actions uβ (t), uφ (t) and uη (t), which is described in Sect. 5. In the rest of the paper, the explicit dependency on time and other variables is omitted when not indispensable.

3

Selection of ux and uy to Solve Control Objective 1

The work in [5] deﬁnes a rectangular region on the plane S of length s1 + s3 and width s2 + s4 surrounding the surveillance robot, with si ≥ 0 measured from the position xD in the direction of the ith side of the target tracking region. Then, it presents control laws that√ensure p (xA ) ∈ S whenever p (xA ) ∈ T . For a square region S, selecting R = 2si ensures that S ⊂ {xP : xP − xD ≤ R}, and therefore, the same control laws can be used to ensure p (xA ) − xD (t) ≤ R whenever p (xA ) ∈ T as required by the Control Objective 1. Following the results in [5], let TALi (xA , xV i ) = ⎧ 2 2 ((p(x A )−x V i )·bˆi+1 ) +((p(x A )−x V i )·bˆi −Li ) +2 − ⎪ ˆi ≥ Li ⎪ for (p (xA ) − xV i ) · b ⎪ vA ⎪ ⎪ ⎨ 2 ((p(x A )−x V i )·bˆi+1 ) +2 − ˆi < Li , for 0 < (p (xA ) − xV i ) · b ⎪ v A ⎪ ⎪ ⎪ ⎪ ⎩ ((p(x A )−x V i )·bˆi+1 )2 +((p(x A )−x V i )·bˆi )2 +2 − ˆi ≤ 0 for (p (xA ) − xV i ) · b vA

410

L. Guerrero-Bonilla et al.

ˆi+1 − √1 R (x D (t)−x V i )·b 2 √ , vD / 2 4 set Cxy = i=1 {(xD , xA )

TDLi (xD (t) , xV i ) = i ∈ {1, 2, 3, 4}. Let the Cxy , R satisﬁes

R > max

i∈{1,2}

and let hLi = TALi − TDLi for : hLi ≥ 0}. If (xD (0) , xA (0)) ∈

⎛ ⎞ 2 Li vD ⎝ Li √ − + 2 − ⎠ , 2 2 vA

(4)

and the constants γ1 , γ2 , γ3 and γ4 satisfy, for j ∈ {1, 2}, γj = γj+2 = ⎞−1 ⎛ ⎝

Lj+1 2

vA

2

+2 −

−

Lj+1 2

R −√ √ 2 vD / 2

⎠

, then by Theorem 3 in [5] the control law

v vD D ux = max √ (1 − γ2 hL2 ) , 0 + min 0, − √ (1 − γ4 hL4 ) 2 2 v vD D uy = max √ (1 − γ3 hL3 ) , 0 + min 0, − √ (1 − γ1 hL1 ) 2 2

(5) (6)

solves the Control Objective 1. The reader is referred to [5] for the proofs corresponding to the general case.

4

Suﬃcient Conditions to Solve Control Objective 2

In this section, suﬃcient conditions to solve Control Objective 2 are described. Let T c be partitioned into regions Ri and Ri,i+1 such that Ri = {xP : (xP − xV i ) · ˆi+1 < 0, 0 ≤ (xP − xV i ) · b ˆi ≤ Li } and Ri,i+1 = {xP : (xP − xV i ) · b ˆi+1 < b ˆi > Li } for i ∈ {1, 2, 3, 4} with R4,5 = R4,1 , and R5,6 = R1,2 . 0, xP : (xP − xV i )· b Figure 4 shows an image of the regions Ri and Ri,i+1 . Let the continuously diﬀerentiable function TA (xA ) be deﬁned as

Fig. 4. Partition of T c .

Ensured Continuous Surveillance

411

⎧ 2 2 ˆ ⎪ ⎪ ((p(x A )−xˆ V i )·b i+1 ) + − for p (xA ) ∈ Ri ⎪ v ⎪ A ⎨ 2 2 TA (xA ) = , ((p(x A )−xˆ V i )·bˆi+1 ) +((p(x A )−xˆ V i+1 )·bˆi+2 ) +2 − ⎪ for p (xA ) ∈ Ri,i+1 ⎪ vA ⎪ ⎪ ⎩ 0 for p (xA ) ∈ T

(7) ) = for i ∈ {1, 2, 3, 4}. The continuously diﬀerentiable under-approximation f (x P xP 2 + 2 − ≤ xP is used, selecting a small value of such that 0 < < 1. Let the functions hφ+ (xA , φ, φAD ), hφ− (xA , φ, φAD ), hη+ (xA , η, ηAD ) and hη− (xA , η, ηAD ) be deﬁned as follows: hφ+ = TA −

φ − φAD − Δφ ωφ

(8)

hφ− = TA −

φAD − φ − Δφ ωφ

(9)

hη+ = TA −

η − ηAD − Δη ωη

(10)

hη− = TA −

ηAD − η − Δη ωη

(11)

+vD where φ˙ max and η˙ max are such that ωφ = φ˙ max − vAzmin > 0 and ωη = η˙ max − vA +vD > 0. The following theorem provides suﬃcient conditions to ensure that zmin the intruder is detected at the boundary and within the target tracking region.

Theorem 1. If hφ+ and hφ− satisfy hφ+ ≥ 0 and hφ− ≥ 0, then |φ−φAD | ≤ Δφ whenever p (xA ) ∈ T . Similarly, if hη+ and hη− satisfy hη+ ≥ 0 and hη− ≥ 0, then |η − ηAD | ≤ Δη whenever p (xA ) ∈ T . Proof. Leaving the term φ − φAD on one side of the inequalities hφ+ ≥ 0 and hφ− ≥ 0 leads to φ−φAD ≤ ωφ TA +Δφ and φ−φAD ≥ −ωφ TA −Δφ respectively. If p (xA ) ∈ T then, by (7), TA = 0, and the equations become φ − φAD ≤ Δφ and φ − φAD ≥ −Δφ. These two last equations together imply that, if hφ+ ≥ 0 and hφ− ≥ 0, then |φ − φAD | ≤ Δφ whenever p (xA ) ∈ T . The same procedure

with hη+ and hη− leads to the corresponding conclusion.

5

Control Laws to Solve Control Objective 2

To ensure that the conditions of Theorem 1 are met, a control law for the sensor dynamics (2) is developed using results on Control Barrier Functions. A brief introduction is given next, but the reader is referred to [1]. Consider the set C deﬁned as the superlevel set of a continuously diﬀerentiable function h : D ⊂ Rn → R such that C = {x ∈ D ⊂ Rn : h (x) ≥ 0}, with its

412

L. Guerrero-Bonilla et al.

boundary given by ∂C = {x ∈ D ⊂ Rn : h (x) = 0}, and consider a system of the form x˙ = f (x) + g (x) u (12) where x ∈ Rn , u ∈ U ⊂ Rm , with f and g locally Lipschitz continuous. For any initial condition x (0), there exists a maximum time interval I (x (0)) = [0,τmax ) such that x (t) is the unique solution to (12) on I (x (0)). In the case when (12) is forward complete, τmax = ∞. The set C is called forward invariant if for every x (0) ∈ C, x (t) ∈ C for all t ∈ I (x (0)). Definition 1 (Definition 2, [1]). Let C ⊂ D ⊂ Rn be the superlevel set of a continuously diﬀerentiable function h : D → R, then h is a Control Barrier Function (CBF) if there exists an extended class K∞ function α such that for the control system (12), supu ∈U [Lf h (x) + Lg h (x) u] ≥ −α (h (x)) , ∀x ∈ D. ∂h(x) The Lie derivative notation is used, so that h˙ (x) = ∂h(x) ∂x f (x) + ∂x g (x) u = Lf h (x) + Lg h (x) u. Given a CBF h, deﬁne the set K (x) = {u ∈ U : Lf h (x) + Lg h (x) u + α (h (x)) ≥ 0} for each x ∈ D.

Theorem 2 (From Theorem 2, [1]). Let C ⊂ Rn be a set deﬁned as the superlevel set of a continuously diﬀerentiable function h : D → R. If h is a ∂h (x) = 0 for all x ∈ ∂C, then any Lipschitz control barrier function on D and ∂x continuous controller u (x) ∈ K (x) for the system (12) renders the set C forward invariant. These results can be applied to time varying systems, as described in [14]. T Let the state x ∈ R10 be deﬁned as x = xD θ β φ η xA , with dynamics ⎡

ux uy 0

⎤

⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ − sin(θ(t))ux (t)+cos(θ(t))uy (t) ⎥ ⎢ ⎥ 07×1 ⎥ d +⎢ x˙ = ⎢ ⎥ x˙ A (t) uβ ⎢ ⎥ ⎢ ⎥ uφ ⎢ ⎥ ⎣ ⎦ uη 03×3 ,

(13)

where ux and uy correspond to (5) and (6). Let hφ+ , hφ− , hη+ and hη− be CBF candidates, and let the sets Cφ and Cη be deﬁned as Cφ = {x : hφ+ ≥ 0} {x : hφ− ≥ 0}, (14) Cη = {x : hη+ ≥ 0}

{x : hη− ≥ 0}.

(15)

The corresponding time derivatives are given by ˙ φ˙ φAD h˙ φ+ = T˙A − + ωφ ωφ

(16)

Ensured Continuous Surveillance

413

˙ φAD φ˙ + h˙ φ− = T˙A − ωφ ωφ

(17)

η˙ ηAD ˙ + h˙ η+ = T˙A − ωη ωη

(18)

ηAD ˙ η˙ + h˙ η− = T˙A − ωη ωη

(19)

˙ In order to verify that hφ+ , hφ− , hη+ and hη− are CBFs, bounds on T˙A , φAD and ηAD ˙ are ﬁrst developed. 5.1

Bounds on the Time Derivatives, and a Selection for uβ (t)

T˙A is given by T˙A (xA ) = ⎧ ˆi+1 )b ˆi+1 ·p(x˙ A ) ˆ V i )·b ((p(x A )−x ⎪ for p (xA ) ∈ Ri ⎪ 2 ⎪ 2 ˆ ⎪ ˆ ⎪ ⎨ vA ((p(x A )−x V i )·b i+1 ) + ˆi+1 )b ˆi+2 )b ˆi+1 +((p(x A )−x ˆi+2 )·p(x˙ A ) ˆ V i )·b ˆ V i+1 )·b (((p(x A )−x for p (xA ) ∈ Ri,i+1 , ⎪ 2 ⎪ ˆ ˆ ⎪ vA ((p(x A )−xˆ V i )·b i+1 ) +((p(x A )−xˆ V i+1 )·b i+2 )2 +2 ⎪ ⎪ ⎩ 0 for p (xA ) ∈ T Since any vector v ∈ Rn satisﬁes √

v v 2 +2

˙ A ) ≤ x˙ A ≤ vA , < 1 and p (x

then |T˙A | < 1 ∀xA , so that T˙A can be bounded from below by T˙A > −1 between the W and for all xA . Let the xAD = xA − x D . The transformation cos(θ+β) sin(θ+β) 0 β β the β frames is given by xP = − sin(θ+β) cos(θ+β) 0 xW P , leading to xAD = 0

0

1

β W W W xW AD cos (θ + β) + yAD sin (θ + β), yAD = −xAD sin (θ + β) + yAD cos (θ + β) and β W W W ≥ zmin > 0, zAD = zAD . From this transformation and the fact that zAD = zA the angle φAD can be calculated using the diﬀerentiable function ⎧ β zAD ⎪ ⎪ for xβAD > 0 arctan ⎪ xβ ⎨ AD φAD = π2 (20) for xβAD = 0 , β ⎪ ⎪ zAD β ⎪ ⎩π + arctan for x