Bin Fang · Fuchun Sun · Huaping Liu · Chunfang Liu · Di Guo
Wearable Technology for Robotic Manipulation and Learning
Bin Fang Department of Computer Science and Technology Tsinghua University Beijing, Beijing, China
Fuchun Sun Department of Computer Science and Technology Tsinghua University Beijing, Beijing, China
Huaping Liu Department of Computer Science and Technology Tsinghua University Beijing, Beijing, China
Chunfang Liu Beijing University of Technology Beijing, Beijing, China
Di Guo Department of Computer Science and Technology Tsinghua University Beijing, Beijing, China
ISBN 978-981-15-5123-9    ISBN 978-981-15-5124-6 (eBook)
https://doi.org/10.1007/978-981-15-5124-6

© Springer Nature Singapore Pte Ltd. 2020

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Foreword
Through years of study, much progress has been made in robotic learning and manipulation. However, smart, small, soft robotic manipulators remain a grand challenge, especially with robust learning capabilities across various sensing, learning, and manipulation modes. In recent years, wearable devices that are capable of capturing and delivering information in a smart and timely manner have attracted much research interest. The integration of wearable devices with robots is beginning to demonstrate great promise in industrial, medical, and service robots and is leading to a new avenue for next-generation smart robots. The field clearly needs a monograph summarizing recent progress and advances. I am pleased to see this book edited by Drs. Bin Fang, Fuchun Sun, and Huaping Liu, which offers promising solutions for motion capture using wearable devices. The book includes several novel wearable devices and wearable computation methods that can capture and recognize gestures in multiple scenarios. The applications demonstrate a comprehensive wearable solution for robotic manipulation and learning. The work opens a new avenue for robotic manipulation and learning in unstructured and complex environments. This book clearly shows that wearable devices play a critical role in robotic manipulation and learning. To the best of my knowledge, it is the first book on robotic manipulation and learning using wearable devices. I believe it will bring significant insights and have a broad impact on research and education in wearable technologies for robotics.

Ohio State University, Columbus, OH, USA
April 2020
Zhang Mingjun
Preface
Wearable devices have served in various application scenarios such as interaction, healthcare, and robot learning. Because of their advantages in capturing information about human manipulation and providing high-quality demonstrations for robotic imitation, wearable devices are applied to help robots acquire manipulation skills. However, robotic manipulation learning with wearable devices faces great challenges. Wearable devices often need multiple sensors to capture human operation information, which raises many issues such as wearable device design, sensor calibration, wearing calibration, the fusion of information from different sensors, and so on. In addition, none of the existing work exploits the complete motion of not only the hands but also the arms. Meanwhile, demonstration datasets are difficult to reuse because of the different settings (such as in sensing and mechanics) of the teachers and the learners. Furthermore, learning strategies with minimal parameter tuning and short learning times that require few training examples are desirable. To address the above-mentioned challenges, we developed a novel wearable device that captures richer gesture information, built manipulation demonstrations with it, and introduced robot learning technology to improve manipulation performance. This book is divided into four parts. The first part presents the research background and motivation and introduces the development of wearable technologies and the applications of wearable devices. The second part focuses on wearable technologies. In Chap. 2, wearable sensors including inertial sensors and tactile sensors are presented. In Chap. 3, multisensor fusion methods are developed to achieve accurate motion capture with wearable devices. Chapter 4 demonstrates their applications, including gesture recognition, tactile interaction, and tactile perception. The third part presents methods of robotic manipulation learning. Chapter 5 tackles the problem of manipulation learning from the teleoperation demonstration of a wearable device using dynamic movement primitives. Chapter 6 addresses the problem of manipulation learning from visual-based teleoperation demonstration by developing a deep neural network methodology. Chapter 7 focuses on learning from a wearable-based indirect demonstration. The fourth part contains Chap. 8, which summarizes this book and presents some prospects. For a clear illustration, an outline of the logical dependency among chapters is shown in Fig. 1.
Fig. 1 Organization of the book: logical dependency among parts and chapters
Please note that we have tried our best to make each chapter self-contained. This book reviews the current research status and applications of wearable devices in order to summarize the design methods currently used, analyze their strengths and weaknesses, and outline future research trends. The remainder of the book is organized as follows: Part II describes the wearable devices. Part III reviews robotic manipulation and learning. Conclusions and future research trends are given in Part IV. This book is suitable as a reference for graduate students with a basic knowledge of machine learning, as well as for professional researchers interested in robotic tactile perception and understanding and in machine learning.

Beijing, China
May 2020
Bin Fang Fuchun Sun Huaping Liu Chunfang Liu Di Guo
Acknowledgements
This book draws on our research work at the Department of Computer Science and Technology, Institute for Artificial Intelligence, Tsinghua University, and the Beijing National Research Center for Information Science and Technology, China. Five years ago, we started looking into the challenging field of wearable technology. Guodong Yao helped develop the first generation of wearable devices under very difficult conditions. With him, we launched the research work and published a series of results. Meanwhile, one of our visiting students, Qin Lv, built the multi-modal dataset of gestures. We would like to sincerely thank our visiting student Xiao Wei. With him, we were able to explore the idea of using wearable demonstration for robotic manipulation and learning. The visiting students Shuang Li and Xiao Ma carried out the research work on the learning methodology of visual-based teleoperation demonstration. This joint work established a good foundation for robotic manipulation learning. We would like to thank everyone who has participated for their support, dedication, and cooperation. We would like to express our sincere gratitude to Mingjie Dong, Ziwei Xia, Quan Zhou, Xiaojian Ma, Zhudong Huang, and Danling Lu, who provided immense help with the preparation of figures and with the proofreading of the book. We would like to thank our commissioning editors, Lanlan Chang and Wei Zhu, for their great support. A great deal of this research was supported by the Tsinghua University Initiative Scientific Research Program No. 2019Z08QCX15 and the National Natural Science Foundation of China under grant U1613212. This work was also supported partially by the National Natural Science Foundation of China under grants 91420302 and 61703284. Finally, on a personal note (BF), I would like to thank my parents for standing beside me and supporting me throughout my research and the writing of this book. Special thanks to my dear wife Mengyuan Lu and my daughter Shuyi Fang; during the final stage of writing the book, I did not have much time to spend with them, and I am grateful to my parents, wife, and daughter for supporting my research and the writing of this book.
Contents

Part I Background

1 Introduction
   1.1 The Overview of Wearable Devices
      1.1.1 Wrist-Worn
      1.1.2 Head Mounted
      1.1.3 Body Equipped
      1.1.4 Smart Garment
      1.1.5 Smart Shoes
      1.1.6 Data Gloves
   1.2 Sensors of Wearable Device
      1.2.1 Motion Capture Sensor
      1.2.2 Tactile Sensors
      1.2.3 Physiological Parameter Measurement Sensors
   1.3 Wearable Computing Algorithms
      1.3.1 Motion Capture Related Algorithms
      1.3.2 Motion Recognition Related Algorithms
      1.3.3 Comparison of Different Wearable Computing Algorithms
   1.4 Applications
      1.4.1 Interaction
      1.4.2 Healthcare
      1.4.3 Manipulation Learning
   1.5 Summary
   References

Part II Wearable Technology

2 Wearable Sensors
   2.1 Inertial Sensor
      2.1.1 Analysis of Measurement Noise
      2.1.2 Calibration Method
      2.1.3 Experimental Results
   2.2 Tactile Sensor
      2.2.1 Piezo-Resistive Tactile Sensor Array
      2.2.2 Capacitive Sensor Array
      2.2.3 Calibration and Results
   2.3 Summary
   References

3 Wearable Design and Computing
   3.1 Introduction
   3.2 Design
      3.2.1 Inertial and Magnetic Measurement Unit Design
      3.2.2 Wearable Design
   3.3 Motion Capture Algorithm
      3.3.1 Models of Inertial and Magnetic Sensors
      3.3.2 QEKF Algorithm
      3.3.3 Two-Step Optimal Filter
   3.4 Experimental Results
      3.4.1 Orientations Assessment
      3.4.2 Motion Capture Experiments
   3.5 Summary
   References

4 Applications of Developed Wearable Devices
   4.1 Gesture Recognition
      4.1.1 ELM-Based Gestures Recognition
      4.1.2 CNN-Based Sign Language Recognition
      4.1.3 Experimental Results
   4.2 Tactile Interaction
   4.3 Tactile Perception
      4.3.1 Tactile Glove Description
      4.3.2 Visual Modality Representation
      4.3.3 Tactile Modality Representation
      4.3.4 Visual-Tactile Fusion Classification
      4.3.5 Experimental Results
   4.4 Summary
   References

Part III Manipulation Learning from Demonstration

5 Learning from Wearable-Based Teleoperation Demonstration
   5.1 Introduction
   5.2 Teleoperation Demonstration
      5.2.1 Teleoperation Algorithm
      5.2.2 Demonstration
   5.3 Imitation Learning
      5.3.1 Dynamic Movement Primitives
      5.3.2 Imitation Learning Algorithm
   5.4 Experimental Results
      5.4.1 Robotic Teleoperation Demonstration
      5.4.2 Imitation Learning Experiments
      5.4.3 Skill-Primitive Library
   5.5 Summary
   References

6 Learning from Visual-Based Teleoperation Demonstration
   6.1 Introduction
   6.2 Manipulation Learning of Robotic Hand
   6.3 Manipulation Learning of Robotic Arm
   6.4 Experimental Results
      6.4.1 Experimental Results of Robotic Hand
      6.4.2 Experimental Results of Robotic Arm
   6.5 Summary
   References

7 Learning from Wearable-Based Indirect Demonstration
   7.1 Introduction
   7.2 Indirect Wearable Demonstration
   7.3 Learning Algorithm
      7.3.1 Grasp Point Generalization on Incomplete Point Cloud
      7.3.2 Grasp Model Built by "Thumb" Finger
      7.3.3 Wrist Constraints Estimation
   7.4 Experimental Results
      7.4.1 Object Classification Based on Shape Descriptors
      7.4.2 Comparisons of Shape Descriptors for Grasp Region Detection
      7.4.3 Grasp Planning
   7.5 Summary
   References

Part IV Conclusions

8 Conclusions
About the Authors
Bin Fang is an Assistant Researcher at the Department of Computer Science and Technology, Tsinghua University. His main research interests include wearable devices and human–robot interaction. He is a leader guest editor for a number of journals, including Frontiers in Neurorobotics and Frontiers in Robotics and AI, and has served as an associate editor for various journals and conferences, e.g. the International Journal of Advanced Robotic Systems and the IEEE International Conference on Advanced Robotics and Mechatronics.

Fuchun Sun is a Full Professor at the Department of Computer Science and Technology, Tsinghua University. A recipient of the National Science Fund for Distinguished Young Scholars, his main research interests include intelligent control and robotics. He serves as an associate editor for a number of international journals, including IEEE Transactions on Systems, Man, and Cybernetics: Systems, IEEE Transactions on Fuzzy Systems, Mechatronics, and Robotics and Autonomous Systems.

Huaping Liu is an Associate Professor at the Department of Computer Science and Technology, Tsinghua University. His main research interests include robotic perception and learning. He serves as an associate editor for various journals, including IEEE Transactions on Automation Science and Engineering, IEEE Transactions on Industrial Informatics, IEEE Robotics and Automation Letters, Neurocomputing, and Cognitive Computation.

Chunfang Liu is an Assistant Professor at the Department of Artificial Intelligence and Automation, Beijing University of Technology. Her research interests include intelligent robotics and vision.

Di Guo received her Ph.D. degree from the Department of Computer Science and Technology, Tsinghua University, Beijing, in 2017. Her research interests include robotic manipulation and sensor fusion.
Acronyms
3D  3-Dimension
AHRS  Attitude and heading reference system
AMOLED  Active-matrix organic light-emitting diode
ANN  Artificial neural network
AR  Augmented reality
BAN  Body area network
BLE  Bluetooth low energy
BLEEX  Berkeley lower extreme exoskeleton
BN  Batch normalization
BR  Basic rate
BT  Bluetooth
CDC  Capacitance to digital conversion
CMOS  Complementary metal oxide semiconductor
CNN  Convolutional neural network
DARPA  Defense Advanced Research Projects Agency
DIP  Distal interphalangeal point
DL  Deep learning
DMP  Dynamic movement primitive
DOF  Degree of freedom
DTW  Dynamic time warping
EDR  Enhanced data rate
EHPA  Exoskeletons for human performance augmentation
EKF  Extended Kalman Filter
ELM  Extreme learning machine
EMG  Electromyography
ESOQ2  Second estimator of the optimal quaternion
FC  Fully connected layers
FPC  Flexible printed circuit board
FPFH  Fast point feature histograms
GNSS  Global navigation satellite system
GPS  Global positioning system
HMM  Hidden Markov model
HSV  Hue saturation value
HULC  Human Universal Load Carrier
ICP  Iterative closest point
IDC  International Data Corporation
IMMU  Inertial and magnetic measurement unit
IMU  Inertial measurement unit
KF  Kalman Filter
KNN  K-Nearest neighbors
LED  Light-emitting diode
LM  Leap motion
LSTM  Long short-term memory
LTE  Long-term evolution
LTPO  Low-temperature polycrystalline oxide
MCP  Metacarpophalangeal
MCU  Micro control unit
MEMS  Micro electrical mechanical systems
MIMU  Miniature inertial measurement unit
MLE  Maximum likelihood estimation
MRI  Magnetic resonance imaging
MSE  Mean squared error
NASA  National Aeronautics and Space Administration
NFC  Near-field communication
OLED  Organic light-emitting diode
PC  Personal computer
PCA  Principal components analysis
PDMS  Polydimethylsiloxane
PF  Particle filter
PIP  Proximal interphalangeal
PPG  Photoplethysmograph
QEKF  Quaternion extended Kalman Filter
ReLU  Rectified linear unit
RMS  Root mean square
RMSE  Root mean square error
RNN  Recurrent neural network
ROS  Robot operating system
SFKF  Sensor-based fuzzy Kalman filter
SHOT  Signature of Histograms of OrienTations
SLFN  Single-hidden layer feed-forward neural network
SLR  Sign language recognition
SLRNet  Sign Language Recognition Network
SPFH  Simplified point feature histogram
SPI  Serial peripheral interface
SVM  Support vector machine
TF  Transform frame
TMCP  Trapeziometacarpal
UA  Under armor
UMTS  Universal Mobile Telecommunications System
USB  Universal serial bus
VR  Virtual reality
WL  Waveform length
Mathematical Notation
Va a b K ε Vω ω Θ¯ k (T ) ξs σ 2 (T ) Ci M ag va kia bia vg I iO Ξ Uk Γk Ik E Fn D ni Xi
The voltage of acceleration The acceleration The sensor bias The scale factor (or acceleration gain) The sensor’s noise The voltage of angular velocity or weight The true angular velocity The average of the output rate for a cluster which starts from the kth data point and contains n data points A set of random variables The Allan variance of length T The i-th matrix The misalignment matrix The input specific force expressed in platform coordinates The measurement noise The scale factor of i-th accelerometer output, i=x,y,z The bias of the i-th accelerometer output, i=x,y,z The measurement noise The input vector The MIMU’s measurement of i axis, i = x, y, z The experimental procedure or positions The number of n tests, k ∈ [1, n] The input vector The vector at test point k The global region of positions The design information matrix of n test positions The object function based on n test positions The state vector at time i xxi
Zi wi Φ H G U I R Δx Δy L Enl Ehys Erpt w g C bn Hh am hm bm N Ts Qk R K P xn xb x q rij qi T R(q Nb2 ) pN bi h
The system output (measured signal) at time i The process noise at time i The state matrix of the state-space representation The output matrix of the state-space representation The observability matrix of the n-state discrete linear time-invariant system The voltage The current or identity matrix The resistance, the orientation of the distal phalanx with respect to the body, the groundtruth robot arm directional angles The input increment The output increment The full range The nonlinear degree The hysteresis The repeatability The noise which is supposed to be Gaussian with zero-means The gravity vector The orientation cosine matrix The unit vector The measurement of the accelerometer The measurement of the magnetometer The disturbance vector included the magnetic effects and the magnetometers bias The global coordinate or the number of training samples The sampling period of the measurements, the gyro measurement noise vector The process noise covariance matrix The covariance matrix The Kalman gain The posteriori error covariance matrix The vector in the global coordinate The vector in the body coordinate The posteriori state estimate The quaternion of the relative orientations The quaternion of the absolute orientations of the i-th coordinate The transformation The orientation of the distal phalanx with respect to the body The position of the palm The values of magnetometers
A wi vi e φ pN m3 βi gi (x) h(x) xi H† l(Ti , Rj ) d(Ti , Rj ) γ θ∗ xjl xjl−1 Mj kijl wij f d μ Kp T τ y˙ y¨ αz and βz J L ||x||0 ||x||1 ||X||2,1 ||x||2 ||X||F ||x||∞
The rotation matrix that transforms a vector from the inertial frame to the body frame The measured unit vector for the i-th observation as expressed in the body frame The known reference vector for the i-th observation The rotation axis The rotation angle The position of the distal frame expressed in the global frame The output weight vector of the node of the it h hidden layer The hidden nodes of nonlinear piecewise continuous activation functions The output vector of the hidden layer The i-th input The Moore–Penrose generalized inverse of matrix H The cumulative distance The current cell distance The prescribed adjusting parameter The angle of each joint and is the value after normalization The output feature map The input feature map The selected area in the l − 1 layer The weight parameter The weight between the links xi and yj The activation function in neural networks The number of pixels in the image The average value The proportionality coefficient The sampling interval The constant about time The velocity of the joint trajectory The acceleration The gain term The groundtruth joint angles The mean squared error (MSE) loss The number of the non-zero elements in the vector x The sum of the absolute values of all elements in the vector x The sum of the Euclidean norms of all row vectors in the matrix X The Euclidean norm of the vector x The Frobenius norm of the matrix X The maximum values of the absolute values of all elements in the vector x
Lse Lrp Lrg Lphy Θ Rcol Rn Pmn Qnm Xn n X R Θ λ Ra δia d Ps0r Pe0r i Ti+1 A θ PC IP 1{·} Np Cp Ct h p Th Tr Ur OT m d
The human arm keypoint position loss The robot arm posture loss The robot joint angle generation loss The physical loss which enforces the physical constraints and joint limits The robot joint angles The minimum collision free radius between two links The n-dimensional Euclidean space The coordinates of point m in frame n The homogeneous coordinates of point m in n frame The groundtruth coordinates of the n-th keypoint of human arm The estimated coordinates of the n-th keypoint of human arm The estimated robot arm directional angles The estimated robot joint angles of robot arms The balance weight The rotation matrix around axis a The angle of rotation around axis a The distance The shoulder point of Baxter arm The elbow point of Baxter arm The transform matrix The directional angle set The arm joint angle The change in a figure or amount The point cloud set of the object The interest point’s set The indicator function which is equal to 1 if its argument is true and 0 otherwise The number of neighboring points The number of filtered candidate grasp points The threshold The position of the contact point The orientation of the human thumb tip in Euler angle The rotation angle The homogenous equation The homogeneous transformation matrix The weight The approaching direction
Part I
Background
This part of the book provides a survey of wearable technologies. Chapter 1 serves as the background for the whole book by reviewing the development and applications of wearable devices.
Chapter 1
Introduction
Abstract Wearable sensing devices are smart electronic devices that can be worn on the body as implants or accessories. With the rapid development of manufacturing and sensor technologies, they have attracted worldwide interest and are widely applied in many practical applications. In this chapter, we provide an overview of research on wearable technologies and of their future research trends. Wearable devices, sensor technologies, wearable computing algorithms, and their applications are presented in sequence.
1.1 The Overview of Wearable Devices
Nowadays, human–computer interaction is an essential part of most people's daily life. It refers to the process of collecting relevant data from humans, sending the data to a computer for further analysis, and giving corresponding feedback. Traditional human–computer interaction modes range from the original keyboard to the current mouse, joystick, and wireless input devices. They have greatly facilitated the interaction between people and computers and made it easier for people to operate computers and improve work efficiency [1]. However, this kind of interaction cannot completely meet the demands of human–computer interaction because it depends on additional input hardware. To solve this problem, wearable devices are considered, especially those worn on the hands and arms. From the hands and arms, a variety of information, such as a human's physiological status and physical movements, can be detected and transmitted through the devices worn there. A useful example is hand gestures, which can be defined as the variety of gestures or movements produced by the hands and arms combined. Hand gestures express human intention well and therefore act as a means of natural communication between human and machine. Besides, wearable devices can also acquire and share various other kinds of information [2], such as heart rate, blood pressure, and time spent exercising, anytime and anywhere, which can be used for medical health monitoring, measuring athletic performance, and so on. These merits strongly motivate the study of wearable devices. With the rapid development of science and technology, wearable devices are beginning to enter people's
daily lives and are bringing great changes to society. According to the IDC survey [3], the market for wearable devices keeps rising, and we can infer that the development of wearable devices will continue to accelerate accordingly. A wearable device is usually a portable device that can be easily and comfortably worn. Such devices can provide many innovative functions such as real-time monitoring and information support [4], and they are recognized by their mobility, wearability, interactivity, etc. Smart watches, smart bracelets, and virtual reality (VR)/augmented reality (AR) devices are the most popular wearable devices. As personal daily data collection portals and decision support platforms, wearable devices can be more closely connected to individuals and the environment, and thus serve human beings. Wearable devices rely on various kinds of sensors to perceive the user's status and the surrounding environment [5]. Reliable data communication is critical for cloud or software support. At present, the sensors used in wearable devices mainly include three-axis accelerometers, three-axis gyroscopes, three-axis magnetic sensors, GPS, photoelectric heart rate sensors, altimeters, ambient light sensors, temperature sensors, bioelectrical impedance sensors, capacitance sensors, etc. These sensors can obtain data on human motion characteristics, pose characteristics, heart rate characteristics, environmental characteristics, skin state characteristics, and emotional characteristics [6–10]. At the same time, in order to better transmit the information acquired by wearable devices, many standardized communication protocols have been proposed [11]; the commonly used communication technologies include 4G/5G, WiFi, Bluetooth, etc., and the communication methods include both wired and wireless connections. Based on the functions of wearable devices and how they are worn, we present a classification of wearable devices in Fig. 1.1.
1.1.1 Wrist-Worn
Considering the convenience of wearing devices on the wrist, wearable devices are often designed to be worn there. At present, wrist-worn devices are mainly divided into
Fig. 1.1 The classification of wearable devices
the smart watch and the smart bracelet. These devices mainly provide two functions: one is to collect daily data about the human body, such as movement information, heart rate, and sleep state [12], which can provide data support for human–computer and human–environment interaction; the other is to play an important role in transmitting information between people and their smart phones.

(1) Smart Watch
The smart watch is one of the most popular wearable devices; it combines the traditional watch with modern technology. Its functions mainly include three aspects: (1) the traditional watch function of telling the time; (2) acting as a communication and notification bridge between the smart phone and the user, so that, for example, the user is notified of phone calls, e-mails, or messages and can even control the phone from the watch; and (3) helping users collect daily data such as movement information, heart rate, and sleep state for health analysis. Watches were traditionally used only to tell the time; the smart watch combines a variety of sensors that provide many more functions, setting off a trend around the world. The launch of the smart watch quickly aroused people's interest in applying wearable devices in daily life. It can work together with a smart phone and transmit data via WiFi and Bluetooth. When it is connected to a smart phone, users can receive notifications such as SMS messages and phone calls. For users who are concerned about their health status, the watch also provides useful functions: through its heart rate sensor, it can monitor the user's heart health. A variety of sensors are included in a smart watch, such as an electrical heart sensor, a gyroscope, an ambient light sensor, etc. With these sensors, the smart watch can provide a variety of functions for a better user experience. There are many smart watch products on the market, such as the Apple Watch and the Huawei Watch [13, 14]. Besides the convenience of the smart watch, there are also some concerns. The first is the power problem: users often need to charge the device frequently, which is somewhat inconvenient. The second is the binding problem with the mobile phone: most functions usually require a smart phone to be fully realized.

(2) Smart Bracelet
The smart bracelet was created mainly to record the movement of the human body, help with health monitoring, and cultivate good and scientific exercise habits. As the solutions have been upgraded, it has extended to continuous monitoring functions such as activity feedback, exercise, sleep monitoring, etc. Unlike the smart watch, the smart bracelet usually does not have a display, and it is interacted with through a connection to a smart phone. The smart bracelet is widely used in people's daily life, and there are many smart bracelet products [15]. Smart bracelets such as the Jawbone UP and Fitbit Flex can track the wearer's sleep, exercise, and diet. Once smart bracelets were launched, they quickly captured a large share of the market and attracted consumers with their accurate and exciting performance. Many manufacturers around the world have released smart bracelet products, and the smart bracelet has entered a period of rapid growth.
Huawei Band 4 [16], released in 2019, is an example of a smart bracelet with strong performance. Through its heart rate sensor and an advanced AI algorithm, Huawei Band 4 can monitor the user's heart health well, which also shows how AI algorithms improve the performance of wearable device applications. By optimizing the hardware, the algorithm, and multi-light-source fusion, Huawei Band 4 can also measure blood oxygen and help users check their blood oxygen saturation at any time. There are many other smart bracelet products as well, such as the Mi Band [17]. The advantages of the smart bracelet are that it saves power and offers several practical functions. The disadvantage is that, like the smart watch, it usually needs to be connected to a mobile phone to achieve most of its functions.
1.1.2 Head Mounted
With the increasing demand for human–computer interaction and virtual reality, head mounted devices have been developed.

(1) Smart Eyewear
Smart eyewear mainly refers to glasses with multisensor and wireless transmission functions, which are often used in VR or AR scenarios. In 1995, the portable 3D display Virtual Boy [18] was released with a game controller; it was the first time the game industry had touched the area of virtual reality. Sony announced a special virtual reality device for the PlayStation, named Project Morpheus [19]. This device can not only track the player's head position through a variety of built-in sensors, but also give feedback to players and allow them to experience first-person games more intuitively. One of the most famous smart eyewear devices is Google Glass [20], a kind of "augmented reality" glasses released by Google. It provides many of the same functions as a smart phone: it can take photos, make video calls, identify directions through voice, surf the Internet, process text messages and e-mails, etc. Microsoft HoloLens [21] is a head mounted augmented reality computing device released by Microsoft. HoloLens has the ability of three-dimensional perception and can model the three-dimensional scene around it. HoloLens also supports human–computer interaction and can be controlled by gestures. It obtains a depth image of the surrounding environment through its depth camera and then constructs models of the environment and objects through algorithmic computation. Recently, Microsoft released the new generation product, HoloLens 2, which has more functions.

(2) Ear-Buds
Without the bother of wires, Bluetooth ear-buds bring great convenience for listening to music or answering phone calls. They have become one of the most popular consumer wearable devices and best sellers in the market. With the development of technology, ear-buds can provide more and more functions.
They can have modern functions such as motion detection and speech detection and can even be controlled by voice; for example, the user can ask them to play a song, make a call, or get directions. One of the most famous intelligent ear-buds is the Apple AirPods [22]. Their built-in infrared sensor can automatically recognize whether the ear-buds are in the ear, and with support from the wireless charging case, the working time of the ear-buds can be extended. Recently, many companies have released comparable ear-bud products, such as the Huawei FreeBuds 3, the Jabra Elite series, and the Bragi Dash Pro. With the Kirin A1 chipset, the Huawei FreeBuds 3 [23] can eliminate background noise during calls and listening, and with a built-in bone voice sensor, the FreeBuds 3 can better pick up the wearer's voice through bone vibrations so that the voice is enhanced and made clearer. Wind noise usually ruins important calls when the wearer is walking on a windy day; the FreeBuds 3 has an aerodynamic mic duct design, which suppresses passing wind to reduce wind noise efficiently so that the wearer can have clearer phone calls. Unlike wrist-worn devices, ear-buds can be controlled and interacted with through voice. In the future, we may be able to conduct more interaction through ear-buds; for example, during sports, a mobile terminal could collect and analyze the user's exercise information and then give guidance through the ear-buds to help the user exercise.
1.1.3 Body Equipped
Body-worn equipment usually refers to the exoskeleton. An exoskeleton robot is a set of mechanical devices worn on the human body as an auxiliary [24]. It combines human flexibility with mechanical strength. In the military field, it can improve the combat ability of soldiers and allow them to carry more weapons and equipment. In fire fighting and disaster relief, it can help people carry heavy objects. In the medical field, it can help paraplegic patients carry out rehabilitation training. A large number of universities and research institutions are conducting relevant research, such as the University of California, Berkeley [25], the University of Electronic Science and Technology of China [26], and the University of Science and Technology of China [27]. Some famous exoskeletons are shown in Fig. 1.2.

(1) Assisted Exoskeleton
In 2000, the U.S. Defense Advanced Research Projects Agency (DARPA) began research on the Exoskeletons for Human Performance Augmentation (EHPA) project [28] to develop exoskeleton robots that can carry more ammunition and heavier weapons. In the early 2000s, an exoskeleton called XOS was launched; the XOS weighs 68 kg and allows the wearer to lift 90 kg easily. In 2010, the Raytheon company launched XOS 2 [29], which is more robust and flexible
Fig. 1.2 The typical exoskeletons. Assisted exoskeletons: the BLEEX exoskeleton (2005) [25], the HULC exoskeleton (2009) [30], the XOS 2 exoskeleton (2010) [29], and Skelex 360 (2018) [31]. Medical exoskeletons: the Rex exoskeleton (2010) [32], the HOCOMA exoskeleton (2016) [34], the AiLegs and AiWalker (2018) [35], and USTC ARC's exoskeleton (2020) [36]
than the first generation. In 2005, U.C. Berkeley's Human Engineering and Robotics Laboratory launched the Berkeley lower extreme exoskeleton (BLEEX) [25]; this exoskeleton allows a person to comfortably squat, bend, swing from side to side, twist, walk, run on ascending and descending slopes, and step over and under obstructions while carrying equipment and supplies. While wearing the exoskeleton, the wearer can carry significant loads over considerable distances without reducing his or her agility, thus significantly increasing his or her physical effectiveness. The Human Universal Load Carrier (HULC) [30] is an exoskeleton developed by Lockheed Martin for dismounted soldiers in 2009; the HULC enables soldiers to carry loads up to 200 lb (91 kg), and the weight of the load is transferred to the ground through the shoes of the exoskeleton. Skelex 360, launched in 2018, is another assisted exoskeleton [31]; it can be applied on production lines to reduce the labor burden of workers and improve their work efficiency. In addition, many research institutions and universities around the world are engaged in exoskeleton research.

(2) Medical Exoskeleton
In 2010, Rex Bionics [32] invented the bionic mechanical legs "Rex" to help paralyzed patients stand up. ReWalk's rehabilitation training exoskeleton (ReWalk Rehabilitation) [33], released in 2011, controls movement by detecting slight changes in the position of the center of gravity, and it achieves good performance in imitating human gait. In 2016, HOCOMA and the Balgrist medical rehabilitation center in Zurich launched the Lokomat exoskeleton [34], which provides an adjustable service for rehabilitation patients. In 2018, AI-Robotics [35] launched AiLegs, a bipedal
lower-limb exoskeleton rehabilitation robot, and AiWalker, a mobile-platform lower-limb exoskeleton rehabilitation robot, which can help paraplegic patients. In 2020, Li et al. developed a new generation of walking exoskeleton robot that works without abduction support in multi-terrain environments [36]. Based on bionics and ergonomics, this robot is an intelligent wearable lower-limb exoskeleton with a weight of about 40 kg. It is currently the exoskeleton robot with the largest number of DOFs in China, with 10 DOFs in the lower limbs. It also integrates a variety of sensors, such as foot force sensors, attitude sensors, and absolute value encoders. Through multisensor information fusion, the exoskeleton robot is more stable and safer in operation.
1.1.4 Smart Garment
Smart garments mainly include shirts, pants, coats, and undergarments, and there are plenty of products that let wearers monitor physiological signals and biomechanics. Athos [37] is characterized by built-in electromyography sensors, which are mainly used to monitor the intensity of muscle movement in several main parts of the body so as to prevent users from damaging their muscles; it is most suitable for wearing in the gym. In addition, its app is quite intelligent: it clearly visualizes muscle activation and heart rate and provides feedback to users, just like a personal fitness coach. Athos garments can also be washed and dried at will and reused, just like ordinary clothes. LUMO Run [38] is a pair of sports shorts offering an intelligent running experience. It comes in men's and women's styles and is packed with sensors that can monitor movement, pelvic rotation, step length, and other details, which is especially helpful for exercise monitoring and fat burning. LUMO Run supports real-time training, sending feedback to headphones to help improve running form and reduce the chance of injury. The battery can even last for one month on a single charge.
1.1.5 Smart Shoes
With the popularity of smart bracelets and smart watches, many manufacturers have begun to develop more and more wearable devices. Similar to other wearable devices, smart shoes find potential application scenarios in tracking human movement signals and biomechanics. In 2019, Under Armour launched a smart running shoe called the UA HOVR Infinite [39]. The UA HOVR Infinite running shoes have a high-fidelity sensor chip embedded in the right midsole, which connects in real time to the UA Run application through Bluetooth Low Energy (BLE). Once connected, the running shoes can record, analyze, and store detailed running data and then offer guidance to help runners improve
their performance. Nike also launched a series of smart shoes called Adapt [40], which can adjust electronically to the shape of the foot and create an ideal, individualized fit. Although there are many smart shoe products, some practical problems remain: the functions of many products do not clearly improve the user experience in daily life, and prices are high because of the manufacturing cost.
1.1.6 Data Gloves
The data glove is one of the natural interaction devices in human–computer interaction applications [41], as shown in Fig. 1.3. Its direct purpose is to obtain the posture of the human hand in real time and send the posture information to the computer. As a device for scene simulation and interaction, it can help grasp, move, assemble, operate, and control objects. It supports both wireless and wired connection modes and is easy to use and operate. With its wide operating range and high data quality, it has become an ideal tool for human–computer interaction, especially for VR systems that need a hand model with multiple DOFs to operate complex virtual objects [42]. The gesture of the human hand is accurately transferred to the virtual environment in real time, and contact information with the virtual object can be sent back to the operator. With the data glove, the operator can interact with the virtual world in a more direct, natural, and effective way. The data glove can also be used for human–robot interaction. Nowadays, it has gradually been applied in various fields such as robotic systems, surgical operation, virtual assembly training, sign language recognition, education, entertainment, and so on.
Fig. 1.3 Data glove
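As a rough illustration of how a data glove turns raw sensor readings into hand posture, the sketch below assumes a glove built from flex (bending) sensors read through simple voltage dividers and maps each reading to a joint angle. The supply voltage, reference resistor, and calibration resistances are illustrative assumptions, not the specification of any particular glove.

```python
# Minimal sketch: converting flex-sensor readings of a data glove into
# finger joint angles. All constants (supply voltage, divider resistor,
# calibration points) are illustrative assumptions, not values from this book.

def flex_resistance(v_out: float, v_supply: float = 3.3, r_ref: float = 10_000.0) -> float:
    """Resistance of the flex sensor in a voltage divider:
    v_out = v_supply * r_ref / (r_ref + r_flex)  =>  r_flex = r_ref * (v_supply / v_out - 1)."""
    return r_ref * (v_supply / v_out - 1.0)

def bend_angle(r_flex: float, r_flat: float = 25_000.0, r_bent90: float = 45_000.0) -> float:
    """Linearly interpolate between the calibrated flat (0 deg) and 90-deg resistances."""
    return 90.0 * (r_flex - r_flat) / (r_bent90 - r_flat)

def glove_pose(adc_volts: dict[str, float]) -> dict[str, float]:
    """Map one voltage sample per finger to an estimated joint angle in degrees."""
    return {finger: bend_angle(flex_resistance(v)) for finger, v in adc_volts.items()}

# Example: one sample frame from five flex sensors (volts).
sample = {"thumb": 0.82, "index": 0.71, "middle": 0.77, "ring": 0.88, "little": 0.80}
print(glove_pose(sample))
```

In practice a real glove would stream such per-joint angles to the host at a fixed rate, which is the "posture information" the text refers to.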
1.2 Sensors of Wearable Device
With the development of sensor technologies, three basic categories of sensors have come to be used in wearable devices. The first is the motion capture sensor, which is used to capture human motion. The second is the tactile sensor, which helps users obtain information about the external environment. The third is the physiological parameter measurement sensor, which is used to monitor physiological parameters of the human body, as shown in Fig. 1.4.

1.2.1 Motion Capture Sensor
Motion capture is of vital importance for the development of intelligent wearable devices. A wearable device must be able to perceive human motion in order to interact, cooperate, or imitate in an intelligent manner [43]. Common motion capture sensors used in upper-limb wearable devices include inertial sensors, bending sensors, electromyography (EMG) signals, and vision.

(1) Inertial Sensor
Inertial sensors usually combine several types of sensors, such as accelerometers, gyroscopes, and magnetometers, and are used to track the position and orientation of objects [44, 45]. The inertial measurement unit (IMU) is often used to estimate the joint flexion and extension angles of the upper limbs [46, 47]. With the development of Micro Electrical Mechanical Systems (MEMS), micro inertial sensors offer numerous advantages such as small size, low power consumption, large dynamic range, and low cost, and they have gradually become the main sensors for human motion capture; they are non-obtrusive, comparatively cost effective, and easy to integrate [48]. The position of each joint can be estimated by double integrating the acceleration components once the gravity vector has been removed. There are many advantages to using inertial sensors in wearable devices. Firstly, MEMS-based inertial sensors are lightweight, well suited to fast motion tracking, and cover a large sensing range [49, 50]. Secondly, they are independent of any infrastructure and do not require an external source; they are completely self-contained [51, 52]. Thirdly, they are flexible and easy to wear, with little limitation on placement [53, 54]. However, they lack long-term stability due to severe zero drift, and a major disadvantage of inertial sensors is that position or orientation is estimated by integrating accelerations or angular velocities, so errors accumulate [55]. Therefore, they must be calibrated before use, and their output needs to be corrected by fusion with other sensors [56, 57].
Fig. 1.4 Sensors for wearable devices: motion capture sensors (inertial sensors, bending sensors, EMG, and vision), tactile sensors, and physiological parameter measurement sensors
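To make the double-integration estimate mentioned above concrete, the following is a minimal sketch (our own illustration, with placeholder data): orientation estimates from a separate sensor-fusion filter are used to rotate each accelerometer sample into the global frame, gravity is subtracted, and the residual acceleration is integrated twice. The quadratic growth of the position error in the toy example is exactly the drift problem noted above.

```python
import numpy as np

G = np.array([0.0, 0.0, 9.81])  # gravity in the global frame (m/s^2)

def dead_reckon(accel_body, R_body_to_global, dt):
    """Double-integrate accelerometer samples after removing gravity.

    accel_body:       (N, 3) accelerometer readings in the body frame (m/s^2)
    R_body_to_global: (N, 3, 3) rotation matrices from a separate orientation filter
    dt:               sampling period (s)
    Returns velocity and position histories; any bias or noise is integrated
    as well, which is why the error accumulates over time.
    """
    v = np.zeros(3)
    p = np.zeros(3)
    velocities, positions = [], []
    for a_b, R in zip(accel_body, R_body_to_global):
        a_global = R @ a_b - G      # rotate to global frame, remove gravity
        v = v + a_global * dt       # first integration: velocity
        p = p + v * dt              # second integration: position
        velocities.append(v.copy())
        positions.append(p.copy())
    return np.array(velocities), np.array(positions)

# Toy usage: a stationary sensor still drifts because of a small bias.
N, dt = 200, 0.01
R_seq = np.tile(np.eye(3), (N, 1, 1))                 # identity orientation
acc = np.tile(np.array([0.02, 0.0, 9.81]), (N, 1))    # 0.02 m/s^2 bias on x
_, pos = dead_reckon(acc, R_seq, dt)
print("drift after 2 s:", pos[-1])                    # grows quadratically with time
```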
(2) Bending Sensor
The bending sensor changes its resistance depending on how much the sensor is bent: it converts the bend into a change in electrical resistance, and the greater the bend, the higher the resistance [58]. Its function is to estimate the bending angle of the object to which it is attached. When the substrate of the bending sensor is bent, the sensor produces a proportional electrical resistance output, and the associated circuits convert this change into a bending angle. The advantages of the bending sensor are that its design and manufacture are tractable and that it offers flexibility in design and in accommodating anatomical variations. Because the bending sensor can accommodate the flexible and deformable nature of the human body, it is becoming more and more popular. Such sensors have been developed by several laboratories [59, 60] and by well-known companies such as 5DT [61, 62] and Flexpoint Sensor Systems [63], and the bending sensor has been widely used in rehabilitation [64] and motion tracking [65] in recent years. However, it still suffers from a lack of durability, which has limited its development to some extent.

(3) EMG
EMG is the acronym for electromyography, an electrodiagnostic medical technique for evaluating and recording the electrical activity produced by skeletal muscles [66]. An electromyograph is the instrument used in EMG to detect the electric potential generated by muscle cells when these cells are electrically or neurologically activated. The signals can be analyzed to detect medical abnormalities, activation levels, and recruitment orders, and to analyze the biomechanics of human or animal movement [67]. For machine control, they can also be interpreted and translated into useful computer commands. Thus, wearable devices based on EMG can track motion and provide information for control purposes [68], and there have been many applications using EMG for hand gesture recognition [69], motion analysis [70], and teleoperation [71]. The advantage of EMG is that it is easy to set up and can detect the motion of human hands and arms indirectly and noninvasively. However, its simple setup is prone to motion artifacts, noise, and possible readout failures.
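As a hedged illustration of how EMG signals can be interpreted and translated into computer commands, the snippet below computes two classic time-domain features over sliding windows: root mean square (RMS) and waveform length (WL), both of which appear in this book's acronym list. Such generic features are commonly fed to a classifier (e.g., an SVM) for gesture recognition; this is an assumption-laden sketch, not the authors' specific pipeline.

```python
import numpy as np

def emg_features(signal, window=200, step=100):
    """Compute RMS and waveform-length (WL) features over sliding windows.

    signal: 1-D array of raw EMG samples from one channel
    window: window length in samples
    step:   hop between consecutive windows
    Returns an (n_windows, 2) array of [RMS, WL] feature vectors that could be
    passed to any classifier (e.g., an SVM) for gesture recognition.
    """
    feats = []
    for start in range(0, len(signal) - window + 1, step):
        w = signal[start:start + window]
        rms = np.sqrt(np.mean(w ** 2))            # signal energy in the window
        wl = np.sum(np.abs(np.diff(w)))           # cumulative waveform length
        feats.append([rms, wl])
    return np.array(feats)

# Toy usage with synthetic data standing in for a real EMG recording.
rng = np.random.default_rng(0)
fake_emg = rng.normal(scale=0.1, size=2000) + 0.3 * np.sin(np.linspace(0, 60, 2000))
print(emg_features(fake_emg).shape)   # (19, 2)
```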
(4) Vision
Vision-based wearable devices mainly come in two types: recognition with markers and RGB-D recognition.
Recognition with markers attaches several markers to the upper limb; vision-based recognition is then used to scan and extract data to obtain the displacements of the markers [72]. In general, the markers' movements are caused by a set of predefined motions of the upper limb, and the displacement of each marker relative to the underlying bone is observed and quantified. In one specific study, MRI (magnetic resonance imaging) is first used to capture a static upper limb in several different poses and to reconstruct 3D models of the bones. The wearable device with reflective markers is then worn on the subject's upper limb, and a sequential protocol is used to track the markers' positions in each posture; from the marker displacements, the posture of the upper limb can be estimated [73]. The advantages of this method are that it is not affected by magnetic disturbance or gravitation and that it offers higher precision and faster measurements. The drawbacks are that the markers can be hard to identify and may be occluded in dexterous tasks; in addition, it is inconvenient to use since the markers have to adhere to the hand's surface [74].
In RGB-D recognition, an RGB-D sensor is used to acquire color and depth information. Computationally efficient action features are extracted from the images provided by the depth and color videos; the essence of the process is to integrate the two kinds of information to estimate motion. Based on this estimation, computers transform the real features into digital data that can later be utilized [75]. The advantage of using RGB-D is that it can provide precise egomotion estimation with a low drift rate over the long term, and the user's hands are free to be at any position without wearing special clothing or any wearable device [76]. The drawback is that it is easily affected by the environment and easily disturbed by illumination [77].
1.2.2 Tactile Sensors
A tactile sensor can be defined as a device or system that measures a given property of an object or contact event through physical contact between the sensor and the object [78]. It is of vital importance for helping users obtain force information, as well as information about the external environment in the form of haptic feedback. Reference [79] proposes a scalable tactile glove integrated with 548 sensors, as shown in Fig. 1.5. The sensor array is assembled on a knitted glove and consists of a piezoresistive film connected by a network of conductive-thread electrodes that are passively probed. Using the scalable tactile glove and deep convolutional neural networks (CNNs), the authors show that sensors uniformly distributed over the hand can be used to identify individual objects, estimate their weight, and explore the typical tactile patterns that emerge while grasping objects. In addition, other wearable devices for the upper limbs have been developed based on tactile sensors, such as the ergonomic vibrotactile feedback device for the human arm [80], the exoskeleton glove "RML" [81], and so on.
Fig. 1.5 A scalable tactile glove [79]
Wearable devices based on tactile sensors have been used for dynamic signature verification [82], human and object recognition [83], tactile-driven dynamic grasp control [84], providing tactile feedback for virtual instruments [85], grasp analysis [86], etc. In order to better adapt to the different shapes of wearable devices, various tactile sensors have been developed over the last two decades [87, 88]. In the future, we believe that tactile sensors will become smaller, lighter, and more flexible.
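To give a flavor of the learning pipeline described above, the toy PyTorch model below maps a single tactile pressure frame to object-class logits; the 32 x 32 frame size, layer widths, and number of classes are invented for the example and do not reflect the architecture used in [79].

```python
import torch
import torch.nn as nn

class TactileCNN(nn.Module):
    """Toy CNN that classifies one pressure frame into one of `num_classes` objects."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                            # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                            # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):                               # x: (batch, 1, 32, 32)
        h = self.features(x)
        return self.classifier(h.flatten(start_dim=1))

model = TactileCNN()
frames = torch.rand(4, 1, 32, 32)                       # a batch of fake pressure maps
logits = model(frames)                                  # shape (4, num_classes)
```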
1.2.3 Physiological Parameter Measurement Sensors
Physiological parameter measurement sensors are usually used in upper-limb wearable devices for real-time detection of human physiological parameters. They allow healthcare researchers to monitor patients at home, in hospital, or outdoors, and they enable remote medical diagnosis that spares patients long trips to the hospital [89]. Many such sensors have been developed for heart rate monitoring [90], non-invasive blood pressure measurement [91], sleep apnea monitoring [92], arterial pulse waveform measurement [93], etc. The demand for physiological parameter measurement sensors will keep increasing as people seek innovative, low-cost, and flexible systems for staying aware of changes in their physiological and non-physiological body parameters.
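As a small example of the signal processing behind such sensors, the sketch below estimates heart rate from a pulse-like waveform by detecting peaks with SciPy; the synthetic signal and the spacing and prominence settings are rough assumptions for illustration only.

```python
import numpy as np
from scipy.signal import find_peaks

def heart_rate_bpm(pulse, fs):
    """Estimate heart rate (beats per minute) from a pulse waveform sampled at fs Hz."""
    pulse = pulse - np.mean(pulse)                                 # remove DC offset
    peaks, _ = find_peaks(pulse, distance=int(0.4 * fs),           # peaks >= 0.4 s apart
                          prominence=np.std(pulse))
    if len(peaks) < 2:
        return None
    beat_intervals = np.diff(peaks) / fs                           # seconds per beat
    return 60.0 / np.mean(beat_intervals)

fs = 100
t = np.arange(0, 30, 1 / fs)
print(heart_rate_bpm(np.sin(2 * np.pi * (70 / 60) * t), fs))       # roughly 70
```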
1.3 Wearable Computing Algorithms
Wearable computing algorithms mainly address motion capture and motion recognition tasks. The motion capture process involves sensor calibration, multi-sensor estimation, and data fusion, while motion recognition involves motion identification, tracking, classification, and mapping. Various wearable computing algorithms have been applied to the field of upper-limb recognition. Here, we introduce several classical and commonly used algorithms.
1.3.1 Motion Capture Related Algorithms
Three common motion capture algorithms are described in the following.
(1) Kalman Filter
The Kalman filter (KF) is a real-time recursive algorithm used to optimally estimate the underlying states of a series of noisy and inaccurate measurements observed over time. It is optimal in the sense of minimum mean squared error in the jointly Gaussian case, and in the sense of linear minimum mean squared error when only the first two moments are available. It has attracted extensive attention because of its favorable recursive nature [94] and has been widely used to fuse multi-sensor data for wearable devices [65, 95].
(2) Extended Kalman Filter
The Kalman filter relies heavily on the linear Gaussian assumption. To deal with nonlinear models, many derived versions of the KF have been proposed, of which the extended Kalman filter (EKF) is the most typical and commonly used, with applications in full-body inertial motion capture [96] and in compensating both orientation and position drift in hand pose estimation [97].
(3) Particle Filter
Particle filter (PF) based tracking and gesture recognition systems with wearable devices have recently become popular compared with other methods. Unlike the KF, the PF makes no assumption about the posterior model, and it is very effective in estimating the state of dynamic systems from sensor information. The key idea is to represent probability densities by a set of samples; as a result, it can represent a wide range of probability densities, allowing real-time estimation of nonlinear, non-Gaussian dynamic systems [98]. The PF has also been widely used in wearable motion tracking systems to exploit optimal probabilistic filtering of IMU signals [99].
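To illustrate the recursion behind the Kalman filter in (1), here is a minimal one-dimensional constant-velocity filter that smooths noisy joint-angle measurements; the time step, noise covariances, and variable names are illustrative assumptions rather than values from any cited system.

```python
import numpy as np

def kalman_joint_angle(measurements, dt=0.01, q=1e-3, r=1e-2):
    """Linear Kalman filter over noisy angle measurements, state = [angle, rate]."""
    F = np.array([[1.0, dt], [0.0, 1.0]])    # constant-velocity state transition
    H = np.array([[1.0, 0.0]])               # only the angle is measured
    Q = q * np.eye(2)                        # process noise covariance (assumed)
    R = np.array([[r]])                      # measurement noise covariance (assumed)

    x = np.zeros((2, 1))                     # initial state estimate
    P = np.eye(2)                            # initial estimate covariance
    filtered = []
    for z in measurements:
        x = F @ x                            # predict state
        P = F @ P @ F.T + Q                  # predict covariance
        y = np.array([[z]]) - H @ x          # innovation
        S = H @ P @ H.T + R                  # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
        x = x + K @ y                        # update state
        P = (np.eye(2) - K @ H) @ P          # update covariance
        filtered.append(x[0, 0])
    return np.array(filtered)
```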
1.3.2 Motion Recognition Related Algorithms
In order to apply wearable devices in practice, motion recognition algorithms are necessary. The details of these algorithms are presented below.
(1) K-Nearest Neighbors
The K-nearest neighbors (KNN) algorithm labels each point of the input space with the label of its k closest neighbors; the distances are generally computed according to the Euclidean norm, and the parameter k is usually tuned through cross-validation. An application of KNN to gesture classification with wearable devices can be found in [100].
(2) Hidden Markov Model
The hidden Markov model (HMM) is a very popular gesture classification algorithm. It is a doubly stochastic process that consists of an underlying Markov chain with a finite number of states and a set of random functions, each associated with one state. Each transition between states carries a pair of probabilities: (a) the transition probability, which gives the probability of undergoing the transition, and (b) the output probability, which defines the conditional probability of emitting an output symbol from a finite alphabet given a state [98]. The HMM is rich in mathematical structure and has been used efficiently for recognition with wearable devices [101].
(3) Support Vector Machine
The support vector machine (SVM) is a large-margin classifier used for classification and regression. The SVM maps vectors into a much higher-dimensional space and constructs a maximum-margin hyperplane that separates clusters of vectors. New objects are then mapped into the same space and assigned to a class according to the region they fall in. This method minimizes an upper bound on the generalization error and provides excellent generalization ability. It is effective in high-dimensional spaces and is compatible with different kernel functions specified for the decision function, with common kernels readily available and the freedom to specify custom kernels [102]. The SVM has been widely used to improve the classification performance of wearable devices, sometimes in combination with other classification methods [103].
(4) Artificial Neural Networks
An artificial neural network (ANN) is a supervised learning classifier composed of many neurons. Each neuron receives input data, processes it, and produces output data. ANNs can be used to estimate functions that depend on a large number of inputs. Because of their natural and concise multi-class classification ability, ANNs have been used for motion classification tasks with wearable sensors [104]. For example, accelerometer data is classified with an ANN for forearm control [105], and in [106] an ANN is used to estimate forearm pronation/supination and elbow flexion/extension.
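A minimal sketch of the K-nearest-neighbors rule from (1), written with NumPy only; the two-dimensional features and two gesture classes are fabricated purely to exercise the function.

```python
import numpy as np

def knn_predict(train_feats, train_labels, query, k=3):
    """Label one gesture feature vector by majority vote of its k nearest neighbors."""
    dists = np.linalg.norm(train_feats - query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                       # indices of the k closest samples
    labels, counts = np.unique(train_labels[nearest], return_counts=True)
    return labels[np.argmax(counts)]

X = np.array([[0.1, 0.2], [0.0, 0.3], [0.9, 0.8], [1.0, 1.1]])   # training features
y = np.array([0, 0, 1, 1])                                        # gesture labels
print(knn_predict(X, y, np.array([0.95, 0.9]), k=3))              # prints 1
```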
(5) Dynamic Time Warping
Dynamic time warping (DTW) is an effective algorithm based on dynamic programming for matching a pair of time sequences that contain temporal variability. It is an important component of template-based pattern recognition and classification tasks, and it is particularly useful for personalized gesture recognition, where a large set of training samples is hard to collect [107]. Recently, there have been many DTW applications for online time-series recognition in gesture recognition with wearable sensors [108].
(6) Extreme Learning Machine
The extreme learning machine (ELM) is a learning approach built on a single-hidden-layer feedforward neural network. Its major advantage is that it uses a convex optimization technique instead of non-deterministic and time-consuming backpropagation to train the network. It assigns the weights of the first layer randomly, so optimizing the weights of the output layer only requires solving a linear system, which can be done in a single optimization step [109]. Besides that, it offers better generalization performance at a much faster learning speed and is insensitive to its parameters. Recently, it has been used for 3D human gesture capture and recognition [48].
(7) Deep Learning
Deep learning (DL) is an emerging machine learning technique based on the well-developed concept of neural networks, augmented with a series of improvements in the structure and training of such networks. It was originally designed for image classification and recognition, imitating the working mechanism of the animal visual cortex. Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are two popular deep learning architectures. A few recent studies utilize deep learning frameworks to process wearable sensor data [110].
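To make the template-matching idea in (5) concrete, the sketch below computes the classic dynamic-programming DTW distance between two one-dimensional sequences and uses it to pick the closest gesture template; the templates and the observed signal are fabricated for the example.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two 1-D time series."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(seq_a[i - 1] - seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# Pick the gesture template with the smallest DTW distance to the observation
templates = {"wave": np.sin(np.linspace(0, 2 * np.pi, 50)),
             "push": np.linspace(0.0, 1.0, 50)}
observed = np.sin(np.linspace(0, 2 * np.pi, 64)) + 0.05 * np.random.randn(64)
best = min(templates, key=lambda name: dtw_distance(observed, templates[name]))
print(best)                                        # "wave" for this observation
```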
1.3.3 Comparison of Different Wearable Computing Algorithms
The advantages and disadvantages of the different wearable computing algorithms are summarized in Table 1.1. The comparison shows that each algorithm has its own strengths and weaknesses. In order to achieve better performance, an appropriate combination of algorithms should be considered.
Table 1.1 Comparison of different wearable computing algorithms

Algorithm | Advantages | Disadvantages
KF | Optimal in the sense of minimum mean squared error [94] | Relies heavily on the linear Gaussian assumption [94]
EKF | Can deal with nonlinear models efficiently | The numerous parameters are difficult to choose and compute [111]
PF | Effective in estimating the state of dynamic systems from sensor information [98] | Computationally expensive [99]
KNN | Simple and effective | k needs to be chosen carefully [112]
HMM | Flexibility of training and verification, model transparency [113] | Many free parameters need to be adjusted [113]
SVM | Effective in high-dimensional spaces and compatible with different kernel functions for the decision function [102] | Time-consuming and difficult to apply to large-scale training samples
ANN | Natural and concise multi-class classification ability and a fixed-size parametric model [104] | Poor accuracy when the data sample length is insufficient
DTW | Template based; identifies temporal variations by selectively warping the time scale of the observation sequence to the template [114] | Time and space consuming
ELM | Faster learning speed, insensitive to parameters [48] | Cannot encode more than one layer of abstraction
DL | Reduces the need for feature engineering [115] | Requires a large amount of data, computationally expensive [115]
1.4 Applications
For a long time, motion capture and recognition applications have been dominated by image-based devices, and wearable devices have not been widely used. However, highly reliable camera-based systems require a complex stationary setup with multiple cameras and a permanent line of sight to the tracked objects [116]. In addition, a single camera suffers from low robustness, a stereo camera suffers from computational complexity and calibration difficulties, the Kinect suffers from distance limitations, and all image-based devices can be affected by illumination [112]. In contrast, wearable devices can be easily attached to the human body, which is convenient for human motion capture and recognition, and they overcome the spatial limitations of a stationary camera setup. Therefore, many scientists, engineers, and innovators are interested in practical applications of wearable devices. With the development of smart fabrics, 3D printing, manufacturing, and sensor technologies, wearable devices have been widely used in interaction and healthcare over the last few years.
1.4.1 Interaction
The applications of wearable devices in interaction can be divided into virtual reality (VR), interaction with robots, and sign language recognition.
(1) Virtual Reality
VR has received a lot of scientific and public attention in the past few years and is on its way to becoming an integral part of our lives. Devices that are easy to wear, good at measuring human motion, and capable of generating precise force feedback are required to interact effectively with a human. For VR applications, users must be able to put the device on and take it off easily. There have been many VR applications using wearable devices, such as making games more immersive and realistic [117], identifying virtual 3D geometric shapes [118], visualizing physical activity in virtual reality [119], and so on. Because wearable haptics can provide the compelling illusion of touching superimposed virtual objects without constraining the motion or workspace of the user, it will find even broader application in the VR area.
(2) Interaction with Robot
Interaction with robots using wearable devices is a hot research topic in robotics. One of the most popular ways to interact with a robot is teleoperation. In robotic teleoperation, the robot receives instructions from a distant human operator through sensors or a control mechanism over a communication network, and in turn sends its status and environment information back to the operator as feedback. Only efficient interaction between the robot and the human operator can make a robotic teleoperation system perform well. A comprehensive examination of the human role in robotic teleoperation is presented in [120], which stresses the importance of human performance. There are a variety of human–machine interaction interfaces for robotic teleoperation systems. Traditional user interfaces such as keyboards or joysticks are the most fundamental; they are simple and reliable to use, although they constitute an open-loop control approach. With the rapid development of sensing technology and wearable devices, more and more human–machine interaction interfaces have emerged. Recently, teleoperation systems [121, 122] with haptic devices have become increasingly popular. With the force/tactile feedback provided by the haptic device, the operator can control the robot more precisely and flexibly; it can even be used in delicate surgery to localize the position of a tumor [123]. Besides these contacting interfaces, Kofman et al. [124] proposed a vision-based human–machine interface that provides a more intuitive way of working for a human operator.
(3) Sign Language Recognition
Sign language, which involves voluntary movements of the hands and fingers to express a clear action, plays a vital role for deaf and mute people in communicating among themselves or with hearing people in a non-verbal manner.
A sign language recognition (SLR) system serves as an important assistive tool to bridge the communication gap between this community and individuals who do not know how to translate sign language into text or speech. SLR can also serve as a reference design for gesture-based human–computer interfaces, which share the same principles and are widely used in daily life [125]. However, the major issue with communication using sign language is that the majority of people without speech impairment have no understanding of sign languages and hence are largely incapable of communicating effectively with speech-impaired people [126]. Given that gesture is an intuitive way of conveying a specific meaning or intent, researchers in sign language interpretation have used wearable devices as an auxiliary tool to help deaf and mute people integrate into society without barriers [127]. There are different sign languages in different regions, and there have been many studies using wearable devices to recognize them, including Indian Sign Language [126], American Sign Language [128], Pakistani Sign Language [129], Sinhala Sign Language [130], and so on. Research in this area is bound to become a hot topic in the next few years.
1.4.2 Healthcare
Applications of wearable devices in healthcare include behavior monitoring, health monitoring, and medical rehabilitation evaluation.
(1) Daily Behavior Monitoring
Wearable devices equipped with sensors are good candidates for monitoring users' daily behavior thanks to their small size, reasonable computational power, and practical power capabilities. Recent advances in wearable technology have led to wearable, non-intrusive systems for daily behavior monitoring, which motivate users to maintain a healthy lifestyle to some extent. Examples include recognizing daily and sports activities [131], hand gesture and daily activity recognition for robot-assisted living within smart assisted-living systems [132], and detecting arm gestures related to meal intake [133]. Such systems empower users to quantify and take control of their lifestyle.
(2) Health Monitoring
Health monitoring is a technique for the early detection of signs of health damage through continuous monitoring of physiological parameters or regular inspection and analysis. Clinically, health monitoring is widely used for vital-sign monitoring and disease diagnosis. For vital-sign monitoring, doctors usually need to continuously monitor a patient's heart rate, blood pressure, and other parameters after intensive care and surgery, and to detect dangerous signals in a timely manner. Although clinically used health monitoring devices can achieve non-invasive measurement of parameters such as the electrocardiogram, blood pressure, and oxygen saturation, they are not suitable for home and personal use. Therefore, providing practical wearable health monitoring devices for home users has become one of the most important research directions in the field of health monitoring,
including heart rate monitoring [90], non-invasive blood pressure measurement [91], sleep apnea monitoring [92], and so on.
(3) Medical Rehabilitation
The ultimate goal of the medical rehabilitation process is to fully recover people from temporary motor impairments or, in the case of a permanent disorder, at least to mitigate patients' struggles by aiming at as high a level of independence as possible [134]. With the development of wearable devices, patients can undergo rehabilitation training at home without the need to visit a specialized center daily or weekly; in the meantime, the effectiveness of rehabilitation can be increased and the recovery time reduced. Medical rehabilitation applications, such as arthritis rehabilitation [135] and post-stroke rehabilitation [136], have been widely investigated around the world using wearable devices.
1.4.3 Manipulation Learning
Robotic manipulation learning is another important type of application using wearable devices. The goal is to help robots learn operation skills from humans effectively, in other words, to transfer human experience to the robot. It is a useful technique for augmenting a robot's behavioral inventory, especially for small or medium-size production lines, where the production process needs to be adapted or modified frequently [137]. In order to achieve this goal, the wearable devices need to acquire human manipulation data, build a mapping between the human and the robot, and extract skill features from the mapped manipulation data, just as in the human experience learning system shown in [138], with its teleoperation scheme for a robotic system using wearable devices (Fig. 1.6). Some wearable devices such as data gloves are also commonly used in robotic manipulation teleoperation systems. With the help of data gloves, the robot can perform in a more natural and intuitive manner. Usually, the configuration of the data glove is mapped to that of a robotic hand, and the robotic hand then tracks the motion of the data glove [139, 140].
Fig. 1.6 Robotic teleoperation with wearable device
In [141], Hu et al. proposed an arm/hand teleoperation system in which the human operator uses a data glove to control the movement of a dexterous hand. However, current data gloves are typically only used to teleoperate the robotic hand, while the arm is controlled separately [142]. In order to better coordinate the movement of the arm/hand system, we use a data glove that can capture the motion of both the hand and the arm to illustrate the effectiveness of wearable devices in robotic manipulation systems. During the process of manipulation learning, the mapping between the wearable devices and the robot is of vital importance. Because it removes the burden of robot programming from experts, there has been much research on robotic manipulation learning using wearable devices [143, 144]. In the foreseeable future, we believe that robotic manipulation learning using wearable devices will attract more and more attention.
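As a minimal sketch of the glove-to-robot mapping just discussed, the code below linearly rescales glove joint angles from an assumed human range of motion into an assumed set of robot joint limits; the joint names and ranges are invented for illustration, and real systems calibrate them per user and per robot hand.

```python
import numpy as np

# Assumed joint ranges in radians (purely illustrative)
HUMAN_RANGE = {"index_mcp": (0.0, 1.6), "index_pip": (0.0, 1.9), "thumb_mcp": (0.0, 1.0)}
ROBOT_RANGE = {"index_mcp": (0.0, 1.4), "index_pip": (0.0, 1.7), "thumb_mcp": (0.0, 0.9)}

def map_glove_to_robot(glove_angles):
    """Linearly map glove joint angles into the robot hand's joint limits."""
    robot_angles = {}
    for joint, theta in glove_angles.items():
        h_lo, h_hi = HUMAN_RANGE[joint]
        r_lo, r_hi = ROBOT_RANGE[joint]
        frac = np.clip((theta - h_lo) / (h_hi - h_lo), 0.0, 1.0)   # normalize to [0, 1]
        robot_angles[joint] = r_lo + frac * (r_hi - r_lo)
    return robot_angles

print(map_glove_to_robot({"index_mcp": 0.8, "index_pip": 1.0, "thumb_mcp": 0.5}))
```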
1.5 Summary
A review of recent advances in wearable devices has been presented. The wearable devices are classified by wearable type, and the commonly used sensors are described in detail along with their advantages and disadvantages. In addition, several classical and commonly used wearable computing algorithms for motion capture and recognition are introduced and their performances compared. We have also highlighted some application areas in the field. Clearly, this field is progressing at an unprecedented rate, and it is certain that wearable devices will become smaller, safer, more comfortable, and more powerful at lower cost.
References 1. Prasad S, Kumar P, Sinha K (2014) A wireless dynamic gesture user interface for HCI using hand data glove. In: 7th international conference on contemporary computing 2014, IC3 2014, pp 62–67. https://doi.org/10.1109/IC3.2014.6897148 2. Kale S, Mane S, Patil P (2017) Wearable biomedical parameter monitoring system: A review, pp 614–617. https://doi.org/10.1109/ICECA.2017.8203611 3. IDC survey, IDC, USA, [Online]. Available from: https://www.idc.com/getdoc.jsp? containerId=prAP44909819 4. Lee BC, Chen S, Sienko K (2011) A Wearable device for real-time motion error detection and vibrotactile instructional cuing. IEEE Trans Neural Syst Rehabil Eng Publ IEEE Eng Med Biol Soc 19:374–81. https://doi.org/10.1109/TNSRE.2011.2140331 5. Bianchi V, Grossi F, De Munari I, Ciampolini P (2012) MUlti sensor assistant: A multisensor wearable device for ambient assisted living. J Med Imag Health Inf 2:70–75. https://doi.org/ 10.1166/jmihi.2012.1058 6. Maurer U, Smailagic A, Siewiorek DP, Deisher M (2006) Activity recognition and monitoring using multiple sensors on different body positions. In: Proc int workshop wearable implantable body sensor netw. 2006, 4 pp. https://doi.org/10.1109/BSN.2006.6 7. Casale P, Pujol O, Radeva P (2009) Physical activity recognition from accelerometer data using a wearable device. Lect Notes Comput Sci 6669:289–296
8. Dinh A, Teng D, Chen L, Ko S-B, Shi Y, McCrosky C, Basran J, Bello-Hass V (2009) A wearable device for physical activity monitoring with built-in heart rate variability. In: 3rd international conference on bioinformatics and biomedical engineering, iCBBE 2009, pp 1– 4. https://doi.org/10.1109/ICBBE.2009.5162260 9. Cole C Blackstone E, Pashkow F, Pothier C, Lauer M (2000) Heart rate recovery immediately after exercise as a predictor of mortality. J Cardiopulm Rehabil 20:131–132. https://doi.org/ 10.1097/00008483-200003000-00012 10. Asada H, Shaltis P, Reisner A, Rhee S, Hutchinson R (2003) Mobile monitoring with wearable photoplethysmographic biosensors. IEEE Eng Med Biol Mag Q Mag Eng Med Biol Soc 22:28–40. https://doi.org/10.1109/MEMB.2003.1213624 11. Hachisuka K, Nakata A, Takeda T, Shiba K, Sasaki K, Hosaka H, Itao K (2003) Development of wearable intra-body communication devices. Sensors Actuat A Phys 105:109–115. https:// doi.org/10.1016/S0924-4247(03)00060-8 12. Bifulco P, Cesarelli M, Fratini A, Ruffo M, Pasquariello G, Gargiulo G (2011) A wearable device for recording of biopotentials and body movements. In: MeMeA 2011 - 2011 IEEE international symposium on medical measurements and applications, proceedings, pp 469– 472. https://doi.org/10.1109/MeMeA.2011.5966735 13. Apple Watch Series 5-Apple, Apple Inc., Cupertino, CA, USA, [Online]. Available from: https://www.apple.com/apple-watch-series-5/ 14. Huawei Watch GT 2, Huawei, China, [Online]. Available from: https://consumer.huawei.com/ en/wearables/watch-gt2/ 15. Huang Y, Junkai X, Bo Yu, Peter BS (2016) Validity of FitBit, Jawbone UP, Nike+ and other wearable devices for level and stair walking. Gait Posture 48:36–41 16. Huawei Band 4, Huawei, China, [Online]. Available from: https://consumer.huawei.com/en/ wearables/band4/ 17. Mi Smart Band 4, Xiaomi, China, [Online]. Available from: https://www.mi.com/global/mismart-band-4 18. Virtual Boy, Nintendo, Japan, [Online]. Available from: https://www.nintendo.com/ consumer/systems/virtualboy/index.jsp 19. Goradia I, Jheel D, Lakshmi K (2014) A review paper on oculus rift & project morpheus. Int J Current Eng Technol 4(5):3196–3200 20. Google Glass (2016). [Online]. Available from: https://en.wikipedia.org/wiki/Google_Glass 21. Microsoft Hololens (2016) [Online]. Available from: https://www.microsoft.com/microsofthololens/en-us 22. AirPods [Online] Available from: https://en.wikipedia.org/wiki/AirPods 23. Huawei FreeBuds 3, Huawei, China, [Online]. Available from: https://consumer.huawei.com/ en/audio/freebuds3/ 24. Yeem S, Heo J, Kim H, Kwon Y (2019) Technical analysis of exoskeleton robot. World J Eng Technol 07:68–79. https://doi.org/10.4236/wjet.2019.71004 25. BLEEX, Berkeley Robotics„ Human Engineering Laboratory, USA, [Online]. Available from: https://bleex.me.berkeley.edu/research/exoskeleton/bleex/ 26. Song G, Huang R, Qiu J, Cheng H, Fan S (2020) Model-based control with interaction predicting for human-coupled lower exoskeleton systems. J Intell Robot Syst. https://doi.org/ 10.1007/s10846-020-01200-5 27. He W, Li Z, Dong Y, Zhao T (2018) Design and adaptive control for an upper limb robotic exoskeleton in presence of input saturation. IEEE Trans Neural Netw Learn Syst, 1–12. https://doi.org/10.1109/TNNLS.2018.2828813 28. Garcia E, Sater J, Main J (2002) Exoskeletons for human performance augmentation (EHPA): A program summary. J Robot Soc Jpn 20:822–826. https://doi.org/10.7210/jrsj.20.822 29. XOS 2, Raytheon, USA, [Online]. 
Available from: https://www.army-technology.com/ projects/raytheon-xos-2-exoskeleton-us/ 30. HULC, Berkeley Robotics & Human Engineering Laboratory, USA, [Online]. Avaliable: https://bleex.me.berkeley.edu/research/exoskeleton/hulc/
31. Skelex 360, Skelex, Netherlands, [Online]. Available from: https://www.skelex.com/skelex360/ 32. Rex, Rex-Bionics, [Online]. Available from: https://www.rexbionics.com/ 33. Rewalk, Rewalk, [Online]. Available from: https://rewalk.com/ 34. HOCOMA, HOCOMA, [Online]. Available from: https://www.hocoma.com/ 35. AiLegs, AiWalker, AI-Robotics, China, [Online]. Available from: https://www.ai-robotics.cn/ 36. Wearable Robotics and Autonomous unmanned Systems Laboratory, China, [Online]. Available from: http://wearablerobotics.ustc.edu.cn/direction/exoskeletonrobot/ 37. Athos, Athos, [Online]. Available from: https://www.liveathos.com/ 38. LUMO Run, Lumobodytech, [Online]. Available from: https://www.lumobodytech.com/ 39. HOVR, Underarmour, USA, [Online]. Available from: https://www.underarmour.com/ 40. Nike Adapt, Nike, USA, [Online]. Available from: https://www.nike.com/ 41. Fang B, Sun F, Liu H, Tan C, Guo D (2019) A glove-based system for object recognition via visual-tactile fusion. Science China Inf Sci 62. https://doi.org/10.1007/s11432-018-9606-6 42. Fang B, Sun F, Liu H, Liu C (2017) 3D human gesture capturing and recognition by the IMMU-based data glove. Neurocomputing 277. https://doi.org/10.1016/j.neucom.2017.02. 101 43. Field M, Pan Z, Stirling D, Naghdy F (2011) Human motion capture sensors and analysis in robotics. Ind Robot 38:163–171. https://doi.org/10.1108/01439911111106372 44. Jiménez A, Seco F, Prieto C, Guevara J (2009) A comparison of Pedestrian Dead-Reckoning algorithms using a low-cost MEMS IMU. In: IEEE international symposium on intelligent signal processing, Budapest, 2009, pp 37–42. https://doi.org/10.1109/WISP.2009.5286542 45. Zhou S, Fei F, Zhang G, Mai J, Liu Y, Liou J, Li W (2014) 2D human gesture tracking and recognition by the fusion of MEMS inertial and vision sensors. IEEE Sensors J 14:1160– 1170. https://doi.org/10.1109/JSEN.2013.2288094 46. Lin B-S, Hsiao P-C, Yang S-Y, Su C-S, Lee I-J (2017) Data glove system embedded with inertial measurement units for hand function evaluation in stroke patients. IEEE Trans Neural Syst Rehabil Eng 1–1. https://doi.org/10.1109/TNSRE.2017.2720727 47. O’Reilly M, Caulfield B, Ward T, Johnston W, Doherty C (2018) Wearable inertial sensor systems for lower limb exercise detection and evaluation: A systematic review. Sports Med 48. https://doi.org/10.1007/s40279-018-0878-4 48. Fang B, Sun F, Liu H, Liu C (2017) 3D human gesture capturing and recognition by the IMMU-based data glove. Neurocomputing 277. https://doi.org/10.1016/j.neucom.2017.02. 101 49. Lin Bor, Lee I-J, Hsiao P, Yang S, Chou W (2014) Data glove embedded with 6-DOF inertial sensors for hand rehabilitation. In: 2014 Tenth international conference on intelligent information hiding and multimedia signal processing, Kitakyushu, 2014, pp 25–28 50. Cavallo F, Esposito D, Rovini E, Aquilano M, Carrozza MC, Dario P, Maremmani C, Bongioanni P (2013) Preliminary evaluation of SensHand V1 in assessing motor skills performance in Parkinson disease. In: IEEE international conference on rehabilitation robotics: [proceedings] 2013, pp 1–6. https://doi.org/10.1109/ICORR.2013.6650466 51. Zhang Z-Q, Yang G-Z (2014) Calibration of miniature inertial and magnetic sensor units for robust attitude estimation. IEEE Trans Instrum Meas 63:711–718. https://doi.org/10.1109/ TIM.2013.2281562 52. Magalhães F, Vannozzi G, Gatta G, Fantozzi S (2014) Wearable inertial sensors in swimming motion analysis: A systematic review. J Sports Sci 33. https://doi.org/10.1080/02640414. 2014.962574 53. 
Kortier H, Schepers M, Veltink P (2014) On-body inertial and magnetic sensing for assessment of hand and finger kinematics. In: Proceedings of the IEEE RAS and EMBS international conference on biomedical robotics and biomechatronics, pp 555–560. https:// doi.org/10.1109/BIOROB.2014.6913836 54. Bai L, Pepper MG, Yan Y, Spurgeon SK, Sakel M, Phillips M (2014) Quantitative assessment of upper limb motion in neurorehabilitation utilizing inertial sensors. IEEE Trans Neural Syst Rehabil Eng 23. https://doi.org/10.1109/TNSRE.2014.2369740
55. De Agostino M, Manzino A, Piras M (2010) Performances comparison of different MEMSbased IMUs. In: Record - IEEE PLANS, position location and navigation symposium, pp 187–201. https://doi.org/10.1109/PLANS.2010.5507128 56. Lüken M, Misgeld B, Rüschen D, Leonhardt S (2015) Multi-sensor calibration of lowcost magnetic, angular rate and gravity systems. Sensors 15:25919–25936. https://doi.org/ 10.3390/s151025919 57. Nguyen A, Banic A (2014) 3DTouch: A wearable 3D input device with an optical sensor and a 9-DOF inertial measurement unit. Computer Science, 2014. arXiv:1406.5581 58. Pathak V, Mongia S, Chitranshi G (2015) A framework for hand gesture recognition based on fusion of Flex, Contact and accelerometer sensor. In: 2015 third international conference on image information processing (ICIIP), Waknaghat, 2015, pp 312–319. https://doi.org/10. 1109/ICIIP.2015.7414787 59. Shen Z, Yi J, Li X, Lo M, Chen M, Hu Y, Wang Z (2016) A soft stretchable bending sensor and data glove applications. Robotics and Biomimetics. 3. https://doi.org/10.1186/s40638016-0051-1 60. Prituja AV, Banerjee H (2018) Electromagnetically enhanced soft and flexible bend sensor: A quantitative analysis with different cores. IEEE Sensors J, 1–1. https://doi.org/10.1109/JSEN. 2018.2817211 61. Ramakant, Noor-e-Karishma S, Lathasree V (2015) Sign language recognition through fusion of 5DT data glove and camera based information. In: IEEE international advance computing conference, IACC 2015, pp 639–643. https://doi.org/10.1109/IADCC.2015.7154785 62. Conn M, Sharma S (2016) Immersive telerobotics using the oculus rift and the 5DT ultra data glove. In: International conference on collaboration technologies and systems (CTS), Orlando, FL, 2016, pp 387–391. https://doi.org/10.1109/CTS.2016.0075 63. Saggio G (2011) Bend sensor arrays for hand movement tracking in biomedical systems. In: Proceedings of the 4th IEEE international workshop on advances in sensors and interfaces, IWASI 2011. https://doi.org/10.1109/IWASI.2011.6004685 64. Kim H, Park H, Lee W, Kim J, Park Y-L (2017) Design of wearable orthopedic devices for treating forward head postures using pneumatic artificial muscles and flex sensors. In: International conference on ubiquitous robots and ambient intelligence (URAI), Jeju, 2017, pp 809–814. https://doi.org/10.1109/URAI.2017.7992831 65. Ponraj G, Ren H (2018) Sensor fusion of leap motion controller and flex sensors using Kalman filter for human finger tracking. IEEE Sensors J, 1–1. https://doi.org/10.1109/JSEN.2018. 2790801 66. Ajiboye A, Weir R (2005) A heuristic fuzzy logic approach to EMG pattern recognition for multifunctional prosthesis control. IEEE Trans Neural Syst Rehabil Eng Publ IEEE Eng Med Biol Soc 13:280–91. https://doi.org/10.1109/TNSRE.2005.847357 67. Wu J, Sun L, Jafari R (2016) A wearable system for recognizing american sign language in real-time using IMU and surface EMG sensors. IEEE J Biomed Health Inform 20:1–1. https:// doi.org/10.1109/JBHI.2016.2598302 68. Minati L, Yoshimura N, Koike Y (2017) Hybrid control of a vision-guided robot arm by EOG, EMG, EEG biosignals and head movement acquired via a consumer-grade wearable device. In: IEEE access, PP. 1–1. https://doi.org/10.1109/ACCESS.2017.2647851 69. Zhang X, Xiang C, Li Y, Lantz V, Wang K, Yang J (2011) A framework for hand gesture recognition based on accelerometer and EMG sensors. IEEE Trans Syst Man Cybern A Syst Humans 41:1064–1076. https://doi.org/10.1109/TSMCA.2011.2116004 70. 
Ju Z, Liu H (2013) Human hand motion analysis with multisensory information. IEEE/ASME Trans Mechatron 19. https://doi.org/10.1109/TMECH.2013.2240312 71. Leitner J, Luciw M, Forster ¨ A, Schmidhuber J (2014) Teleoperation of a 7 DOF humanoid robot arm using human arm accelerations and EMG signals. In: International symposium on artificial intelligence, robotics and automation in space, 2014 72. Buczek F, Sinsel E, Gloekler D, Wimer B, Warren C, Wu J (2011) Kinematic performance of a six degree-of-freedom hand model (6DHand) for use in occupational biomechanics. J Biomech 44:1805–9. https://doi.org/10.1016/j.jbiomech.2011.04.003
73. Suau X, Alcoverro M, Lopez-M ´ e´ ndez A, Ruiz-Hidalgo J, Casas J (2014) Real-time fingertip localization conditioned on hand gesture classification. Image Vis Comput 32. https://doi.org/ 10.1016/j.imavis.2014.04.015 74. Bianchi M, Salaris P, Bicchi A (2012) Synergy-based optimal design of hand pose sensing. In: IEEE/RSJ international conference on intelligent robots and systems, pp 3929–3935. https:// doi.org/10.1109/IROS.2012.6385933 75. Regazzoni D, De Vecchi G, Rizzi C (2014) RGB cams vs RGB-D sensors: Low cost motion capture technologies performances and limitations. J Manuf Syst 33. https://doi.org/10.1016/ j.jmsy.2014.07.011 76. Palacios-Gasos J, Sagues C, Montijano E, Llorente S (2013) Human-computer interaction based on hand gestures using RGB-D sensors. Sensors 13:11842–60. https://doi.org/10.3390/ s130911842 77. Tang G, Asif S, Webb P (2015) The integration of contactless static pose recognition and dynamic hand motion tracking control system for industrial human and robot collaboration. Ind Robot 42. https://doi.org/10.1108/IR-03-2015-0059 78. Lee M, Nicholls H (1998) Tactile sensing for mechatronics: A state of the art survey. Mechatronics 9:1–31 79. Sundaram S, Kellnhofer P, Li Y, Zhu J-Y, Torralba A, Matusik W (2019) Learning the signatures of the human grasp using a scalable tactile glove. Nature 569:698–702. https:// doi.org/10.1038/s41586-019-1234-z 80. Schätzle S, Ende T, Wüsthoff T, Preusche C (2010) VibroTac: An ergonomic and versatile usable vibrotactile feedback device, pp 705–710. https://doi.org/10.1109/ROMAN.2010. 5598694 81. Ma Z, Ben-Tzvi P (2014) RML glove - An exoskeleton glove mechanism with haptics feedback. IEEE/ASME Trans Mechatron 20. https://doi.org/10.1109/TMECH.2014.2305842 82. Sayeed S, Besar R, Kamel N (2006) Dynamic signature verification using sensor based data glove. In: 2006 8th international conference on signal processing, Beijing. https://doi.org/10. 1109/ICOSP.2006.345880 83. Gandarias J, Gomez-de-Gabriel J, Garcia A (2017) Human and object recognition with a high-resolution tactile sensor. In: 2017 IEEE sensors, Glasgow, 2017, pp 1–3. https://doi.org/ 10.1109/ICSENS.2017.8234203 84. Steffen J, Haschke R, Ritter H (2007) Experience-based and tactile-driven dynamic grasp control. In: IEEE international conference on intelligent robots and systems, pp 2938–2943. https://doi.org/10.1109/IROS.2007.4398960 85. Scheggi S, Morbidi F, Prattichizzo D (2014) Human-robot formation control via visual and vibrotactile haptic feedback. IEEE Trans Haptic 7:499–511. https://doi.org/10.1109/TOH. 2014.2332173 86. EBattaglia E, Bianchi M, Altobelli A, Grioli G, Catalano M, Serio A, Santello M, Bicchi A (2015) ThimbleSense: A fingertip-wearable tactile sensor for grasp analysis. IEEE Trans Haptic 9. https://doi.org/10.1109/TOH.2015.2482478 87. Huang C-Y, Sung W-L, Fang W (2017) Develop and implement a novel tactile sensor array with stretchable and flexible grid-like spring. In: IEEE sensors, pp 1–3. https://doi.org/10. 1109/ICSENS.2017.8233960 88. Ponraj G, Senthil Kumar K, Thakor Nv, Yeow RC-H, Kukreja S (2017) Development of flexible fabric based tactile sensor for closed loop control of soft robotic actuator. In: IEEE conference on automation science and engineering (CASE), pp 1451–1456. https://doi.org/ 10.1109/COASE.2017.8256308 89. Al Ahmad M, Ahmed S (2017) Heart-rate and pressure-rate determination using piezoelectric sensor from the neck. In: IEEE international conference on engineering technologies and applied sciences (ICETAS), pp 1–5. 
https://doi.org/10.1109/ICETAS.2017.8277911 90. Mohapatra P, Sp P, Sivaprakasam M (2017) A novel sensor for wrist based optical heart rate monitor. In: IEEE international instrumentation and measurement technology conference (I2MTC), pp 1–6. https://doi.org/10.1109/I2MTC.2017.7969842
91. Xin Q, Wu J (2017) A novel wearable device for continuous, non-invasion blood pressure measurement. Comput Biol Chem 69. https://doi.org/10.1016/j.compbiolchem.2017.04.011 92. Sheng T, Zhen F, Chen X, Zhao Z, Li J (2017) The design of wearable sleep apnea monitoring wrist watch. In: IEEE 19th international conference on e-health networking, applications and services (Healthcom), pp 1–6. https://doi.org/10.1109/HealthCom.2017.8210850 93. Wang D, Shen J, Mei L, Qian S, Li J, Hao J (2017) Performance investigation of a wearable distributed-deflection sensor in arterial pulse waveform measurement. IEEE Sensors J, 3994– 4004. https://doi.org/10.1109/JSEN.2017.2704903 94. Ge Q, Shao T, Duan Z, Wen C (2016) Performance analysis of the Kalman filter with mismatched noise covariances. IEEE Trans Autom Control 61:1–1. https://doi.org/10.1109/ TAC.2016.2535158 95. Bruckner H-P, Nowosielski R, Kluge H, Blume H (2013) Mobile and wireless inertial sensor platform for motion capturing in stroke rehabilitation sessions. In: IEEE international workshop on advances in sensors and interfaces IWASI, pp 14–19. https://doi.org/10.1109/ IWASI.2013.6576085 96. Zhou H, Hu H (2010) Reducing drifts in the inertial measurements of wrist and elbow positions. IEEE Trans Instrum Meas 59:575–585. https://doi.org/10.1109/TIM.2009.2025065 97. Kortier H, Antonsson J, Schepers M, Gustafsson F, Veltink P (2014) Hand pose estimation by fusion of inertial and magnetic sensing aided by a permanent magnet. IEEE Trans Neural Syst Rehabil Eng 23. https://doi.org/10.1109/TNSRE.2014.2357579 98. Mitra S, Acharya T (2007) Gesture recognition: A survey. IEEE Trans Syst Man Cybern C Appl Rev 37:311–324. https://doi.org/10.1109/TSMCC.2007.893280 99. Ruffaldi E, Peppoloni L, Filippeschi A, Avizzano C (2014) A novel approach to motion tracking with wearable sensors based on probabilistic graphical models. In: Proceedings IEEE international conference on robotics and automation, pp 1247–1252. https://doi.org/10. 1109/ICRA.2014.6907013 100. Cenedese A, Susto GA, Belgioioso G, Cirillo GI, Fraccaroli F (2015) Home automation oriented gesture classification from inertial measurements. IEEE Trans Autom Sci Eng 12:1200–1210. https://doi.org/10.1109/TASE.2015.2473659 101. Xu R, Zhou S, Li W (2012) MEMS accelerometer based nonspecific-user hand gesture recognition. IEEE Sensors J 12:1166–1173. https://doi.org/10.1109/JSEN.2011.2166953 102. Xue Y, Ju Z, Xiang K, Chen J, Liu H (2017) Multiple sensors based hand motion recognition using adaptive directed acyclic graph. Appl Sci 7:358. https://doi.org/10.3390/app7040358 103. Patel S, Lorincz K, Hughes R, Huggins N, Growdon J, Standaert D, Akay Y, Dy J, Welsh M, Bonato P (2009) Monitoring motor fluctuations in patients with Parkinson’s disease using wearable sensors. IEEE Trans Inf Technol Biomed Publ IEEE Eng Med Biol Soc 13:864–73. https://doi.org/10.1109/TITB.2009.2033471 104. Yap, Hong Kai„ Mao, Andrew„ Goh, James„ Yeow, Raye Chen-Hua. (2016). Design of a wearable FMG sensing system for user intent detection during hand rehabilitation with a soft robotic glove. In: IEEE international conference on biomedical robotics and biomechatronics (BioRob), pp 781–786. https://doi.org/10.1109/BIOROB.2016.7523722 105. Mijovic B, Popovic M, Popovic D (2008) Synergistic control of forearm based on accelerometer data and artificial neural networks. Braz J Med Biol Res 41(5):389–397. https://doi.org/ 10.1590/S0100-879X2008005000019 106. 
Blana D, Kyriacou T, Lambrecht J, Chadwick E (2015) Feasibility of using combined EMG and kinematic signals for prosthesis control: A simulation study using a virtual reality environment. J Electromyogr Kinesiol 8. https://doi.org/10.1016/j.jelekin.2015.06.010 107. Chen M, Alregib G, Juang B-H (2013) Feature processing and modeling for 6D motion gesture recognition. IEEE Trans Multimed 15:561–571. https://doi.org/10.1109/TMM.2012. 2237024
108. Hartmann B, Link N (2010) Gesture recognition with inertial sensors and optimized DTW prototypes. In: Conference proceedings - IEEE international conference on systems, man and cybernetics, pp 2102–2109. https://doi.org/10.1109/ICSMC.2010.5641703 109. Marqués G, Basterretxea K (2015) Efficient algorithms for accelerometer-based wearable hand gesture recognition systems. In: IEEE 13th international conference on embedded and ubiquitous computing, pp 132-139. https://doi.org/10.1109/EUC.2015.25 110. Zhu J, Pande A, Mohapatra P, Han J (2015) Using deep learning for energy expenditure estimation with wearable sensors. In: International conference on E-health networking, application & services (HealthCom), pp 501–506. https://doi.org/10.1109/HealthCom.2015. 7454554 111. Chou W, Fang B, Ding L, Ma X, Guo X (2013) Two-step optimal filter design for the low-cost attitude and heading reference systems. IET Sci Meas Technol 7:240–248. https://doi.org/10. 1049/iet-smt.2012.0100 112. Liu H, Wang L (2018) Gesture recognition for human-robot collaboration: A review. Int J Ind Ergon 68:355–367. https://doi.org/10.1016/j.ergon.2017.02.004 113. Bilal S, Akmeliawati R, Shafie AA, Salami M (2013) Hidden Markov model for human to computer interaction: A study on human hand gesture recognition. Artif Intell Rev 40. https:// doi.org/10.1007/s10462-011-9292-0 114. Lin J, Kulic D (2013) On-line segmentation of human motion for automated rehabilitation exercise analysis. IEEE Trans Neural Syst Rehabil Eng Publ IEEE Eng Med Biol Soc 22. https://doi.org/10.1109/TNSRE.2013.2259640 115. Ploetz T, Guan Y (2018) Deep learning for human activity recognition in mobile computing. Computer 51:50–59. https://doi.org/10.1109/MC.2018.2381112 116. Bruckner H-P, Nowosielski R, Kluge H, Blume H (2013) Mobile and wireless inertial sensor platform for motion capturing in stroke rehabilitation sessions. In: IEEE international workshop on advances in sensors and interfaces IWASI, pp 14–19. https://doi.org/10.1109/ IWASI.2013.6576085 117. Horie R, Nawa R (2017) A hands-on game by using a brain-computer interface, an immersive head mounted display, and a wearable gesture interface. In: IEEE 6th global conference on consumer electronics (GCCE), pp 1–5. https://doi.org/10.1109/GCCE.2017.8229324 118. Martinez J, García A, Oliver M, Molina M, Jose P, González P (2016) Identifying 3D geometric shapes with a vibrotactile glove. IEEE Comput Graph Appl 36:42–51. https://doi. org/10.1109/MCG.2014.81 119. Gradl S, Wirth M, Zillig T, Eskofier B (2018) Visualization of heart activity in virtual reality: A biofeedback application using wearable sensors. In: IEEE 15th international conference on wearable and implantable body sensor networks (BSN), pp 152–155. https://doi.org/10.1109/ BSN.2018.8329681 120. Chen J, Haas E, Barnes M (2007) Human performance issues and user interface design for teleoperated robots. IEEE Trans Syst Man Cybern C Appl Rev 37:1231–1245. https://doi.org/ 10.1109/TSMCC.2007.905819 121. Bolopion A, Régnier S (2013) A review of haptic feedback teleoperation systems for micromanipulation and microassembly. IEEE Trans Autom Sci Eng 10:496–502. https://doi. org/10.1109/TASE.2013.2245122 122. Troy J, Erignac C, Murray P (2009) Haptics-enabled UAV teleoperation using motion capture systems. J Comput Inf Sci Eng JCISE 9. https://doi.org/10.1115/1.3072901 123. Talasaz A, Patel R, Naish M (2010) Haptics-enabled teleoperation for robot-assisted tumor localization. 
In: Proceedings - IEEE international conference on robotics and automation, pp 5340–5345. https://doi.org/10.1109/ROBOT.2010.5509667 124. Kofman J, Wu X, Luu T, Verma S (2005) Teleoperation of a robot manipulator using a vision-based human-robot interface. IEEE Trans Ind Electron 52:1206–1219. https://doi.org/ 10.1109/TIE.2005.855696
125. Wu J, Tian Z, Sun L, Estevez L, Jafari R (2015) Real-time American sign language recognition using wrist-worn motion and surface EMG sensors. In: IEEE 12th international conference on wearable and implantable body sensor networks (BSN), pp 1–6. https://doi. org/10.1109/BSN.2015.7299393 126. Singh A, John B, Subramanian S, Kumar A, Nair B (2016) A low-cost wearable Indian sign language interpretation system. In: International conference on robotics and automation for humanitarian applications (RAHA), pp 1–6. https://doi.org/10.1109/RAHA.2016.7931873 127. Lee BG, Lee S (2017) Smart wearable hand device for sign language interpretation system with sensors fusion. IEEE Sensors J 1–1. https://doi.org/10.1109/JSEN.2017.2779466 128. Wu J, Sun L, Jafari R (2016) A wearable system for recognizing American sign language in real-time using IMU and surface EMG sensors. IEEE J Biomed Health Inform 20:1–1. https:// doi.org/10.1109/JBHI.2016.2598302 129. Kanwal K, Abdullah S, Ahmed Y, Saher ’l Y, Raza A (2014) Assistive glove for Pakistani sign language translation Pakistani sign language translator. In: IEEE international multi topic conference 2014, pp 173–176. https://doi.org/10.1109/INMIC.2014.7097332 130. Madushanka ALP, Senevirathne RGDC, Wijesekara LMH, Arunatilake SMKD, Sandaruwan KD (2016) Framework for Sinhala sign language recognition and translation using a wearable armband. In: International conference on advances in ICT for emerging regions (ICTer), pp 49–57. https://doi.org/10.1109/ICTER.2016.7829898 131. Barshan B, Yuksek M (2013) Recognizing daily and sports activities in two open source machine learning environments using body-worn sensor units. Comput J 57:1649–1667. https://doi.org/10.1093/comjnl/bxt075 132. Zhu C, Sheng W (2011) Wearable sensor-based hand gesture and daily activity recognition for robot-assisted living. IEEE Trans Syst Man Cybern A 41:569–573. https://doi.org/10.1109/ TSMCA.2010.2093883 133. Amft O, Junker H, Troster G (2005) Detection of eating and drinking arm gestures using inertial body-worn sensors. Wearable computers, 2005. In: Proceedings ninth IEEE international symposium on wearable computers, pp 160–163. https://doi.org/10.1109/ISWC. 2005.17 134. Daponte P, De Vito L, Riccio M, Sementa C (2014) Design and validation of a motiontracking system for ROM measurements in home rehabilitation. Measurement 55:82–C96. https://doi.org/10.1016/j.measurement.2014.04.021 135. O’Flynn B, Torres Sanchez J, Connolly J, Condell J, Curran K, Gardiner P (2013) Novel smart sensor glove for arthritis rehabiliation. In: IEEE international conference on body sensor networks, Cambridge, MA, USA, 2013, pp 1–6. https://doi.org/10.1109/BSN.2013.6575529 136. Lee WW, Yen S-C, Tay E, Zhao Z, Xu T, Ling K, Ng Y-S, Chew E, Cheong A, Huat G (2014) A smartphone-centric system for the range of motion assessment in stroke patients. IEEE J Biomed Health Inform 18:1839–1847. https://doi.org/10.1109/JBHI.2014.2301449 137. Kuklinski K, Fischer K, Marhenke I, Kirstein F, aus der Wieschen M, Solvason D, Krüger N, Savarimuthu T (2015) Teleoperation for learning by demonstration: Data glove versus object manipulation for intuitive robot control. In: International congress on ultra modern telecommunications and control systems and workshops, 2015, pp 346–351. https://doi.org/ 10.1109/ICUMT.2014.7002126 138. Wei X, Sun F, Yu Y, Liu C, Fang B, Jing M (2017) Robotic skills learning based on dynamical movement primitives using a wearable device. 
In: IEEE international conference on robotics and biomimetics (ROBIO), Macau, 2017, pp 756–761. https://doi.org/10.1109/ROBIO.2017. 8324508 139. Liarokapis M, Artemiadis P, Kyriakopoulos K (2013) Mapping human to robot motion with functional anthropomorphism for teleoperation and telemanipulation with robot arm hand systems. In: Proceedings of the 2013 IEEE/RSJ international conference on intelligent robots and systems. https://doi.org/10.1109/IROS.2013.6696638
140. Kobayashi F, Kitabayashi K, Nakamoto H, Kojima F, Fukui W, Imamura N, Maeda T (2012) Multiple joints reference for robot finger control in robot hand teleoperation. In: 2012 IEEE/SICE international symposium on system integration, SII 2012, pp 577–582. https:// doi.org/10.1109/SII.2012.6427360 141. Hu H, Li J, Xie Z, Wang B, Liu H, Hirzinger G (2005) A robot arm/hand teleoperation system with telepresence and shared control. In: IEEE/ASME international conference on advanced intelligent mechatronics, AIM vol. 2, pp 1312–1317. https://doi.org/10.1109/AIM. 2005.1511192 142. Kuklinski K, Fischer K, Marhenke I, Kirstein F, aus der Wieschen M, Solvason D, Krüger N, Savarimuthu T (2015) Teleoperation for learning by demonstration: Data glove versus object manipulation for intuitive robot control. In: International congress on ultra modern telecommunications and control systems and workshops, 2015, pp 346–351. https://doi.org/ 10.1109/ICUMT.2014.7002126 143. Ekvall S, Kragic D (2004) Interactive grasp learning based on human demonstration. In: Proceedings - IEEE international conference on robotics and automation, vol 4, pp 3519– 3524. https://doi.org/10.1109/ROBOT.2004.1308798 144. Moore B, Oztop E (2012) Robotic grasping and manipulation through human visuomotor learning. Robot Auton Syst 60:441–451. https://doi.org/10.1016/j.robot.2011.09.002
Part II
Wearable Technology
This part of the book focuses on wearable technologies and comprises three chapters. In Chap. 2, two wearable sensors, inertial sensors and tactile sensors, are introduced, and calibration methods are developed to determine their performance. This chapter serves as the basis for Chap. 3, which designs the wearable device and solves the motion capture problem. Finally, the applications of the developed wearable devices are described in Chap. 4. This part therefore presents a clear path from wearable sensing to wearable demonstration.
Chapter 2
Wearable Sensors
Abstract Wearable sensors are the basis of wearable devices. In this chapter, the two most popular wearable sensors, the inertial sensor and the tactile sensor, are introduced. The principle of the inertial sensor is described, and optimal calibration methods are employed to improve its performance. Then, two tactile sensors with different sensing mechanisms are introduced. Calibration experiment results are presented at the end.
2.1 Inertial Sensor

The availability of MEMS technology enables inertial sensors to be integrated into a single chip, which has made them suitable for wearable devices. MEMS-based inertial sensors are lightweight, good for fast motion tracking, and can cover a large sensing range [1, 2]. They are also independent of any infrastructure and do not require an external source (they are completely self-contained) [3, 4]. MEMS-based inertial sensors are easy to integrate and allow motion capture with little limitation on position [5, 6]. However, the performance of MEMS inertial sensors is degraded by fabrication defects, including asymmetric structures, misalignment of the actuation mechanism, and deviations of the center of mass from the geometric center. In addition, there are inner factors causing output errors, which are commonly divided into deterministic and random errors. A MEMS inertial measurement unit (MIMU) assembled from MEMS inertial sensors may further contain IMU package misalignment errors and IMU sensor-to-sensor misalignment errors. Therefore, it is essential to develop an effective calibration method to reduce these errors and increase the MIMU's precision and stability.

The output of the acceleration measured by the accelerometer can be described by the following equation:

$$V_a = K_a a + b_a + \varepsilon_a \tag{2.1}$$

where $V_a$ is the voltage of acceleration, $a$ is the true acceleration, $b_a$ is the sensor bias, $K_a$ is the scale factor (or acceleration gain), and $\varepsilon_a$ is the sensor's noise.
A similar equation describes the angular velocity measured by the gyro on a single axis:

$$V_\omega = K_\omega \omega + b_\omega + \varepsilon_\omega \tag{2.2}$$

where $V_\omega$ is the voltage of angular velocity, $\omega$ is the true angular velocity, $b_\omega$ is the sensor bias, $K_\omega$ is the scale factor (or angular velocity gain), and $\varepsilon_\omega$ is the sensor's noise. Although the parameters of the MEMS gyros and accelerometers are given in their manuals, the actual values of the MEMS sensors deviate in operation. The errors of parameters such as biases and scale factors are deterministic errors, which can be reduced by calibration. In addition, the random errors are measurement noises, which should not be ignored in calibration and should also be characterized.
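As an illustration only (not the authors' implementation), the following Python sketch simulates single-axis accelerometer and gyro outputs according to Eqs. 2.1 and 2.2; the scale factors, biases, and noise levels are assumed values chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sensor_output(true_signal, scale, bias, noise_std):
    """Single-axis sensor model V = K*x + b + eps (cf. Eqs. 2.1 and 2.2)."""
    noise = rng.normal(0.0, noise_std, size=np.shape(true_signal))
    return scale * np.asarray(true_signal) + bias + noise

# Assumed illustrative parameters: a static accelerometer axis aligned with
# gravity and a static gyro axis.
a_true = np.full(1000, 9.81)      # true specific force (m/s^2)
w_true = np.zeros(1000)           # true angular rate (deg/s)
V_a = sensor_output(a_true, scale=0.08, bias=1.8, noise_std=0.01)
V_w = sensor_output(w_true, scale=5.0, bias=2.5, noise_std=0.05)
print(V_a.mean(), V_w.mean())     # mean outputs reflect K*x + b
```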
2.1.1 Analysis of Measurement Noise

Several variance techniques have been developed for stochastic modeling of random errors. They are basically very similar and differ primarily in the signal-processing methods incorporated into the analysis in order to improve the model parameters. The simplest one is the Allan variance. The Allan variance averages the root mean square (RMS) of the random drift error over a period of time. It is simple to compute and relatively simple to interpret and understand. The Allan-variance method can be used to determine the characteristics of the underlying random processes that introduce the data noise, and to characterize the various error terms of the inertial-sensor data over the entire length of the data record.

Assume that there are N consecutive data points with a sample time of $t_0$, and that each cluster contains $n$ ($n < N/2$) consecutive data points. Associated with each cluster is a time $T = n t_0$. If the instantaneous output rate of the inertial sensor is $\Theta(t)$, the cluster average is defined as

$$\bar{\Theta}_k(T) = \frac{1}{T}\int_{t_k}^{t_k+T}\Theta(t)\,dt \tag{2.3}$$

where $\bar{\Theta}_k(T)$ represents the average of the output rate for a cluster that starts from the kth data point and contains n data points. The subsequent cluster average is defined as

$$\bar{\Theta}_{\text{next}}(T) = \frac{1}{T}\int_{t_{k+1}}^{t_{k+1}+T}\Theta(t)\,dt \tag{2.4}$$

where $t_{k+1} = t_k + T$.
The difference between the averages of two adjacent clusters forms

$$\xi_{k+1,k} = \bar{\Theta}_{\text{next}}(T) - \bar{\Theta}_k(T) \tag{2.5}$$

For each cluster time T, the ensemble of ξ values defined by Eq. 2.5 forms a set of random variables. We are interested in the variance of the ξ values over all clusters of the same size formed from the entire data record. Thus, the Allan variance of length T is defined as

$$\sigma^2(T) = \frac{1}{2(N-2n)}\sum_{k=1}^{N-2n}\left[\bar{\Theta}_{\text{next}}(T)-\bar{\Theta}_k(T)\right]^2 \tag{2.6}$$
Obviously, for any finite number of data points N, only a finite number of clusters of a fixed length T can be formed. Hence, Eq. 2.6 represents an estimate of $\sigma^2(T)$ whose quality depends on the number of independent clusters of that length. The result is normally plotted as the square root of the Allan variance, $\sigma(T)$, versus T on a log–log plot. The percentage error of the estimate is

$$\sigma(\delta) = \frac{1}{\sqrt{2(N/n-1)}} \tag{2.7}$$

where N is the total number of data points and n is the number of data points contained in one cluster. Equation 2.7 shows that the estimation errors in the region of short cluster length T are small because the number of independent clusters in this region is large. On the contrary, the estimation errors in the region of long cluster length T are large because the number of independent clusters in this region is small. To avoid the estimation error increasing due to the lack of samples, the overlapping Allan variance was proposed [7]. The use of overlapping samples improves the confidence of the stability estimate. The formula is expressed as follows:

$$\sigma^2(T) = \frac{1}{2m^2(N-2m+1)}\sum_{j=1}^{N-2m+1}\left\{\sum_{i=j}^{j+m-1}\left[\bar{\Theta}_{i+m}(T)-\bar{\Theta}_i(T)\right]\right\}^2 \tag{2.8}$$

where m is the averaging factor. As an example, static data from the MIMU are collected at a sampling rate of 1000 Hz at a room temperature of 25 °C. Applying the overlapping Allan-variance method to the whole dataset gives the log–log plots of the overlapping Allan standard deviation versus the cluster time shown in Fig. 2.1 for the accelerometer data and Fig. 2.2 for the gyro data. As shown in these figures, it takes about 20 s for the variance to converge, which means that the bias drifts of the accelerometers and gyros are very noisy during this period.
Fig. 2.1 Accelerometers overlapping Allan-variance results
Fig. 2.2 Gyros overlapping Allan-variance results
This implies that the sensor bias should be averaged over a period of at least 20 s so that the average bias will not change significantly in the next 20-s interval. Within a 20-s period, the random error can be considered white noise. Processing the sensor signals over a period of time determined from the Allan-variance analysis keeps the bias drift minimal during the following period, when the calibration data are collected. Only this white noise is considered in the calibration.
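As a concrete illustration of Eq. 2.8 (a minimal sketch, not the authors' code), the following Python function computes the overlapping Allan deviation of a static sensor record; the sampling rate and the list of averaging factors are inputs chosen by the user.

```python
import numpy as np

def overlapping_allan_deviation(y, fs, m_list):
    """Overlapping Allan deviation of a raw rate/acceleration record (Eq. 2.8).

    y      : 1-D array of sensor samples taken at rate fs (Hz)
    m_list : iterable of averaging factors m (cluster time T = m / fs)
    Returns arrays of cluster times and Allan deviations.
    """
    y = np.asarray(y, dtype=float)
    N = y.size
    taus, adev = [], []
    for m in m_list:
        if 2 * m >= N:
            break
        diffs = y[m:] - y[:-m]                    # y[i+m] - y[i] for every i
        csum = np.cumsum(np.concatenate(([0.0], diffs)))
        inner = csum[m:] - csum[:-m]              # inner sum over i = j .. j+m-1
        var = np.sum(inner ** 2) / (2.0 * m ** 2 * (N - 2 * m + 1))
        taus.append(m / fs)
        adev.append(np.sqrt(var))
    return np.array(taus), np.array(adev)

# usage with a static gyro record sampled at 1000 Hz:
# taus, adev = overlapping_allan_deviation(gyro_z, 1000.0,
#                                          np.unique(np.logspace(0, 5, 60).astype(int)))
```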
2.1.2 Calibration Method

Calibration is the process of comparing the instrument output with known reference information and determining the coefficients that force the output to agree with the reference information over a range of output values [7]. Many calibration methods for IMUs have been developed. The method most commonly used for the calibration of MEMS accelerometers is the six-position method [7], which requires the inertial system to be mounted on a leveled surface with each sensitivity axis of each sensor pointing alternately up and down. This method can determine the biases and scale factors of the sensors, but it cannot estimate the axis misalignments or non-orthogonalities. The multi-position calibration method was proposed to detect more errors [8]. These methods rely on the Earth's gravity as a stable physical calibration reference. Furthermore, special apparatus such as motion-sensing equipment [9] or a robotic arm [10, 11] has been used for calibration. For low-cost MEMS gyroscopes, the Earth's rotation rate is below their resolution. Therefore, the traditional calibration method depends on a mechanical platform that rotates the IMU into different predefined, precisely controlled orientations at fixed angular rates. Such a method is primarily designed for in-lab tests and often requires expensive equipment [7]. Consequently, calibration methods that do not require a mechanical platform have been proposed. In [12], an optical tracking system is used. In [13], an affordable three-axis rotating platform is designed for the calibration. Meanwhile, schemes for in-field user calibration without external equipment have been proposed. Fong et al. [14] calibrated gyroscopes by comparing the outputs of the accelerometer and the IMU orientation-integration algorithm after arbitrary motions, which requires an initial rough estimate of the gyroscope parameters. Jurman et al. [15] and Hwangbo et al. [16] proposed shape-from-motion calibration methods with a magnitude constraint on the motion. Furthermore, the calculation algorithm is another important issue, since the calibration parameters are computed by the algorithm. The least-squares method is commonly used in scalar calibration to estimate the calibration parameters [8, 12, 17]. Zhang et al. [17] implemented an optimal calibration scheme by maximizing the sensitivity of the measurement norms with respect to the calibration parameters. Such algorithms typically lead to a biased estimate of the calibration parameters and may give non-optimal estimates of the calibration coefficients. To avoid this, Panahandeh et al. [18] solved the identification problem using the maximum likelihood estimation (MLE) framework, but only simulation results were presented. In addition, the biases and scale factors vary with temperature, so thermal calibration is an indispensable process for the MIMU.

Before describing our calibration method, a clear concept of the calibration method is introduced. The calibration method of an IMU is composed of two aspects: the calibration scheme and the calibration algorithm. The scheme designs the experiments, and the algorithm computes the parameters from the experimental data. A novel calibration method is presented in this section according to this concept. First, the complete calibration model of the MIMU is described.
Then the optimal calibration method is developed through an optimal calibration scheme and an optimal calibration algorithm. The scheme is the design of the calibration experiment for the gyroscope triad and the accelerometer triad, and the calibration results are computed by the optimal algorithm. In the end, the thermal calibration is described.

(1) Calibration Model

The measurement model of the accelerometer or gyroscope generally includes bias, scale factor, and random error [19]. Bias and scale factor are considered the most significant parameters, and they change with temperature. Temperature dependence is a very important characteristic of inertial sensors, but it is dealt with by an independent thermal calibration, as explained at the end of this section. Here it is assumed that the calibration procedure is performed at a stable temperature and that self-heating effects can be ignored, because the MIMU is calibrated after the sensors have warmed up to thermal stability. Moreover, the random error can be considered white Gaussian noise, as shown in the last section.

In this section, the MIMU consists of three almost orthogonally mounted gyroscopes and a three-axis accelerometer. As a unit, misalignment errors are induced by the installation of the sensors. The misalignment error is divided into two types: package misalignment error and sensor-to-sensor misalignment error. Package misalignment error is defined as the angle between the true axis of sensitivity and the body axis of the package. Sensor-to-sensor misalignment error is the misalignment due to the non-orthogonality of the IMU axes.

The package misalignment can be defined by three angles. First, a rotation about the z-axis by an angle $\theta_z$ is represented by

$$C_1 = \begin{bmatrix} \cos\theta_z & \sin\theta_z & 0 \\ -\sin\theta_z & \cos\theta_z & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2.9}$$

Second, a rotation about the y-axis by an angle $\theta_y$ is represented by

$$C_2 = \begin{bmatrix} \cos\theta_y & 0 & -\sin\theta_y \\ 0 & 1 & 0 \\ \sin\theta_y & 0 & \cos\theta_y \end{bmatrix} \tag{2.10}$$

Finally, a rotation about the x-axis by an angle $\theta_x$ is represented by

$$C_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta_x & -\sin\theta_x \\ 0 & \sin\theta_x & \cos\theta_x \end{bmatrix} \tag{2.11}$$
The three angles can be considered small. Then the package misalignment can be represented by

$$M_P = C_1 C_2 C_3 \approx \begin{bmatrix} 1 & -\theta_z & \theta_y \\ \theta_z & 1 & -\theta_x \\ -\theta_y & \theta_x & 1 \end{bmatrix} \tag{2.12}$$

The sensor-to-sensor misalignment error, or non-orthogonality, can also be described by a rotation matrix that relates the misaligned axes to those of a perfectly orthogonal triad. Assuming that the x-axes of the two frames coincide and that the angles are small, the non-orthogonality matrix can be approximated by

$$M_o = \begin{bmatrix} 1 & 0 & 0 \\ \beta_{yz} & 1 & 0 \\ -\beta_{zy} & \beta_{zx} & 1 \end{bmatrix} \tag{2.13}$$

Then the total misalignment can be written as

$$M = M_o M_P = \begin{bmatrix} 1 & -\theta_z & \theta_y \\ \beta_{yz}+\theta_z & 1-\beta_{yz}\theta_z & \beta_{yz}\theta_y-\theta_x \\ -\beta_{zy}+\beta_{zx}\theta_z-\theta_y & \beta_{zy}\theta_z+\beta_{zx}+\theta_x & -\beta_{zy}\theta_y-\beta_{zx}\theta_x+1 \end{bmatrix} \approx \begin{bmatrix} 1 & -\theta_z & \theta_y \\ \beta_{yz}+\theta_z & 1 & -\theta_x \\ -\beta_{zy}-\theta_y & \beta_{zx}+\theta_x & 1 \end{bmatrix} \tag{2.14}$$

Consequently, the measurement of the accelerometer cluster can be expressed as

$$a_c = K_a M_a a_g + b_a + v_a \tag{2.15}$$
where $a_g$ denotes the input specific force expressed in platform coordinates and $v_a$ is the measurement noise. The scale factor matrix and bias vector of the accelerometer are defined, respectively, as

$$K_a = \mathrm{diag}(k_x^a, k_y^a, k_z^a), \qquad b_a = \begin{bmatrix} b_x^a & b_y^a & b_z^a \end{bmatrix}^T$$

where $k_i^a$ and $b_i^a$ denote, respectively, the scale factor and the bias of the ith accelerometer output, $i = x, y, z$.
The misalignment matrix of the accelerometer is written as

$$M_a = \begin{bmatrix} 1 & -{}^a\theta_z & {}^a\theta_y \\ {}^a\beta_{yz}+{}^a\theta_z & 1 & -{}^a\theta_x \\ -{}^a\beta_{zy}-{}^a\theta_y & {}^a\beta_{zx}+{}^a\theta_x & 1 \end{bmatrix} \tag{2.16}$$

Analogously, the measurement of the gyroscope cluster can be written as

$$\omega_c = K_g M_g \omega_g + b_g + v_g \tag{2.17}$$

where $\omega_g$ denotes the true platform angular velocity with respect to the inertial coordinates expressed in platform coordinates, $K_g$ is the diagonal scale factor matrix, $b_g$ is the bias vector of the gyroscope cluster, and $v_g$ is the measurement noise. The misalignment matrix of the gyroscope triad is written as

$$M_g = \begin{bmatrix} 1 & -{}^g\theta_z & {}^g\theta_y \\ {}^g\beta_{yz}+{}^g\theta_z & 1 & -{}^g\theta_x \\ -{}^g\beta_{zy}-{}^g\theta_y & {}^g\beta_{zx}+{}^g\theta_x & 1 \end{bmatrix} \tag{2.18}$$
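To make the accelerometer model of Eqs. 2.12–2.16 concrete, the following Python sketch (illustrative only; the parameter values are assumptions, loosely inspired by the magnitudes in Table 2.3) builds the small-angle misalignment matrix and generates one cluster measurement.

```python
import numpy as np

def misalignment_matrix(theta_x, theta_y, theta_z, beta_yz, beta_zy, beta_zx):
    """Small-angle total misalignment matrix M of Eq. 2.14 (angles in radians)."""
    return np.array([
        [1.0,                -theta_z,            theta_y],
        [beta_yz + theta_z,   1.0,               -theta_x],
        [-beta_zy - theta_y,  beta_zx + theta_x,  1.0],
    ])

def accel_measurement(a_g, K_a, M_a, b_a, noise_std=0.0, rng=None):
    """Accelerometer-cluster model a_c = K_a M_a a_g + b_a + v_a (Eq. 2.15)."""
    rng = rng or np.random.default_rng()
    v_a = rng.normal(0.0, noise_std, size=3)
    return K_a @ M_a @ np.asarray(a_g, dtype=float) + b_a + v_a

# Assumed illustrative parameters (not measured values)
K_a = np.diag([0.081, 0.080, 0.081])                        # scale factors
b_a = np.array([1.79, 1.66, 1.74])                          # biases
angles = np.deg2rad([0.06, 0.20, 0.19, 0.01, -0.02, -0.01]) # theta_x..beta_zx
M_a = misalignment_matrix(*angles)
a_c = accel_measurement([0.0, 0.0, 9.81], K_a, M_a, b_a, noise_std=0.01)
```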
The task of MIMU calibration in this section is to estimate parameters such as the scale factors, misalignments, and biases of the accelerometer and gyroscopes at a stable temperature. The key technology of the calibration can be divided into two aspects, the calibration scheme and the calibration algorithm, which are discussed separately in the following.

(2) Calibration Scheme

The calibration scheme is the experiment designed for the MIMU. Generally, a multi-position calibration scheme is used for IMU calibration. The principle of the method is to design enough positions to estimate the calibration parameters. At least twelve positions are required, since there are twelve unknown parameters in each of the accelerometer triad and the gyroscope triad. In practice, more positions are desirable to avoid computational singularity in the estimation and to obtain numerically reliable results. The question then arises of how the positions should be designed; the current literature seldom discusses how to optimize them. In other words, the scheme should not only make all the calibration parameters identifiable but also maximize their numerical accuracy. Here, the optimal calibration positions are obtained by the D-optimal method. According to Eqs. 2.15 and 2.17, the following equation can be deduced:

$$O = PI + b + v \tag{2.19}$$
where $P = K_aM_a$ or $K_gM_g$, $I = a_g$ or $\omega_g$ denotes the input vector of the MIMU, $O = a_c$ or $\omega_c$ denotes the MIMU's measurement vector, $v = v_a$ or $v_g$, and $b = b_a$ or $b_g$. By rearranging Eq. 2.19, we obtain

$$\begin{cases} {}^xO = P_1\cdot I + b_x + {}^xv \\ {}^yO = P_2\cdot I + b_y + {}^yv \\ {}^zO = P_3\cdot I + b_z + {}^zv \end{cases} \tag{2.20}$$

where $P_1 = \begin{bmatrix} P_{11} & P_{12} & P_{13} \end{bmatrix}^T$, $P_2 = \begin{bmatrix} P_{21} & P_{22} & P_{23} \end{bmatrix}^T$, $P_3 = \begin{bmatrix} P_{31} & P_{32} & P_{33} \end{bmatrix}^T$, and ${}^iO$ denotes the MIMU's measurement along axis $i$, $i = x, y, z$.

Assume $\Xi$ is an experimental procedure used to calibrate the parameters in Eq. 2.20. The procedure is composed of n tests $U_k$ ($k \in [1, n]$). Each test $U_k$ corresponds to a test position of $\Xi$, generated by the input vector $\Gamma_k = \begin{bmatrix} I_k & 1 \end{bmatrix}$, where $I_k$ is the input vector at test point k. As a result, every experimental procedure can be mathematically represented as

$$\Xi = \{U_k(\Gamma_k) \mid k\in[1,n]\} \tag{2.21}$$

with associated outputs ${}^iO$. It can be shown that a D-optimal design can be achieved with $p \le n \le p(p+1)/2$, where p is the number of parameters to be estimated. From Eq. 2.20 one can see that there are four unknown parameters for each output axis, so an optimal number of measurement positions must exist [7, 15]. The optimization can then be described as follows:

Find: $\Xi^* = \begin{bmatrix} \Gamma_1 & \Gamma_2 & \cdots & \Gamma_n \end{bmatrix} \in E$

With:

$$\Xi^* = \arg\max_{\Xi\in E} D_n(\Xi) = \arg\max_{\Xi\in E} \det\!\left(F_n^T F_n / n\right) \tag{2.22}$$

where E is the global region of positions and $F_n = \begin{bmatrix} \Gamma_1 & \Gamma_2 & \cdots & \Gamma_n \end{bmatrix}$ is the design information matrix of the n test positions. The procedure of the D-optimal design is as follows (a minimal sketch of this search is given after the list):

(1) Select n test positions from the global candidate positions and calculate the objective function $D_{ni}$;
(2) Select the maximum $D_{ni}$ and let $D_n = \max D_{ni}$;
(3) If $n \le 12$, return to step (1);
(4) Obtain the optimal experiment positions $\Xi^*$.
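The sketch below illustrates the idea of the position search under stated assumptions: a small, hand-chosen candidate set of unit input directions, each augmented with a 1 to form Γ_k, and an exhaustive search for the four positions per axis that maximize det(FᵀF/n) as in Eq. 2.22. It is a demonstration of the criterion, not the exact procedure used in the book.

```python
import itertools
import numpy as np

def d_criterion(positions):
    """D-optimality criterion det(F^T F / n) of Eq. 2.22 for the given positions."""
    F = np.hstack([positions, np.ones((len(positions), 1))])  # rows are Gamma_k = [I_k, 1]
    return np.linalg.det(F.T @ F / len(positions))

def best_positions(candidates, n_select=4):
    """Exhaustive search of a small candidate set for the n_select positions
    maximizing the D-criterion (four unknowns per output axis)."""
    best_combo, best_val = None, -np.inf
    for combo in itertools.combinations(range(len(candidates)), n_select):
        val = d_criterion(candidates[list(combo)])
        if val > best_val:
            best_combo, best_val = combo, val
    return candidates[list(best_combo)], best_val

# assumed candidate grid: axis-aligned and 45-degree tilted unit directions
s = np.sqrt(2) / 2
candidates = np.array([
    [1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1],
    [s, s, 0], [s, 0, s], [0, s, s],
], dtype=float)
positions, value = best_positions(candidates, n_select=4)
```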
Four optimal positions are acquired for each ${}^iO$; therefore, a total of 12 optimal positions can be attained for calibration. There may be duplicate positions among them; after removing duplicates, the 9 optimal positions listed in Table 2.1 are used for calibration. Note that the MEMS gyros considered in this book are low cost and cannot sense the Earth's rotation rate. Therefore, a rate table is used in the calibration to produce a reference signal, while gravity is also used as a reference signal.

(3) Calibration Algorithm

The calibration algorithm computes the calibration parameters from the data collected according to the calibration scheme. Because the data collected from the MIMU contain random noise, an optimal estimation algorithm is generally adopted. The least-squares algorithm is most commonly used to estimate the calibration parameters, but it typically leads to biased estimates and may give non-optimal estimates of the calibration coefficients. Consequently, a Kalman filter (KF) is designed as the calibration algorithm in the following.

The KF eliminates random noises and errors using knowledge about the state-space representation of the system and the uncertainties in the process, namely the measurement noise and the process noise. It is therefore very useful as a signal filter when signals contain random disturbances or when signal errors can be treated as additional state variables of the system. For linear Gaussian systems, the KF is the optimal minimum mean square error estimator. The model can be expressed as follows:

$$X_{i+1} = \Phi X_i + w_i, \qquad Z_{i+1} = H X_{i+1} + \varepsilon_{i+1} \tag{2.23}$$

where $X_i$ is the state vector at time i, $Z_{i+1}$ is the system output (measured signal), $w_i$ is the process noise, $\varepsilon_{i+1}$ is the measurement noise, and $\Phi$ and $H$ are the state and output matrices of the state-space representation. The KF proceeds as follows.

Time update:

$$P_{i|i-1} = \Phi_{i-1} P_{i-1} \Phi_{i-1}^T, \qquad \hat{X}_{i|i-1} = \hat{X}_{i-1} \tag{2.24}$$

Measurement update:

$$\begin{cases} K_i = P_{i|i-1}H_i^T\left(H_iP_{i|i-1}H_i^T + R_i\right)^{-1} \\ \hat{X}_i = \hat{X}_{i|i-1} + K_i\left(Z_i - H_i\hat{X}_{i|i-1}\right) \\ P_i = (I - K_iH_i)P_{i|i-1} \end{cases} \tag{2.25}$$
Table 2.1 Optimal positions for calibration

No.   x       y       z
1     1       0       0
2     −1      0       0
3     √2/2    √2/2    0
4     √2/2    0       √2/2
5     0       1       0
6     0       −1      0
7     0       √2/2    √2/2
8     0       0       1
9     0       0       −1
According to the optimal positions, a bank of KFs is derived:

$$\begin{aligned} {}^xX_{i+1} &= {}^x\Phi\,{}^xX_i, & {}^xZ_{i+1} &= {}^xH\,{}^xX_{i+1} + {}^x\varepsilon_{i+1} \end{aligned} \tag{2.26}$$

$$\begin{aligned} {}^yX_{i+1} &= {}^y\Phi\,{}^yX_i, & {}^yZ_{i+1} &= {}^yH\,{}^yX_{i+1} + {}^y\varepsilon_{i+1} \end{aligned} \tag{2.27}$$

$$\begin{aligned} {}^zX_{i+1} &= {}^z\Phi\,{}^zX_i, & {}^zZ_{i+1} &= {}^zH\,{}^zX_{i+1} + {}^z\varepsilon_{i+1} \end{aligned} \tag{2.28}$$

where ${}^x\Phi = {}^y\Phi = {}^z\Phi = I_{4\times4}$, ${}^xX = \begin{bmatrix} b_x & P_{11} & P_{12} & P_{13} \end{bmatrix}^T$, ${}^yX = \begin{bmatrix} b_y & P_{21} & P_{22} & P_{23} \end{bmatrix}^T$, ${}^zX = \begin{bmatrix} b_z & P_{31} & P_{32} & P_{33} \end{bmatrix}^T$, and, with $r = g$ or $\omega$,

$${}^xH = \begin{bmatrix} 1 & r & 0 & 0 \\ 1 & -r & 0 & 0 \\ 1 & r/\sqrt{2} & r/\sqrt{2} & 0 \\ 1 & r/\sqrt{2} & 0 & r/\sqrt{2} \end{bmatrix}, \quad {}^yH = \begin{bmatrix} 1 & 0 & r & 0 \\ 1 & 0 & -r & 0 \\ 1 & r/\sqrt{2} & r/\sqrt{2} & 0 \\ 1 & 0 & r/\sqrt{2} & r/\sqrt{2} \end{bmatrix}, \quad {}^zH = \begin{bmatrix} 1 & 0 & 0 & r \\ 1 & 0 & 0 & -r \\ 1 & 0 & r/\sqrt{2} & r/\sqrt{2} \\ 1 & r/\sqrt{2} & 0 & r/\sqrt{2} \end{bmatrix}$$
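The following Python sketch shows one way (not the authors' code) to run a single filter of this bank: the state transition is the identity, the measurement matrix is ^xH of Eq. 2.26, and each measurement vector stacks the averaged sensor outputs at the four x-axis positions. The measurement-noise covariance R is an assumed value.

```python
import numpy as np

def kf_constant_state(Z, H, R, x0=None, P0=None):
    """Kalman filter of Eqs. 2.23-2.25 for a constant parameter vector.

    Z : (T, m) sequence of measurement vectors
    H : (m, p) measurement matrix, e.g. xH of Eq. 2.26
    R : (m, m) measurement-noise covariance
    Because Phi is the identity, the time update leaves x and P unchanged.
    """
    p = H.shape[1]
    x = np.zeros(p) if x0 is None else np.asarray(x0, dtype=float)
    P = 100.0 * np.eye(p) if P0 is None else np.asarray(P0, dtype=float)
    for z in np.atleast_2d(Z):
        S = H @ P @ H.T + R                      # innovation covariance
        K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
        x = x + K @ (z - H @ x)                  # measurement update of the state
        P = (np.eye(p) - K @ H) @ P              # covariance update
    return x, P

# xH of Eq. 2.26 with reference input r (gravity for the accelerometers,
# the rate-table rate for the gyros)
r = 9.81
s = r / np.sqrt(2)
xH = np.array([[1.0,  r, 0.0, 0.0],
               [1.0, -r, 0.0, 0.0],
               [1.0,  s,   s, 0.0],
               [1.0,  s, 0.0,   s]])
# x_hat = [b_x, P11, P12, P13]:
# x_hat, P = kf_constant_state(Z, xH, R=1e-4 * np.eye(4))
```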
Three simple KFs have thus been designed for the MIMU calibration. The observability is then analyzed to make sure that the calibration parameters are identifiable. For the n-state discrete linear time-invariant system of Eq. 2.23, the observability matrix G is defined by

$$G(\Phi, H) = \begin{bmatrix} H \\ H\Phi \\ \vdots \\ H\Phi^{n-1} \end{bmatrix} \tag{2.29}$$

The system is observable if and only if $\mathrm{rank}(G) = n$. From Eqs. 2.26–2.28 we can derive

$$\mathrm{rank}({}^xH) = 4, \quad \mathrm{rank}({}^yH) = 4, \quad \mathrm{rank}({}^zH) = 4 \tag{2.30}$$
Because ${}^i\Phi$ ($i = x, y, z$) are identity matrices, the rank of the observability matrix equals the number of states. We can conclude that the three KFs are observable; in other words, the parameters can be estimated from the measurements of the 9-position experiment. Therefore, the scale factors, biases, and orthogonalization angles of the MIMU are estimated by the three KFs.

However, the scale factors and biases are not stable over temperature, so thermal calibration is also important for the MIMU. The purpose of thermal calibration is to reduce the errors caused by the variation of the scale factors and biases of the MIMU when operating at different temperatures. There are two main approaches to thermal testing [7]: (1) the Soak method, in which the IMU, enclosed in a thermal chamber, is allowed to stabilize at a particular temperature corresponding to the chamber temperature before the data are recorded, i.e., data are recorded at specific temperature points; and (2) the Thermal Ramp method, in which the IMU temperature is linearly increased or decreased over a certain period of time. We use the Soak method to investigate the thermal effect of the sensors. According to Eqs. 2.15 and 2.17, the model of thermal calibration can be described as

$$\omega_c = K_{tg} M_g \omega_g + b_{tg} + v_g \tag{2.31}$$

$$a_c = K_{ta} M_a a_g + b_{ta} + v_a \tag{2.32}$$

where $K_{tg}$ and $K_{ta}$ denote the scale factors at a given temperature, and $b_{tg}$ and $b_{ta}$ denote the biases at that temperature. The misalignment parameters, obtained as described above, do not vary with temperature, so only the scale factors and biases need to be estimated. The calibration scheme can be designed using positions No. 1–6 in Table 2.1.
Analogously, the calibration algorithms are deduced as

$$\begin{aligned} {}^tX^a_{i+1} &= \Phi^a\,{}^tX^a_i, & {}^tZ^a_{i+1} &= {}^tH^a\,{}^tX^a_{i+1} + {}^t\varepsilon^a_{i+1} \end{aligned} \tag{2.33}$$

$$\begin{aligned} {}^tX^\omega_{i+1} &= \Phi^\omega\,{}^tX^\omega_i, & {}^tZ^\omega_{i+1} &= {}^tH^\omega\,{}^tX^\omega_{i+1} + {}^t\varepsilon^\omega_{i+1} \end{aligned} \tag{2.34}$$

where ${}^tX^a = \begin{bmatrix} k_x^a & k_y^a & k_z^a & b_x^a & b_y^a & b_z^a \end{bmatrix}^T$, ${}^tX^\omega = \begin{bmatrix} k_x^\omega & k_y^\omega & k_z^\omega & b_x^\omega & b_y^\omega & b_z^\omega \end{bmatrix}^T$, $\Phi^a = \Phi^\omega = I_{6\times6}$, and

$${}^tH^a = {}^tH^\omega = \begin{bmatrix} r & 0 & 0 & 1 & 0 & 0 \\ -r & 0 & 0 & 1 & 0 & 0 \\ 0 & r & 0 & 0 & 1 & 0 \\ 0 & -r & 0 & 0 & 1 & 0 \\ 0 & 0 & r & 0 & 0 & 1 \\ 0 & 0 & -r & 0 & 0 & 1 \end{bmatrix}$$

Note that the variances of the MIMU measurements differ at different temperatures, so the covariance matrices of the measurement noises should be calculated at each temperature. In the thermal calibration, a turntable and a thermal chamber are assembled together to form a Thermal-Turntable Unit, as shown in Fig. 2.3.
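As a minimal illustration of the six-position thermal scheme (positions 1–6 of Table 2.1), the sketch below estimates the three scale factors and three biases at one soak temperature. The book uses the Kalman filters of Eqs. 2.33–2.34; ordinary least squares is shown here as a simpler stand-in, since the state is constant at a fixed temperature.

```python
import numpy as np

def thermal_H(r):
    """Measurement matrix tH of Eqs. 2.33-2.34 for the six-position scheme;
    r is the reference input magnitude (gravity or rate-table rate)."""
    return np.array([
        [ r, 0, 0, 1, 0, 0],
        [-r, 0, 0, 1, 0, 0],
        [ 0, r, 0, 0, 1, 0],
        [ 0,-r, 0, 0, 1, 0],
        [ 0, 0, r, 0, 0, 1],
        [ 0, 0,-r, 0, 0, 1],
    ], dtype=float)

def estimate_scale_bias(z, r):
    """Estimate [kx, ky, kz, bx, by, bz] from the six averaged outputs z
    recorded at one soak temperature."""
    x, *_ = np.linalg.lstsq(thermal_H(r), np.asarray(z, dtype=float), rcond=None)
    return x

# e.g. build a thermal look-up table {temperature: estimate_scale_bias(z_T, 9.81)}
# and interpolate the scale factors and biases at the operating temperature.
```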
Fig. 2.3 The thermal test setup
2.1.3 Experimental Results

To verify the calibration method, a Monte Carlo simulation is implemented first. The true values of the twelve parameters in the calibration model are listed in Table 2.2; they are dimensionless and are chosen to lie in the range of expected MIMU parameters. The sensors' noise standard deviation is

$$\sigma = \begin{bmatrix} 0.001 & 0.001 & 0.001 \end{bmatrix} \tag{2.35}$$

Both the traditional method and the proposed approach are evaluated in the simulation; the traditional method here is the method of least squares [7]. The results are shown in Figs. 2.4, 2.5, 2.6, and 2.7. The parameter errors are smaller with the proposed method. Moreover, the parameters estimated by the traditional method deviate strongly when the noise standard deviation is increased, whereas the proposed method remains effective. The simulation results demonstrate the performance of the filter and the accuracy of the proposed method.

Table 2.2 The simulated parameters

Parameter      Sensor1   Sensor2   Sensor3
Bias           2.0       2.5       2.9
Scale factor   0.01      0.05      0.09
θx             0.05      0.05      0.05
θy             0.1       0.1       0.1
θz             0.01      0.01      0.01
βyz            0.5       0.5       0.5
βzy            0.5       0.5       0.5
βzx            0.8       0.8       0.8

Fig. 2.4 The bias errors of the two calibration methods

Fig. 2.5 The scale factor errors of the two calibration methods

Fig. 2.6 The package misalignment angle errors of the two calibration methods

The MIMU is then calibrated in the lab using a cube and a rate table. The outputs are sampled at a rate of 1000 Hz, and a LabVIEW program stores the digitized analog measurements from the MIMU. The collected data are later read into a MATLAB program for processing.
Fig. 2.7 The orthogonalization angle errors of two calibration methods
Fig. 2.8 Bias of accelerometers
The calibration results are shown as follows: Figs. 2.8, 2.9, 2.10, and 2.11 show the parameters of the three-axis accelerometers, and those of the gyros are shown in Figs. 2.12, 2.13, 2.14, and 2.15. The estimated results are listed in Table 2.3. The estimation procedure takes less than two seconds to complete, and the parameters converge quickly to their stable values.
Fig. 2.9 Scale factor of accelerometers
Fig. 2.10 Package misalignment angles of accelerometers
2.2 Tactile Sensor

The rapid development of intelligent robot technology has stimulated a strong demand for perception and interaction modes that involve tactile sensing [20–23], because touch can provide information about the forces, temperatures, textures, and so on between a robot and its environment. To provide more comprehensive information for robotic operation, more researchers are focusing on tactile sensing mechanisms [24]. Tactile sensing types include capacitive, piezo-resistive, magnetic, piezo-electric, vision-based, barometric, etc. [25–27].
Fig. 2.11 Non-orthogonality angles of accelerometers
Fig. 2.12 Bias of gyros
Nowadays, tactile sensors are expected not only to sense stress information but also to perceive richer information, such as slippage, loss of contact, and multidimensional force. Reference [28] proposed a multi-modal capacitive tactile sensor that achieves static and dynamic sensing with the same layer. In [29], a haptic sensor based on piezo-electric sensor arrays with pin-type modules is designed, which can obtain 3-D structural information of artificial structures. Reference [30] presents a vision-based tactile sensor that consists of a miniature camera, LED lights, and a transparent elastomer with a layer of reflective membrane and markers.
Fig. 2.13 Scale factor of gyros
Fig. 2.14 Package misalignment angles of gyros
The three-dimensional contact forces and textures can be recognized from the images. Reference [31] compares the advantages and disadvantages of twenty-eight tactile sensors. Tactile sensors have gradually achieved integration, miniaturization, and intelligence [32]. In this section, two kinds of tactile sensors, a piezo-resistive tactile sensor array and a capacitive sensor array, are described.
Fig. 2.15 Non-orthogonality angles of gyros

Table 2.3 The calibration results of the MIMU

Parameter                            AccX    AccY    AccZ    GyroX   GyroY   GyroZ
Bias (m/s², °/s)                     1.79    1.66    1.74    2.51    2.52    2.48
Scale factor (mV/(m/s²), mV/(°/s))   80.80   80.14   80.84   5.07    5.09    5.17
θx (°)                               0.06    0.06    0.06    −0.18   −0.18   −0.18
θy (°)                               0.20    0.20    0.20    0.05    0.05    0.05
θz (°)                               0.19    0.19    0.19    0.15    0.15    0.15
βyz (°)                              0.01    0.01    0.01    1.32    1.32    1.32
βzy (°)                              −0.02   −0.02   −0.02   1.05    1.05    1.05
βzx (°)                              −0.01   −0.01   −0.01   1.15    1.15    1.15
2.2.1 Piezo-Resistive Tactile Sensor Array

The most common form of piezo-resistive array tactile sensor is the sandwich structure, in which a pressure-sensitive material is embedded between upper and lower flexible electrode layers. The structure of the piezo-resistive tactile sensor is shown in Fig. 2.16. It consists of a top insulating layer, a top layer of transverse electrodes, the pressure-sensitive material, a bottom layer of longitudinal electrodes, and a bottom insulating layer. The pressure-sensitive material is deformed by external force, the resistance between the upper and lower electrodes changes, and the contact force is thereby obtained indirectly. By designing the sensor units in the form of an array, the sensor can widely sense the change of force during contact and improve the sensitivity to contact force. Here, the top and bottom electrodes are designed in the form of "five horizontal and five vertical," so that the whole pressure-sensitive material is automatically divided into a 5×5 matrix.
Fig. 2.16 The structure of the piezo-resistive tactile sensor
The pressure-sensitive array uses a double-layer FPC (flexible printed circuit) as the sensor electrodes: the row electrodes are wired from the top FPC layer, and the column electrodes from the bottom FPC layer. In this way, the tactile sensor with a flexible piezo-resistive array is obtained. The proposed sensor is flexible and can withstand a large degree of bending. Sensor arrays of the same form but different sizes (larger or smaller) can be fabricated with the same technology to meet the requirements of different applications.

The equivalent circuit of the pressure sensor is shown in Fig. 2.17. The resistance of layers A and B depends on the pressure, so by measuring the resistance, the pressure is acquired. Electrodes A and B are connected in series with resistances; electrodes C and D are connected in series with resistances and then to ground. When a positive voltage (Vcc) is applied to electrode A, the voltages at points A, B, C, and D can be detected. After switching the drive voltage from electrode A to electrode B, the voltages are detected again. Setting up the equations for the collected voltage values at $T_1$, for the circuit on the left side, we have

$$\begin{cases} I_{11}+I_{12}=I_{01} \\ U_{11}-I_{01}R_0 = I_{11}R_{11} \\ U_{21}-I_{01}R_0 = I_{12}R_{12} \end{cases} \tag{2.36}$$

where $I_{11}$ is the current through $R_{11}$, $U_{11}$ is the voltage at point A, $U_{21}$ is the voltage at point B, $U_{01}$ is the voltage at point C, and $U_{02}$ is the voltage at point D.
Fig. 2.17 Two occasions of resistance network state
At $T_2$, the equations are

$$\begin{cases} I'_{11}+I'_{12}=I'_{01} \\ U_{12}-I'_{01}R_0 = I'_{11}R_{11} \\ U_{22}-I'_{01}R_0 = I'_{12}R_{12} \end{cases} \tag{2.37}$$

Combining Eqs. 2.36 and 2.37 gives

$$U_{11}R_{12} = I_{01}R_0R_{12} + I_{11}R_{11}R_{12}, \qquad U_{21}R_{11} = I_{01}R_0R_{11} + I_{12}R_{11}R_{12} \tag{2.38}$$

$$U_{11}R_{12} + U_{21}R_{11} = I_{01}R_0R_{12} + I_{11}R_{11}R_{12} + I_{01}R_0R_{11} + I_{12}R_{11}R_{12} \tag{2.39}$$

$$U_{11}R_{12} - I_{01}R_0R_{12} + U_{21}R_{11} - I_{01}R_0R_{11} = I_{11}R_{11}R_{12} + I_{12}R_{11}R_{12} \tag{2.40}$$

$$\frac{U_{11} - I_{01}R_0}{R_{11}} + \frac{U_{21} - I_{01}R_0}{R_{12}} = I_{11} + I_{12} = I_{01} \tag{2.41}$$

Analogously, for $T_2$,

$$U_{12}R_{12} = I'_{01}R_0R_{12} + I'_{11}R_{11}R_{12}, \qquad U_{22}R_{11} = I'_{01}R_0R_{11} + I'_{12}R_{11}R_{12} \tag{2.42}$$

$$\frac{U_{12} - I'_{01}R_0}{R_{11}} + \frac{U_{22} - I'_{01}R_0}{R_{12}} = I'_{11} + I'_{12} = I'_{01} \tag{2.43}$$

Hence

$$\begin{cases} \dfrac{U_{11} - I_{01}R_0}{R_{11}} + \dfrac{U_{21} - I_{01}R_0}{R_{12}} = I_{01} \\[2mm] \dfrac{U_{12} - I'_{01}R_0}{R_{11}} + \dfrac{U_{22} - I'_{01}R_0}{R_{12}} = I'_{01} \end{cases} \tag{2.44}$$

which can be written in matrix form as

$$\begin{bmatrix} U_{11}-I_{01}R_0 & U_{21}-I_{01}R_0 \\ U_{12}-I'_{01}R_0 & U_{22}-I'_{01}R_0 \end{bmatrix}\begin{bmatrix} 1/R_{11} \\ 1/R_{12} \end{bmatrix} = \begin{bmatrix} I_{01} \\ I'_{01} \end{bmatrix} \tag{2.45}$$

Replacing $I_{01}R_0$ with $U_{01}$, the voltage value at $T_1$, replacing $I'_{01}R_0$ with $U_{02}$, the voltage value at $T_2$, and denoting $I'_{01}$ by $I_{02}$, we have

$$\begin{bmatrix} U_{11}-U_{01} & U_{21}-U_{01} \\ U_{12}-U_{02} & U_{22}-U_{02} \end{bmatrix}\begin{bmatrix} 1/R_{11} \\ 1/R_{12} \end{bmatrix} = \begin{bmatrix} I_{01} \\ I_{02} \end{bmatrix} \tag{2.46}$$

Generalizing this method to the 5×5 array tactile sensor, the pressure values can be obtained from $1/R_{xy}$, and the result can be calculated as follows:

$$\begin{bmatrix} U_{11}-U_{01} & U_{21}-U_{01} & U_{31}-U_{01} & U_{41}-U_{01} & U_{51}-U_{01} \\ U_{12}-U_{02} & U_{22}-U_{02} & U_{32}-U_{02} & U_{42}-U_{02} & U_{52}-U_{02} \\ U_{13}-U_{03} & U_{23}-U_{03} & U_{33}-U_{03} & U_{43}-U_{03} & U_{53}-U_{03} \\ U_{14}-U_{04} & U_{24}-U_{04} & U_{34}-U_{04} & U_{44}-U_{04} & U_{54}-U_{04} \\ U_{15}-U_{05} & U_{25}-U_{05} & U_{35}-U_{05} & U_{45}-U_{05} & U_{55}-U_{05} \end{bmatrix}\begin{bmatrix} 1/R_{11} \\ 1/R_{12} \\ 1/R_{13} \\ 1/R_{14} \\ 1/R_{15} \end{bmatrix} = \begin{bmatrix} I_{01} \\ I_{02} \\ I_{03} \\ I_{04} \\ I_{05} \end{bmatrix} \tag{2.47}$$
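A minimal sketch of the readout implied by Eq. 2.47 is given below (illustrative only; the array indexing convention is an assumption). For each drive configuration, the scanning circuit records the electrode voltages, the reference voltage, and the total current, and the conductances of one electrode row follow from a linear solve.

```python
import numpy as np

def row_conductances(U, U0, I0):
    """Solve the linear system of Eq. 2.47 for the conductances 1/R of one row.

    U  : (5, 5) array; U[k, j] is the voltage U_{(j+1)(k+1)} measured at drive
         configuration k (each row of Eq. 2.47 corresponds to one configuration)
    U0 : (5,) reference voltages U_{0(k+1)} for each configuration
    I0 : (5,) total currents I_{0(k+1)} for each configuration
    """
    A = np.asarray(U, dtype=float) - np.asarray(U0, dtype=float)[:, None]
    g = np.linalg.solve(A, np.asarray(I0, dtype=float))   # conductances 1/R
    return g

# The resistances follow as R = 1/g; mapping R to contact force requires the
# calibration curve of the pressure-sensitive material.
```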
2.2.2 Capacitive Sensor Array

Capacitive sensors have high sensitivity, low power consumption, and good noise immunity [33, 34]. Researchers have devoted much effort to improving the performance of capacitive tactile sensors. To optimize the sensor structure, various materials have been used to fabricate capacitive sensors, such as polysilicon and polymer materials, typically polydimethylsiloxane (PDMS) and SU-8 [35, 36]. For flexible capacitive tactile sensors, polymer materials are used for the intermediate layers, and the manufactured sensors have high sensitivity and a very short response time [37]. On the other hand, researchers are not content with single-function tactile sensors, and multi-modal tactile sensors have been designed. Reference [38] presents a multi-modal capacitive tactile sensor that provides not only static sensing (normal pressure and shear force) but also dynamic sensing (slippage); it is also sensitive to temperature variation. For fingertip sensing in robotic manipulation, the tactile sensor is required to have characteristics similar to human skin. Therefore, the requirements for the tactile sensor array include high sensitivity, high density, and flexibility [39], which present new challenges to the performance of tactile sensors. In this study, on the basis of capacitive tactile sensing principles and structures, a petal-like array capacitive tactile sensor with micro-pins is introduced. To meet the requirements of high sensitivity and spatial resolution, the structural design and fabrication process for the tactile sensor array are conducted. The performance of the sensor array and its application to robotic fingertip sensing are also studied.
The capacitive sensor is a flat-plate capacitor consisting of two parallel metal plates separated by an insulating medium. The capacitive tactile sensor commonly comprises a three-layer structure, namely a bottom electrode, a top electrode, and an elastic medium layer sandwiched in the middle. The performance of a capacitive tactile sensor array is mainly determined by the mechanical and electrical properties of the elastic insulation layer. At present, there are two typical designs [40]: one uses a flexible elastomer such as PDMS as the elastic insulating layer, and the other uses an air groove. The difference in the elastic insulation layer makes a considerable difference in performance. A sensor using the air groove as the elastic insulating layer is more sensitive than one using PDMS; however, due to the lack of support under the button, it is more susceptible to aging. For long-term use, the repeatability and durability of the sensor using PDMS are better than those of the sensor using the air groove. Combining the advantages of these two types, we propose a new type of flexible tactile sensor that combines the characteristics of air with the high elasticity of PDMS. The PDMS is formed into flexible micro-pins that act as pressure supports for the air slots, a structure inspired by the Merkel cell complex of the human fingertip, as shown in Fig. 2.18. The tip of the pin ensures high sensitivity under the initial stress condition, while the larger-diameter root of the pin gives the sensor a greater pressure-bearing capacity.

On the other hand, the array layout is also important for sensor performance, e.g., cross-coupling. A petal-shaped array design is proposed, which is more suitable for fingertip state recognition than the traditional rectangular array. We integrate 25 tactile sensing points in an area of 0.785 cm², a disk of radius 0.5 cm located at the front of the finger. The main sensing point is a circular point with a radius of 0.2 cm, and the secondary sensing points are 24 coaxial sector points. The surrounding secondary sensing points mainly provide sensitive information such as the direction of force and vibration. The ring array design imitates the sensing process of the human fingertip and provides a more delicate perception of information.

Figure 2.19 shows the overall structure of the proposed tactile sensor, which is divided into four layers. A button layer transmits the pressure to the sensitive area to increase the sensitivity of the sensor. A top electrode layer that provides the driving voltage forms the upper plate of the capacitor.
Fig. 2.18 Pin Structure in the human skin (left) and the corresponding pin design in the proposed tactile sensor (right)
Fig. 2.19 Structure of the proposed tactile sensor. (a) button layer, (b) top electrode layer, (c) elastic insulating layer, (d) bottom electrode layer
Fig. 2.20 The proposed tactile sensor. (a) the front of the sensor, (b) the back of the sensor
An elastic insulating layer, composed of the flexible elastic micro-pin array, serves as the force-propagation layer and the supporting structure, and the bottom electrode layer forms the lower plate of the capacitor. Because of the number of sensing points, three CDC (capacitance-to-digital conversion) chips are used; the CDC chip type is 7148ACPZ. Each CDC chip samples the data of eight secondary sensing points and outputs the sampled data over the IIC protocol. The main sensing point is sampled by the microcontroller unit; in our sensor, the specific microcontroller model is the M430F149. The overall circuit has a "T-type" structure that is easy to place on a fingertip. The proposed tactile sensor is shown in Fig. 2.20.
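Since each sensing point behaves, to first order, as the flat-plate capacitor described above, the following snippet (illustrative numbers only, not measured values) shows how a reduction of the electrode gap under pressure increases the capacitance according to the ideal parallel-plate relation C = ε0·εr·A/d.

```python
EPS0 = 8.854e-12  # vacuum permittivity, F/m

def plate_capacitance(area_m2, gap_m, eps_r=1.0):
    """Ideal parallel-plate capacitance C = eps0 * eps_r * A / d."""
    return EPS0 * eps_r * area_m2 / gap_m

# illustrative numbers only: a 2 mm^2 secondary sensing point with a 50 um gap
C0 = plate_capacitance(2e-6, 50e-6)
C_pressed = plate_capacitance(2e-6, 45e-6)   # gap compressed by an applied force
print((C_pressed - C0) / C0)                 # ~11 % relative capacitance increase
```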
2.2.3 Calibration and Results

To study the sensing performance of the sensor array, the XBD2102G micro-control electronic pressure test platform is used. Figure 2.21 shows the instrument and the test environment. The XBD2102G platform uses an AC stepper motor and a deceleration system to drive a precision force-measurement system for force loading and unloading. The platform also provides software control and display functions, which can simultaneously output the relationship between displacement and force, and provides measurement functions such as elastic modulus and tensile strength. The tactile sensor is attached to the test platform.
Fig. 2.21 The calibration of tactile sensors
By setting the parameters of the platform, the dynamometer presses the sensor at a constant speed (1.5 mm/min), pauses when the pressure reaches the set threshold, and then rises at a constant speed to complete the loading and unloading of the force applied to the sensor.

The main static performance indexes of the sensor include measuring range, sensitivity, resolution, nonlinearity, hysteresis, and repeatability. Their definitions are as follows:

1. Measuring range: the range of inputs the sensor can measure, i.e., the difference between the maximum and minimum measurable inputs.
2. Sensitivity: the ratio of the output increment Δy to the input increment Δx, i.e., the slope of the input–output curve, usually denoted S.
3. Resolution: the minimum detectable change of the input, denoted Δx_min; it can also be expressed as the ratio of Δx_min to the full range L_max.
4. Nonlinearity: the degree to which the input–output curve of the sensor deviates from a straight line, usually expressed as the ratio of the maximum deviation Δy_max from the fitted curve to the full range L_max:

$$E_{nl} = \Delta y_{max}/L_{max} \times 100\% \tag{2.48}$$

5. Hysteresis (hysteresis error): the non-coincidence of the input–output curves during loading and unloading. With Δm the maximum misalignment between the loading and unloading curves and L_max the full range, it is defined as

$$E_{hys} = \Delta m/L_{max} \times 100\% \tag{2.49}$$

6. Repeatability: the degree of inconsistency of the input–output curves for the same test input. In the limiting case it could be expressed by the ratio of the mean square error of repeated measurements to the full range L_max; however, since non-repeatability is a random error, it is not reasonable to use the limit value directly, and a limit error of 2–3 standard deviations is generally used:

$$E_{rpt} = \pm 2\sigma/L_{max} \times 100\% \tag{2.50}$$
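The sketch below shows one way (an illustration under assumed data formats, not the authors' analysis code) to compute these static indexes from recorded loading, unloading, and repeated-sweep data, following Eqs. 2.48–2.50.

```python
import numpy as np

def static_indexes(x_load, y_load, y_unload, y_repeats):
    """Static indexes from calibration sweeps (cf. Eqs. 2.48-2.50).

    x_load    : applied forces during the loading sweep
    y_load    : sensor outputs during loading
    y_unload  : sensor outputs during unloading at the same forces
    y_repeats : (n_runs, n_points) outputs of repeated loading sweeps
    """
    x_load, y_load, y_unload = map(np.asarray, (x_load, y_load, y_unload))
    L_max = y_load.max() - y_load.min()                              # full-scale output span
    k, c = np.polyfit(x_load, y_load, 1)                             # sensitivity = fitted slope
    E_nl = np.max(np.abs(y_load - (k * x_load + c))) / L_max * 100   # Eq. 2.48
    E_hys = np.max(np.abs(y_load - y_unload)) / L_max * 100          # Eq. 2.49
    sigma = np.max(np.std(np.asarray(y_repeats), axis=0))            # worst-case spread over runs
    E_rpt = 2 * sigma / L_max * 100                                  # Eq. 2.50
    return {"sensitivity": k, "nonlinearity_%": E_nl,
            "hysteresis_%": E_hys, "repeatability_%": E_rpt}
```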
A comprehensive test of the flexible tactile sensor is conducted in this test environment. Five repeated tests are performed, and 250 test points are recorded in each test. The output data are analyzed statistically, and the static indexes of the main sensing point and the secondary sensing points of the tactile sensor are obtained, as shown in Table 2.4. Meanwhile, the piezo-resistive tactile sensor array is also tested, and its technical parameters are obtained, as shown in Table 2.5.

Table 2.4 The static indexes of the sensing points

Performance        Main sensing point   Secondary sensing points
Size               12.56 mm²            2 mm²
Measuring range    0–30 N               0–30 N
Sensitivity        76.9%/N              62.5%/N
Resolution         0.05 N               0.05 N
Nonlinearity       25.03%               −8.45%
Hysteresis         5.25%                9.36%
Repeatability      2.62%                2.36%

Table 2.5 Technical parameters of the piezo-resistive tactile sensor

Performance                    Parameter
Area                           15 × 15 mm²
Perception units               25
Resolution                     0.1 N
Measurement range              0–20 gf (one perception unit)
Thickness                      0.2 mm
Maximum load                   500 N
Response time                  1 ms
Maximum repeatability error    7.19%
Maximum hysteresis error       6.45%
2.3 Summary

Wearable sensors are the basis of wearable technology. Two typical sensors, inertial sensors and tactile sensors, are introduced in this chapter. Calibration of the MIMU is an important step in improving its performance. The optimal calibration method has been presented together with its derivation, simulation, and experimental results. A new concept of the calibration method is proposed: the calibration scheme is designed by the D-optimal method, and the calibration algorithms are designed with Kalman filters, which are used to obtain optimal estimates of the calibration parameters. Simulations have verified the feasibility and effectiveness of the method. Thermal calibration has also been carried out, because the biases of accelerometers and gyroscopes vary significantly with temperature. The experimental results have demonstrated that the new approach outperforms the traditional methods. Meanwhile, flexible tactile sensors are proposed, and their calibration processes are introduced to show their superior performance.
References 1. Lin B, Lee I-J, Hsiao P, Yang S, Chou W (2014) Data glove embedded with 6-DOF inertial sensors for hand rehabilitation. In: Tenth international conference on intelligent information hiding and multimedia signal processing, pp 25–28. https://doi.org/10.1109/IIH-MSP.2014.14 2. Cavallo F, Esposito D, Rovini E, Aquilano M, Carrozza MC, Dario P, Maremmani C, Bongioanni P (2013) Preliminary evaluation of SensHand V1 in assessing motor skills performance in Parkinson disease. In: IEEE international conference on rehabilitation robotics [proceedings], pp 1–6. https://doi.org/10.1109/ICORR.2013.6650466 3. Zhang Z-Q, Yang G-Z (2014) Calibration of miniature inertial and magnetic sensor units for robust attitude estimation. IEEE Trans Instrum Meas 63:711–718. https://doi.org/10.1109/ TIM.2013.2281562 4. Magalhaes F, Vannozzi G, Gatta G, Fantozzi S (2014) Wearable inertial sensors in swimming motion analysis: A systematic review. J Sports Sci 33. https://doi.org/10.1080/02640414.2014. 962574 5. Kortier H, Schepers M, Veltink P (2014) On-body inertial and magnetic sensing for assessment of hand and finger kinematics. In: Proceedings of the IEEE RAS and EMBS international conference on biomedical robotics and biomechatronics, pp 555–560. https://doi.org/10.1109/ BIOROB.2014.6913836 6. Bai L, Pepper MG, Yan Y, Spurgeon SK, Sakel M, Phillips M (2014) Quantitative assessment of upper limb motion in neurorehabilitation utilizing inertial sensors. IEEE Trans Neural Syst Rehabil Eng 23. https://doi.org/10.1109/TNSRE.2014.2369740 7. Howe D, Allan DU, Barnes JA (1981) Properties of signal sources and measurement methods. In: IEEE 35th international frequency control symposium, pp 464–469. https://doi.org/10. 1109/FREQ.1981.200541 8. Chatfield A (2020) Fundamentals of high accuracy inertial navigation. American Institute of Aeronautics and Astronautics, Reston 9. Aggarwal P, Syed Z, Niu X, El-Sheimy N (2008) A standard testing and calibration procedure for low cost MEMS inertial sensors and units. J Navig 61:323–336. https://doi.org/10.1017/ S0373463307004560
10. Syed Z, Aggarwal P, Goodall C, Niu X, El-Sheimy N (2007) A new multi-position calibration method for MEMS inertial navigation systems. Meas Sci Technol 18:1897. https://doi.org/10. 1088/0957-0233/18/7/016 11. Ang W, Khosla P, Riviere C (2007) Nonlinear regression model of aLow-g MEMS accelerometer. IEEE Sensors J 7:81–88. https://doi.org/10.1109/JSEN.2006.886995 12. Renk E, Rizzo M, Collins W, Lee F, Bernstein D (2006) Calibrating a triaxial accelerometermagnetometer—using robotic actuation for sensor reorientation during data collection. IEEE Control Syst 25:86–95. https://doi.org/10.1109/MCS.2005.1550155 13. Bonnet S, Bassompierre C, Godin C, Lesecq S, Barraud A (2009) Calibration methods for inertial and magnetic sensors. Sens Actuators A Phys 156:302–311. https://doi.org/10.1016/j. sna.2009.10.008 14. Kim A, Golnaraghi MF (2004) Initial calibration of an inertial measurement unit using an optical position tracking system. In: IEEE position location and navigation symposium, pp 96– 101. https://doi.org/10.1109/PLANS.2004.1308980 15. Lai Y-C, Jan S-S, Hsiao F-B (2010) Development of a low-cost attitude and heading reference system using a three-axis rotating platform. Sensors (Basel, Switzerland) 10:2472–91. https:// doi.org/10.3390/s100402472 16. Fong W, Ong SK, Nee A (2008) Methods for in-field user calibration of an inertial measurement unit without external equipment. Meas Sci Technol 19. https://doi.org/10.1088/09570233/19/8/085202 17. Jurman D, Jankovec M, Kamnik R, Topic M (2007) Calibration and data fusion solution for the miniature attitude and heading reference system. Sens Actuators A Phys 138:411–420. https:// doi.org/10.1016/j.sna.2007.05.008 18. Hwangbo M, Kanade T (2008) Factorization-based calibration method for MEMS inertial measurement unit. In: Proceedings - IEEE international conference on robotics and automation, pp 1306–1311. https://doi.org/10.1109/ROBOT.2008.4543384 19. Zhang H, Wu Y, Wu W, Wu M, Hu X (2009) Improved multi-position calibration for inertial measurement units. Meas Sci Technol 21:015107. https://doi.org/10.1088/0957-0233/21/1/ 015107 20. Konstantinova J, Stilli A, Althoefer K (2017) Fingertip fiber optical tactile array with two-level spring structure. Sensors 17:2337. https://doi.org/10.3390/s17102337 21. Zhang J, Liu W, Gao L, Zhang Y, Tang W (2018) Design, analysis and experiment of a tactile force sensor for underwater dexterous hand intelligent grasping. Sensors 18:2427. https://doi. org/10.3390/s18082427 22. Alspach A, Hashimoto K, Kuppuswarny N, Tedrake R (2019) Soft-bubble: A highly compliant dense geometry tactile sensor for robot manipulation. In: IEEE international conference on soft robotics, pp 597–604. https://doi.org/10.1109/ROBOSOFT.2019.8722713 23. Yoo S-Y, Ahn J-E, Cserey G, Lee H-Y, Seo J-M (2019) Reliability and validity of non-invasive blood pressure measurement system using three-axis tactile force sensor. Sensors 19:1744. https://doi.org/10.3390/s19071744 24. Fang B, Sun F, Liu H, Tan C, Guo D (2019) A glove-based system for object recognition via visual-tactile fusion. Science China Inf Sci 62. https://doi.org/10.1007/s11432-018-9606-6 25. Liu W, Gu C, Zeng R, Yu P, Fu X (2018) A novel inverse solution of contact force based on a sparse tactile sensor array. Sensors 18. https://doi.org/10.3390/s18020351 26. Asano S, Muroyama M, Nakayama T, Hata Y, Nonomura Y, Tanaka S (2017) 3-axis fullyintegrated capacitive tactile sensor with flip-bonded CMOS on LTCC interposer. Sensors 17:2451. https://doi.org/10.3390/s17112451 27. 
Kim K, Song G, Park C, Yun K-S (2017) Multifunctional woven structure operating as triboelectric energy harvester, capacitive tactile sensor array, and piezoresistive strain sensor array. Sensors 17:2582. https://doi.org/10.3390/s17112582 28. Suen M-S, Chen R (2018) Capacitive tactile sensor with concentric-shape electrodes for three-axial force measurement. In: Multidisciplinary digital publishing institute proceedings, vol 2(13), pp 708. https://doi.org/10.3390/proceedings2130708
29. Shibuya K, Iwamoto Y, Hiep T, Ho V (2019). Detecting sliding movement location on morphologically changeable soft tactile sensing system with three-axis accelerometer. In: IEEE international conference on soft robotics, vol 14–18, pp 337–342. https://doi.org/10.1109/ ROBOSOFT.2019.8722818 30. Li W, Konstantinova J, Noh Y, Alomainy A, Alhoefer K, Althoefer K (2019) An elastomerbased flexible optical force and tactile sensor. In: IEEE international conference on soft robotics, vol 14–18, pp 361–366. https://doi.org/10.1109/ROBOSOFT.2019.8722793 31. Le T-H-L, Maslyczyk A, Roberge J-P, Duchaine V (2017) A highly sensitive multimodal capacitive tactile sensor. In: IEEE international conference on robotics and automation, pp 407–412. https://doi.org/10.1109/ICRA.2017.7989053 32. Shin K, Kim D, Park H, Sim M, Jang H, Sohn JI, Cha S, Kang DJ (2019) Artificial tactile sensor with pin-type module for depth profile and surface topography detection. IEEE Trans Ind Electron 1–1. https://doi.org/10.1109/TIE.2019.2912788 33. Fang B, Sun F, Yang C, Xue H, Chen W, Zhang C, Guo D, Liu H (2018) A dual-modal visionbased tactile sensor for robotic hand grasping. In: IEEE international conference on robotics and automation, pp 2483–2488. https://doi.org/10.1109/ICRA.2018.8461007 34. Kappassov Z, Corrales R, Juan A, Perdereau V (2015) Tactile sensing in dexterous robot hands – review. Robot Auton Syst 74. https://doi.org/10.1016/j.robot.2015.07.015 35. Dahiya R, Metta G, Valle M, Sandini G (2010) Tactile sensing–from humans to humanoids. IEEE Trans Robot 26:1–20. https://doi.org/10.1109/TRO.2009.2033627 36. Larsen L, Recchiuto C, Oddo C, Beccai L, Anthony C, Adams M, Carrozza MC, Ward MCL (2011) A capacitive tactile sensor array for surface texture discrimination. Microelectron Eng 88:1811–1813. https://doi.org/10.1016/j.mee.2011.01.045 37. Rana A, Roberge J-P, Duchaine V (2016) An improved soft dielectric for a highly sensitive capacitive tactile sensor. IEEE Sensors J 1–1. https://doi.org/10.1109/JSEN.2016.2605134 38. Zou L, Ge C, Wang Z, Cretu E, Li X (2017) Novel tactile sensor technology and smart tactile sensing systems: a review. Sensors 17:2653. https://doi.org/10.3390/s17112653 39. Ge C, Cretu E (2017) MEMS transducers low-cost fabrication using SU-8 in a sacrificial layerfree process. J Micromech Microeng 27. https://doi.org/10.1088/1361-6439/aa5dfb 40. Mannsfeld S, Tee B, Stoltenberg R, Chen C, Barman S, Muir B, Sokolov A, Reese C, Bao Z (2010) Highly sensitive flexible pressure sensors with microstructured rubber dielectric layers. Nat Mater 9:859–64. https://doi.org/10.1038/nmat2834
Chapter 3
Wearable Design and Computing
Abstract Wearable design and computing determine the performance of the wearable device. In this chapter, we introduce a wearable device that comprises eighteen low-cost inertial and magnetic measurement units (IMMUs). It overcomes the drawback of traditional data gloves, which capture only incomplete gesture information. The IMMUs are designed to be compact and small enough to wear on the upper arm, forearm, palm, and fingers. The orientation algorithms, including a Quaternion Extended Kalman Filter (QEKF) and a two-step optimal filter, are presented. We integrate the kinematic models of the arm, hand, and fingers into the whole system to capture the motion gesture. A position algorithm is also derived to compute the positions of the fingertips. Experimental results demonstrate that the proposed wearable device can accurately capture gestures.
3.1 Introduction

Recently, various gesture capturing and recognition technologies have been proposed. These studies can be divided into two categories based on their motion capture mechanism: vision-based or glove-based [1]. Vision-based techniques rely on image-processing algorithms to extract motion trajectory and posture information, whereas glove-based techniques rely on physical interaction with the user. With vision-based methods, users generally do not need to wear collection equipment and can move freely, but the methods are easily affected by illumination, occlusion, camera placement, and other environmental factors [2]. In comparison, glove-based techniques are easy to implement and generally provide more reliable motion data [3]. Different types of sensory gloves have been developed, both commercial and prototype. The commercial products [4] usually use expensive motion-sensing fibers and resistive-bend sensors and are consequently too costly for the consumer market [5]. The prototype data gloves are developed with lower
cost [6]. Flex sensors or bending sensors are integrated into these data gloves. However, such sensors only measure the relative orientation of articulated segments by mounting the sensor across the joint of interest, which requires accurate alignment of the sensor with the particular joint. Moreover, re-calibration during use is necessary to mitigate estimation errors due to sensor displacement. A general disadvantage of data gloves is the lack of customization for an individual subject's hand and the obstruction of tactile sensing on the palm surface, which often goes hand in hand with the mounting space required for embedding the sensors in clothing. To overcome these shortcomings, low-cost, small wearable inertial and magnetic sensors are becoming increasingly popular for data gloves. The KHU-1 data glove [7] is developed using six three-axis accelerometers, but it can only capture a few kinds of gestures. In [8], a data glove is developed based on sixteen micro inertial sensors, which can capture the movements of each finger and the palm, but the heading-angle information is missing. In [9], inertial and magnetic measurement units are used, but only four of them, so the information of each finger joint cannot be obtained. The PowerGlove [10] includes six nine-axis micro inertial sensors and ten six-axis micro inertial sensors; it covers each joint of the palm and fingers, so the motion characteristics can be better evaluated. However, it does not make full use of the nine-axis micro inertial sensors, and the heading-angle solution of some states is unstable, which leads to estimation errors in the joint angles. This survey shows that current gesture-capture devices do not take the three-dimensional motion of both the arm and the hand into account. Therefore, we use inertial and magnetic measurement units to develop a novel data glove, which can fully capture the gesture motion information of the forearm, upper arm, palm, and fingers.
3.2 Design

The human arm-hand system can be approximated as a set of rigid body parts: the upper arm, forearm, palm, and five fingers, connected by different joints, as shown in Fig. 3.1. The kinematic chain of the arm in this model has seven degrees of freedom (DOFs): the shoulder is modeled as a ball-and-socket joint with three DOFs, the elbow as a rotating hinge joint with two DOFs, and the wrist as a rotating hinge joint with two DOFs. The human hand model has 20 kinematic DOFs. The distal interphalangeal (DIP) and proximal interphalangeal (PIP) joints of each finger have 1 DOF each, while the metacarpophalangeal (MCP) joints have 2 DOFs. The third DOF of the trapeziometacarpal (TMCP) joint allows the thumb to rotate longitudinally so that it can oppose the fingers. According to the above description, 18 sensor modules are needed to capture the motions of all segments of the human arm-hand. The design of the proposed device is described as follows.
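A minimal bookkeeping sketch of this segment model is shown below (illustrative only): the three arm/palm segments plus three phalanges per finger give the eighteen rigid segments that the eighteen sensor modules must cover, and the arm joints contribute the seven DOFs listed above.

```python
FINGERS = ("thumb", "index", "middle", "ring", "little")
PHALANGES = ("proximal", "middle", "distal")

# one sensor module per rigid segment of the arm-hand model
segments = ["upper_arm", "forearm", "palm"] + [
    f"{finger}_{phalanx}" for finger in FINGERS for phalanx in PHALANGES
]

arm_dofs = {"shoulder": 3, "elbow": 2, "wrist": 2}   # 7 arm DOFs
print(len(segments), sum(arm_dofs.values()))         # 18 segments, 7 DOFs
```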
Fig. 3.1 The model of the human arm-hand
3.2.1 Inertial and Magnetic Measurement Unit Design

One of the major challenges in the development of the data glove is the design of low-cost IMMUs. Commercial IMMUs, such as the Xsens-MTi [11] or Shimmer [12], use state-of-the-art miniaturized MEMS inertial and magnetic sensors. However, these units usually integrate other components, such as processing units and transceiver modules, which increases the cost, weight, and package size. Because of their package dimensions, it is difficult to place these IMMUs precisely and stably on the keypoints of the fingers or arm, which is essential for reducing tracking errors caused by unwanted motion artifacts and significant misalignments between sensor and hand centers. Besides, the structure and size of the casing make it difficult to add more IMMUs within a small distance in order to benefit from redundant measurements or obtain more accurate measurements of the fingers. Here, we use the MPU9250 [13], which deploys System-in-Package technology and combines 9-axis inertial and magnetic sensors in a very small package. This results in a low-cost, low-power, and lightweight IMMU and enables multiple IMMUs to be powered by the micro control unit (MCU), which reduces the total weight of the system. Moreover, the small IMMU can be fastened to the glove, which makes it more appealing and easier to use. Figure 3.2a shows a photograph of the IMMU; the MPU9250 sensor is mounted on a solid PCB with dimensions of 10 × 15 × 2.6 mm and a weight of about 6 g.

Another important design issue is connectivity. Different types of communication architectures in Body Area Networks (BANs) are studied in [14]. Intra-BAN communication describes connections between sensor units and access points on the body. Wireless networking approaches are applied in prototypes to increase mobility, but they also increase the complexity of the wireless network, and there is always a trade-off between data rate and energy consumption.
Fig. 3.2 The hardware of data glove: (a) IMMU; (b) MCU
units are directly connected to a central controlling unit using cables, which results in very complex wiring. In this work, a cascaded wiring approach is proposed and developed by exploiting the master SPI bus of each IMMU. This approach simplifies the wiring without any extra components. When reading data from one string of IMMUs, the MCU does not need to switch to each IMMU in turn to fetch the data, which results in lower power consumption. In order to increase flexibility, textile cables are used to connect the IMMUs to each other and to the MCU. Here an STM32F4 microcontroller is used as the MCU, which is shown in Fig. 3.2b.
3.2.2 Wearable Design

With the above designs fixed, the wearable layout of the data glove can be determined. There are eighteen segments of the arm and hand, hence eighteen IMMUs are used to cover all of the segments. Six strings are designed, and each string carries three IMMUs for the corresponding segments. The proposed data glove is shown in Fig. 3.3. The IMMUs' data is sampled, collected, and computed by the MCU and subsequently transmitted via Bluetooth to external devices. The MCU processes the raw data, estimates the orientation of each unit, encapsulates the results into a packet, and then sends the packet to the PC over Bluetooth. The baud rate for transmitting data is 115,200 bps. With this design, the captured motion can be displayed immediately by the virtual model on the PC. The interface is written in C#. The data glove is shown in Fig. 3.4. The proposed data glove is based on the low-cost IMMU, which can capture more motion information than traditional sensors. Traditional data glove sensors such as optical fibers or Hall-effect sensors are fragile, whereas the inertial and magnetic sensor board is an independent unit: it is more compact, more durable, and more robust.
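To make the data path concrete, the following host-side sketch reads one orientation packet from the Bluetooth serial link using pyserial. The packet layout assumed here (two sync bytes followed by the 18 unit quaternions as little-endian floats) is purely illustrative; the actual firmware format is not specified in this chapter.

```python
# Host-side sketch: receive orientation packets from the MCU over Bluetooth
# serial at 115,200 bps. The sync bytes and payload layout are assumptions.
import struct
import serial  # pyserial

NUM_IMMUS = 18
PAYLOAD_LEN = NUM_IMMUS * 4 * 4          # 18 quaternions, 4 floats each

def read_packet(port):
    """Block until one complete packet is read; return 18 quaternions."""
    while port.read(1) != b"\xAA":        # hypothetical sync header 0xAA 0x55
        pass
    if port.read(1) != b"\x55":
        return None
    payload = port.read(PAYLOAD_LEN)
    if len(payload) != PAYLOAD_LEN:
        return None
    values = struct.unpack("<" + "f" * NUM_IMMUS * 4, payload)
    return [values[4 * i:4 * i + 4] for i in range(NUM_IMMUS)]

if __name__ == "__main__":
    with serial.Serial("/dev/rfcomm0", 115200, timeout=1.0) as bt:
        print(read_packet(bt))
```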
Fig. 3.3 The design of the wearable device
Fig. 3.4 The proposed data glove and wearing demonstration

Table 3.1 Comparison of the proposed data glove with other prototypes

Data glove           | Sensors                            | Capture      | Number of MCUs | Real-time calculation in MCU
KHU-1 data glove     | 5 three-axis accelerometers        | Fingertip    | 1              | No
PowerGlove           | 9 six-axis INSs, 7 nine-axis INSs  | Hand         | 3              | No
Proposed data glove  | 18 nine-axis INSs                  | Hand and arm | 1              | Yes
Commercial data gloves are too costly for the consumer market, while the data glove proposed in this chapter is low-cost. Moreover, the proposed data glove captures not only the motion of the hand but also the motion of the arm, and the estimated motion can be output in real time. Some important physical properties of the data glove are compared with the KHU-1 data glove and the PowerGlove in Table 3.1.
3.3 Motion Capture Algorithm

The motion capture algorithm is presented in this section. The models of the inertial and magnetic sensors are given first, and then two estimation algorithms, the QEKF and the two-step optimal filter, are derived.
3.3.1 Models of Inertial and Magnetic Sensors

For the inertial and magnetic sensors, two coordinate frames are set up: the navigation frame and the body frame, which is fixed to the rigid body. The orientation of a rigid body in space is determined when the axes of a coordinate frame attached to the body are specified with respect to the navigation frame. The model of the arm-hand and the frames of each IMMU are shown in Fig. 3.5. Among the fingers, the middle finger is used in the model, which is treated as a simplified six-bar linkage mechanism. The global frame is set at the shoulder, and a local reference frame is located at each IMMU. The global Z-axis is defined along the axial direction of the subject (from the head to the feet), the Y-axis along the sagittal direction (from the left shoulder to the right shoulder), and the X-axis along the coronal direction (from the back to the chest). The local z-axis is defined normal to the surface of the IMMU, pointing downward, the y-axis from the left side to the right side of the IMMU, and the x-axis from the back to the front of the IMMU. Meanwhile, two assumptions about the data glove in use are made: (1) the body remains static, and only the arm and hand are in motion; (2) the local static magnetic field is homogeneous throughout the whole arm.
Fig. 3.5 The frames of the data glove
According to the coordinate frames, the sensor models are set up as follows.

(1) Rate Gyros
The angular velocity $\omega_m$ is measured by the rate gyros. Because the sensitivity of a MEMS rate gyro is low and the Earth's angular velocity cannot be measured, the Earth rotation term is omitted from the model. The output of a rate gyro is affected by noise and by a slowly varying term called bias, that is,
$$\omega_m = \omega + b_g + w_g \tag{3.1}$$
$$\dot{b}_g = 0 \tag{3.2}$$
where $\omega$ is the true angular rate, $b_g$ is the gyro bias, and $w_g$ is assumed to be zero-mean Gaussian noise.

(2) Accelerometers
The accelerometer measurements expressed in the body frame can be written as
$$a_m = C_n^b (a + g) + b_a + w_a \tag{3.3}$$
where $a_m$ is the accelerometer measurement, $g = [0\ 0\ g]^T$ is the gravity vector with $g = 9.81\,\mathrm{m/s^2}$, $a$ is the inertial acceleration of the body, $C_n^b$ is the direction cosine matrix representing the rotation from the navigation frame to the body frame, $b_a$ denotes the biases, and $w_a$ denotes zero-mean Gaussian noise. Since the biases of MEMS accelerometers barely change once the sensor signals have stabilized (their stability is better than that of MEMS gyros), they can be assumed constant over short durations, and they can be captured and compensated effectively by in-use calibration procedures based on so-called zero-velocity updates; hence the biases are ignored in the model. In addition, the absolute acceleration of the rigid body in the inertial frame is assumed to be weak, or the rigid body is assumed static. The vector observation from the accelerometers is then
$$a_m = C_n^b g + w_a \tag{3.4}$$

(3) Magnetometers
The magnetic field vector expressed in the navigation frame is modeled by the unit vector $H_h$. Since the measurements take place in the body frame, they are given by
$$h_m = C_n^b H_h + b_h + w_h \tag{3.5}$$
where $h_m$ is the magnetometer measurement, $b_h$ denotes the disturbance vector including magnetic effects and the magnetometer bias, and $w_h \in \mathbb{R}^3$ denotes zero-mean Gaussian noise.
The magnetometers should first be calibrated to attenuate environmental magnetic effects, and the sensor outputs are bias-compensated. Accordingly, the previous equation can be rewritten as
$$h_m = C_n^b H_h + w_h \tag{3.6}$$
From the above, it is clear that the accelerometer and magnetometer models have been simplified under special conditions, which correspond to an ideal environment for motion capture. In a real environment these imperfections cannot be ignored, so it is particularly important to design a suitable algorithm to estimate the orientations.
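The simplified models of Eqs. 3.1, 3.4, and 3.6 can be exercised with a small simulation such as the sketch below, which generates noisy gyro, accelerometer, and magnetometer readings for a given true orientation. The reference magnetic field and noise levels are illustrative values only.

```python
# Minimal sketch of the simplified sensor models (Eqs. 3.1, 3.4, 3.6):
# given a true orientation C_n^b, produce noisy gyro/accel/mag readings.
import numpy as np

G_N = np.array([0.0, 0.0, 9.81])      # gravity in the navigation frame
H_N = np.array([0.6, 0.0, 0.8])       # assumed unit local magnetic field

def simulate_immu(C_bn, omega_true, b_g,
                  sigma_g=0.05, sigma_a=0.01, sigma_m=0.005):
    """C_bn: 3x3 rotation from navigation to body frame."""
    gyro = omega_true + b_g + sigma_g * np.random.randn(3)   # Eq. 3.1
    acc = C_bn @ G_N + sigma_a * np.random.randn(3)          # Eq. 3.4
    mag = C_bn @ H_N + sigma_m * np.random.randn(3)          # Eq. 3.6
    return gyro, acc, mag
```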
3.3.2 QEKF Algorithm

Given the three kinds of sensors, there are two independent ways to determine attitude and heading. The first uses open-loop gyros: the roll, pitch, and yaw rates are measured with respect to the body axes, and the angles are obtained by open-loop integration. This approach has good dynamic characteristics, but gyro errors cause the attitude angles to wander and the integration drift gradually becomes unstable. The second uses open-loop accelerometers and magnetometers: the orientations can be obtained correctly in an ideal environment, but under transient and highly dynamic conditions the results contain large errors and lose reliability. Neither way alone achieves acceptable performance, so sensor fusion is the natural choice for obtaining stable and accurate orientations. Based on the measured 3D angular velocity, acceleration, and magnetic field of a single IMMU, its orientation with respect to a global coordinate system can be estimated stably. Let N be the global frame and b the frame of each IMMU. The transformation of a 3 × 1 column vector x between N and b is expressed as
$$x^b = C_n^b[q]\, x^n \tag{3.7}$$
where the quaternion is $q = [q_0; Q]$, the skew-symmetric matrix of $Q$ is
$$[Q] = \begin{bmatrix} 0 & Q_3 & -Q_2 \\ -Q_3 & 0 & Q_1 \\ Q_2 & -Q_1 & 0 \end{bmatrix},$$
$x^n$ is the vector in the global frame, and $x^b$ is the vector in the body frame. The attitude matrix C is related to the quaternion by
$$C(q) = \left(q_0^2 - Q \cdot Q\right) I + 2QQ^T + 2q_0[Q] \tag{3.8}$$
where I is the identity matrix.
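For reference, the mapping of Eq. 3.8 from a quaternion to the attitude matrix, using the skew-symmetric convention [Q] defined above, can be written as the following sketch.

```python
# Attitude matrix from a quaternion q = [q0; Q] following Eq. 3.8.
import numpy as np

def skew_Q(Q):
    q1, q2, q3 = Q
    return np.array([[0.0,  q3, -q2],
                     [-q3, 0.0,  q1],
                     [ q2, -q1, 0.0]])

def attitude_matrix(q):
    q0, Q = q[0], np.asarray(q[1:4])
    return ((q0**2 - Q @ Q) * np.eye(3)
            + 2.0 * np.outer(Q, Q)
            + 2.0 * q0 * skew_Q(Q))
```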
The state vector is composed of the rotation quaternion. The state transition equation is
$$x_{k+1} = q_{k+1} = \phi(T_s, w_k) = \exp(\Omega_s T_s)\, q_k + {}^q w_k \tag{3.9}$$
$${}^q w_k = -\frac{T_s}{2}\, \Gamma_k\, {}^g v_k = -\frac{T_s}{2} \begin{bmatrix} [e_k \times] + q_{4k} I \\ -e_k^T \end{bmatrix} {}^g v_k \tag{3.10}$$
where $T_s$ denotes the sampling period of the measurements, and the gyro measurement noise vector ${}^g v_k$ is assumed small enough that a first-order approximation of the noisy transition matrix is possible. The process noise covariance matrix $Q_k$ then has the expression
$$Q_k = (T_s/2)^2\, \Gamma_k \Gamma_g \Gamma_k^T \tag{3.11}$$
The measurement model is constructed by stacking the accelerometer and magnetometer measurement vectors:
$$z_{k+1} = \begin{bmatrix} a_{k+1} \\ m_{k+1} \end{bmatrix} = \begin{bmatrix} C_n^b(k+1) & 0 \\ 0 & C_n^b(k+1) \end{bmatrix} \begin{bmatrix} g \\ h \end{bmatrix} + \begin{bmatrix} {}^a v_{k+1} \\ {}^m v_{k+1} \end{bmatrix} \tag{3.12}$$
The covariance matrix of the measurement model is
$$R_{k+1} = \begin{bmatrix} {}^a R_{k+1} & 0 \\ 0 & {}^m R_{k+1} \end{bmatrix} \tag{3.13}$$
where the accelerometer and magnetometer measurement noises ${}^a v_{k+1}$ and ${}^m v_{k+1}$ are uncorrelated zero-mean white noise processes with covariance matrices ${}^a R_{k+1} = \sigma_a^2 I$ and ${}^m R_{k+1} = \sigma_m^2 I$, respectively. Because of the nonlinear nature of Eq. 3.12, the EKF approach requires a first-order Taylor expansion around the current state estimate, obtained by computing the Jacobian matrix
$$H_{k+1} = \left.\frac{\partial z_{k+1}}{\partial x_{k+1}}\right|_{x_{k+1} = x_{k+1}^-} \tag{3.14}$$
The orientations are then estimated by the following EKF equations.

Compute the a priori state estimate
$$x_{k+1}^- = \phi(T_s, w_k)\, x_k \tag{3.15}$$
Compute the a priori error covariance matrix
$$P_{k+1}^- = \phi(T_s, w_k)\, P_k\, \phi(T_s, w_k)^T + Q_k \tag{3.16}$$
Compute the Kalman gain
$$K_{k+1} = P_{k+1}^- H_{k+1}^T \left( H_{k+1} P_{k+1}^- H_{k+1}^T + R_{k+1} \right)^{-1} \tag{3.17}$$
Compute the a posteriori state estimate
$$x_{k+1} = x_{k+1}^- + K_{k+1}\left[ z_{k+1} - f(x_{k+1}^-) \right] \tag{3.18}$$
Compute the a posteriori error covariance matrix
$$P_{k+1} = P_{k+1}^- - K_{k+1} H_{k+1} P_{k+1}^- \tag{3.19}$$
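A compact sketch of one QEKF cycle corresponding to Eqs. 3.15–3.19 is given below. The state-transition matrix, measurement function, and Jacobian are passed in as placeholders, since their exact construction follows Eqs. 3.9–3.14.

```python
# One predict/update cycle of the quaternion EKF (Eqs. 3.15-3.19).
import numpy as np

def qekf_step(x, P, z, phi, Q, R, h_meas, H_jac):
    # A priori state and covariance (Eqs. 3.15-3.16)
    x_pred = phi @ x
    P_pred = phi @ P @ phi.T + Q
    # Kalman gain (Eq. 3.17)
    H = H_jac(x_pred)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    # A posteriori state and covariance (Eqs. 3.18-3.19)
    x_new = x_pred + K @ (z - h_meas(x_pred))
    P_new = P_pred - K @ H @ P_pred
    x_new /= np.linalg.norm(x_new)      # re-normalize the quaternion state
    return x_new, P_new
```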
According to the above algorithm, the absolute orientation of each IMMU can be estimated. The kinematic models of the arm, hand, and fingers are then used to determine the orientation of each segment. The kinematic frames of the arm, hand, and forefinger are presented here. There are six joints, and the coordinate frames include the global frame (N), the upper-arm frame (b1), the forearm frame (b2), the palm frame (b3), the proximal frame (b4), the medial frame (b5), and the distal frame (b6). The lengths between the joints are $l_1, l_2, l_3, l_4, l_5, l_6$, as shown in Fig. 3.5. The relative orientation between two consecutive bodies can then be determined as
$$q_{ij}^r = q_i^{-1} \cdot q_j \tag{3.20}$$
where $q_{ij}^r$ is the quaternion of the relative orientation, $q_i$ is the absolute orientation quaternion of the first frame, and $q_j$ is the absolute orientation quaternion of the second frame. Meanwhile, human arm-hand motion can be approximated as the articulated motion of rigid body parts. Thus, the kinematic chain that this model produces consists of ten variables or DOFs: three in the shoulder joint, two in the elbow joint, two in the wrist joint, two in the proximal joint, one in the medial joint, and one in the distal joint. The corresponding constraints are used to determine the orientation of each segment. Furthermore, the position of the palm $p_{b3}^N$, expressed in the global frame, can be derived using forward kinematics:
$$\begin{bmatrix} p_{b_3}^N \\ 1 \end{bmatrix} = T^{Nb_1}\, T^{b_1 b_2} \begin{bmatrix} p_{b_3}^{b_2} \\ 1 \end{bmatrix} = T^{Nb_2} \begin{bmatrix} p_{b_3}^{b_2} \\ 1 \end{bmatrix} \tag{3.21}$$
where the transformations between consecutive bodies are expressed by $T^{Nb_1}$ and $T^{b_1 b_2}$. The total transformation $T^{Nb_2}$ is given by the product of the consecutive contributions:
$$T^{Nb_2} = \begin{bmatrix} R(q^{Nb_2}) & p_{b_2}^N \\ 0_3^T & 1 \end{bmatrix} \tag{3.22}$$
where $R(q^{Nb_2})$ is the orientation of the segment with respect to the global frame and $p_{b_2}^N$ is the position of the frame $b_2$ expressed in the global frame.
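The relative-orientation and forward-kinematics steps of Eqs. 3.20–3.22 can be sketched as follows. The segment offsets and the assumed link direction are illustrative only; scipy is used to convert quaternions to rotation matrices.

```python
# Relative orientations (Eq. 3.20) and chained homogeneous transforms
# (Eqs. 3.21-3.22). Quaternions are scalar-first: q = [q0, q1, q2, q3].
import numpy as np
from scipy.spatial.transform import Rotation as R

def q_conj(q):                       # unit-quaternion inverse
    return np.array([q[0], -q[1], -q[2], -q[3]])

def q_mul(p, q):                     # Hamilton product
    p0, pv = p[0], np.asarray(p[1:])
    q0, qv = q[0], np.asarray(q[1:])
    return np.concatenate(([p0 * q0 - pv @ qv],
                           p0 * qv + q0 * pv + np.cross(pv, qv)))

def relative_quaternion(q_i, q_j):   # Eq. 3.20
    return q_mul(q_conj(q_i), q_j)

def transform(q, offset):            # one segment transform (Eq. 3.22)
    T = np.eye(4)
    T[:3, :3] = R.from_quat([q[1], q[2], q[3], q[0]]).as_matrix()
    T[:3, 3] = offset
    return T

def palm_position(q_segments, lengths):
    """Chain the segment transforms (Eq. 3.21); links assumed along local -z."""
    T_total = np.eye(4)
    for q, l in zip(q_segments, lengths):
        T_total = T_total @ transform(q, np.array([0.0, 0.0, -l]))
    return T_total[:3, 3]
```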
3.3.3 Two-Step Optimal Filter

Traditional attitude solutions use the sensor data to obtain instantaneous attitude measurements, and the filtering process resorts to one of the many attitude representations. In such filters the specific character of each sensor is discarded, and even when it is addressed, the nonlinear transformations required to obtain the attitude from vector measurements distort the noise characteristics. Therefore, a two-step optimal filter is proposed. The optimal orientations are derived in two steps: the optimal measurements are obtained first, and then the optimal orientations are estimated by a second filter. Measures for improving robustness are also implemented. The details are described in the following.

(1) The First Optimal Filter
Differentiating $C_n^b$ with respect to time yields
$$\dot{C}_n^b = -S[\omega(t)] \cdot C_n^b = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix} \cdot C_n^b \tag{3.23}$$
where $\omega_x, \omega_y, \omega_z$ are the components of the angular velocity of the body frame. The dynamics of a and h are then given by
$$\dot{a} = -S[\omega_m(t)]\, a \tag{3.24}$$
$$\dot{h} = -S[\omega_m(t)]\, h \tag{3.25}$$
where a and h are, respectively, true values of the accelerometers and magnetometers.
Equations 3.24 and 3.25 can be written in compact form as
$$\dot{x}(t) = \Phi(t) \cdot x(t), \qquad z(t) = H \cdot x(t) + w \tag{3.26}$$
where $x = \begin{bmatrix} a \\ h \end{bmatrix}$, $z = \begin{bmatrix} a_m \\ h_m \end{bmatrix}$, $\Phi(t) = \begin{bmatrix} -S[\omega_m(t)] & 0 \\ 0 & -S[\omega_m(t)] \end{bmatrix}$, and $H(t) = \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix}$.
Although the system dynamics of Eq. 3.26 are nonlinear, they may nevertheless be regarded as linear time-varying. Moreover, it can be shown that the system is uniformly and completely observable provided that the vector observations are not parallel. In practice the sensor data is sampled and the AHRS algorithm is implemented digitally, so Eq. 3.26 must be converted to a discrete-time form. Let $T_s$ denote the sampling period of the AHRS measurements; the discrete-time equations are
$$x(k+1) = \Phi(k) \cdot x(k) + v(k), \qquad z(k+1) = H(k) \cdot x(k+1) + w(k) \tag{3.27}$$
where $\Phi(k+1) = I - T_s \begin{bmatrix} S[\omega_m(k)] & 0 \\ 0 & S[\omega_m(k)] \end{bmatrix}$, and $v(k)$, $w(k)$ are assumed zero-mean white Gaussian noises with intensity matrices $Q(k)$, $R(k)$, respectively.
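A minimal sketch of the first optimal filter, i.e., a discrete Kalman filter on the stacked gravity/magnetic-field state of Eqs. 3.26–3.27, is shown below. The adaptive (fuzzy) tuning of the noise matrices used by the SFKF mentioned next is not modeled here.

```python
# Discrete filtering of the 6-D state x = [a; h] (Eqs. 3.26-3.27).
import numpy as np

def skew(w):
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def vector_filter_step(x, P, z, omega_m, Ts, Q, R):
    S = skew(omega_m)
    Phi = np.eye(6) - Ts * np.block([[S, np.zeros((3, 3))],
                                     [np.zeros((3, 3)), S]])
    H = np.eye(6)
    # Prediction
    x_pred = Phi @ x
    P_pred = Phi @ P @ Phi.T + Q
    # Update with the stacked accelerometer/magnetometer measurement z
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(6) - K @ H) @ P_pred
    return x_new, P_new
```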
(2) The Second Optimal Filter
After the states $\hat{a}$ and $\hat{h}$ have been estimated by the proposed SFKF (sensor-based fuzzy Kalman filter), the corresponding attitude can be computed by the second optimal filter. The measurements from the Earth's gravity and magnetic field compose two vector triads built from $(g, H_h)$ and $(\hat{a}, \hat{h})$, and the attitude determination problem becomes that of finding an estimate C such that the difference between b and C n is minimized, where the matrices n and b are given by
$$n = \begin{bmatrix} g & H_h & g \times H_h \end{bmatrix} \tag{3.28}$$
$$b = \begin{bmatrix} \hat{a} & \hat{h} & \hat{a} \times \hat{h} \end{bmatrix} \tag{3.29}$$
The problem of estimating vehicle attitude from a set of vector observations has been studied in detail. The objective function commonly used to describe it was posed by Grace Wahba and is frequently referred to as Wahba's problem: find the orthogonal matrix A with determinant 1 that minimizes the loss function
$$J(A) = \frac{1}{2} \sum_{i=1}^{m} a_i \left\| w_i - A v_i \right\|^2 \tag{3.30}$$
where m is the number of observed vectors, $a_i$ is a positive weight on the i-th observation, $w_i$ is the measured unit vector for the i-th observation expressed in the body frame, $v_i$ is the known reference vector for the i-th observation expressed in a known inertial frame, and A is the rotation matrix that transforms a vector from the inertial frame to the body frame. For the present work, the loss function can be written as
$$J(C) = \frac{1}{2} \sum_{i=1}^{3} a_i \left\| \tilde{b}_i - C n_i \right\|^2 \tag{3.31}$$
where the $a_i$ are non-negative weights. Because the loss function may be scaled without affecting the determination of C, it is possible to set $\sum_{i=1}^{3} a_i = 1$.
The gain function associated with J(C) is defined by
$$G(C) = 1 - J(C) = \sum_{i=1}^{3} a_i\, \mathrm{tr}\!\left( \tilde{b}_i^T C n_i \right) = \mathrm{tr}\!\left( C B^T \right) \tag{3.32}$$
where $B = \sum_{i=1}^{3} a_i\, \tilde{b}_i n_i^T$.
The maximization of G(C) is complicated by the fact that the nine elements of C are subject to six constraints. Therefore, it is convenient to express C in terms of its related quaternion. Substituting the quaternion into Eq. 3.32, the gain function can be rewritten as
$$G(q) = \left( q_0^2 - Q \cdot Q \right) \mathrm{tr}\, B^T + 2\, \mathrm{tr}\!\left( Q Q^T B^T \right) + 2 q_0\, \mathrm{tr}\!\left( [Q] B^T \right) \tag{3.33}$$
where the quaternion is $q = [q_0; Q]$. Introducing some additional quantities, this leads to the bilinear form
$$G(q) = q^T K q \tag{3.34}$$
where the $4 \times 4$ matrix K is given by
$$K = \begin{bmatrix} S - \sigma I & Z \\ Z^T & \sigma \end{bmatrix}, \quad \sigma = \mathrm{tr}\, B = \sum_{i=1}^{3} a_i\, \tilde{b}_i \cdot n_i, \quad S = B^T + B = \sum_{i=1}^{3} a_i \left( \tilde{b}_i n_i^T + n_i \tilde{b}_i^T \right), \quad Z = \sum_{i=1}^{3} a_i\, \tilde{b}_i \times n_i.$$
The problem of determining the optimal attitude has thus been reduced to finding the quaternion that maximizes the bilinear form of Eq. 3.34. The constraint $q^T q = 1$ can be taken into account by the method of Lagrange multipliers. Hence, a new gain function L(q) is defined as
$$L(q) = q^T K q - \lambda \left( q^T q - 1 \right) \tag{3.35}$$
where λ is a real scalar. The original constrained problem is thus transformed into an unconstrained optimization. Differentiating Eq. 3.35 with respect to q and setting the derivative to zero gives
$$(K - \lambda I)\, q = 0 \tag{3.36}$$
This indicates that the optimal quaternion must be a normalized eigenvector of K, with λ the corresponding eigenvalue. Substituting back into Eq. 3.35 gives $L(q) = \lambda$. Therefore, the maximum is achieved when q is chosen as the normalized eigenvector corresponding to the largest eigenvalue of K; more concisely,
$$K q_{\mathrm{opt}} = \lambda_{\max} q_{\mathrm{opt}} \tag{3.37}$$
where $\lambda_{\max}$ is the largest eigenvalue of K. In fact, almost all further work on the Wahba problem focuses on faster methods for evaluating the maximum eigenvalue $\lambda_{\max}$, after which the optimization becomes a simple algebraic problem. Among these algorithms, ESOQ2 (Second Estimator of the Optimal Quaternion) is the fastest estimator [16]. The details are as follows. Equation 3.37 is equivalent to the two equations
$$\left[ (\lambda_{\max} + \mathrm{tr}\, B) I - S \right] Q = q_0 Z \tag{3.38}$$
and
$$(\lambda_{\max} - \mathrm{tr}\, B)\, q_0 = Q^T Z \tag{3.39}$$
Meanwhile, the quaternion $q_{\mathrm{opt}}$ is related to the rotation axis e and rotation angle φ by $q_{\mathrm{opt}} = \left[ e \sin(\phi/2);\ \cos(\phi/2) \right]$. Inserting $q_{\mathrm{opt}}$ into Eqs. 3.38 and 3.39 gives
$$(\lambda_{\max} - \mathrm{tr}\, B) \cos(\phi/2) = e^T Z \sin(\phi/2) \tag{3.40}$$
and
$$\left[ (\lambda_{\max} + \mathrm{tr}\, B) I - B - B^T \right] e \sin(\phi/2) = Z \cos(\phi/2) \tag{3.41}$$
Multiplying Eq. 3.41 by $(\lambda_{\max} - \mathrm{tr}\, B)$ and substituting Eq. 3.40 gives
$$M e \sin(\phi/2) = 0 \tag{3.42}$$
where $M = (\lambda_{\max} - \mathrm{tr}\, B)\left[ (\lambda_{\max} + \mathrm{tr}\, B) I - B - B^T \right] - Z Z^T = \left[ m_1 \mid m_2 \mid m_3 \right]$. The columns of $\mathrm{adj}\, M$ are the cross products of the columns of M:
$$\mathrm{adj}\, M = \left[ m_2 \times m_3 \mid m_3 \times m_1 \mid m_1 \times m_2 \right] \tag{3.43}$$
The rotation angle can be found from Eq. 3.38 or from one of the components of Eq. 3.39. The optimal quaternion is then
$$q_{\mathrm{opt}} = \frac{1}{\sqrt{ \left| (\lambda_{\max} - \mathrm{tr}\, B)\, y \right|^2 + (z \cdot y)^2 }} \begin{bmatrix} (\lambda_{\max} - \mathrm{tr}\, B)\, y \\ z \cdot y \end{bmatrix} \tag{3.44}$$
where y is the column of $\mathrm{adj}\, M$ with maximum norm.

(3) Vector Selection
Another important issue in the filter design is handling anomalous measurements. For the acceleration, to decide whether the body-fixed measured acceleration vector is suitable for measuring gravity, its norm can be compared with the known value of gravity; a better choice is to work directly with the norm of the difference between the measured acceleration vector and gravity. If the deviation exceeds a properly chosen threshold, a sensor glitch or contamination due to body motion is suspected, and the following action is taken:
$$\left| \|\hat{a}\| - g \right| \ge \varepsilon_a : \quad \begin{cases} b = \begin{bmatrix} \hat{h} & \hat{a} \times \hat{h} & \hat{h} \times (\hat{a} \times \hat{h}) \end{bmatrix} \\ n = \begin{bmatrix} H_h & g \times H_h & H_h \times (g \times H_h) \end{bmatrix} \end{cases} \tag{3.45}$$
where $\varepsilon_a$ is a suitable threshold. A similar approach is pursued for the magnetic measurements; in this case we work with the difference between the measured magnetic vector and the local magnetic reference vector. The vectors are then selected as follows:
$$\left| \|\hat{h}\| - |H_h| \right| \ge \varepsilon_h : \quad \begin{cases} b = \begin{bmatrix} \hat{a} & \hat{a} \times \hat{h} & \hat{a} \times (\hat{a} \times \hat{h}) \end{bmatrix} \\ n = \begin{bmatrix} g & g \times H_h & g \times (g \times H_h) \end{bmatrix} \end{cases} \tag{3.46}$$
(4) Summary of the Two-Step Optimal Filter
The two-step optimal filter is composed of two estimators derived from two different optimality criteria, and the final optimal orientations are estimated in two steps. This design is superior to the traditional EKF design. A traditional EKF commonly has a 9-D measurement vector consisting of the 3-D angular rate, 3-D acceleration, and 3-D local magnetic field, which directly corresponds to the measurements provided by the inertial/magnetic sensor module. The first three components of the output equation (the angular-rate portion) are linearly related to the state vector, but the other six components are nonlinearly related to it. The system is therefore nonlinear, and the EKF approach requires a first-order Taylor expansion obtained by computing the Jacobian matrix. An EKF designed with this output equation is computationally inefficient and may diverge because of the linearization. In contrast, the first optimal filter of the proposed method is significantly simpler, because its measurement equations are linear. Moreover, the linearization in the EKF distorts the noise characteristics of the sensors, which affects the precision of the estimate, whereas no linearization is performed in the proposed algorithm and the practical noise characteristics are taken into account. Furthermore, the second optimal filter, ESOQ2, is the fastest of the Wahba-problem estimators. Anti-disturbance strategies, which the traditional EKF does not consider, are also implemented in this step, enhancing the robustness of the filter. The complete diagram of the two-step optimal filter is depicted in Fig. 3.6.

(5) Orientations and Positions of the Arm-Hand
According to the above algorithm, the orientation of each IMMU can be estimated. The kinematic models of the arm, hand, and fingers are then used to determine the orientation of each segment. Human arm-hand motion can be approximated as the articulated motion of rigid body parts. These segments are the upper arm (between the shoulder and elbow joints), the forearm (between the elbow and wrist joints), the hand (between the wrist joint and the proximal joint), the proximal finger segment (between the proximal and medial joints), the medial finger segment (between the medial and distal joints), and the distal finger segment (from the distal joint on). Every joint has its own local axis. The shoulder is modeled as a rotating hinge joint with two DOFs; its movements are calculated between the vector representing the upper arm and the body. The elbow is modeled as a rotating joint with one DOF. The wrist is modeled as a ball joint with three DOFs, calculated between the vector representing the hand and a fixed point representing the center of the wrist. The proximal joint is modeled as a rotating hinge joint with two DOFs, and the medial and distal joints as rotating joints with one DOF each. Thus, the kinematic chain of this model consists of ten variables or DOFs: two in the shoulder joint, one in the elbow joint, three in the wrist joint, two in the proximal joint, one in the medial joint, and one in the distal joint. Given this model, arm-hand movement can be represented as the temporal evolution of the ten defined degrees of freedom, and the relative angular values can be estimated from the adjoining IMMUs.
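As a concrete reference for the attitude-determination step of the second optimal filter (Eqs. 3.30–3.37), the sketch below solves Wahba's problem by direct eigen-decomposition of the 4 × 4 matrix K (Davenport's q-method). ESOQ2 reaches the same optimal quaternion with less computation; this version is only meant to make the optimality condition easy to verify.

```python
# Reference solution of Kq = lambda_max q via eigen-decomposition.
import numpy as np

def optimal_quaternion(b_list, n_list, weights):
    B = sum(a * np.outer(b, n) for a, b, n in zip(weights, b_list, n_list))
    sigma = np.trace(B)
    S = B + B.T
    Z = sum(a * np.cross(b, n) for a, b, n in zip(weights, b_list, n_list))
    K = np.zeros((4, 4))
    K[:3, :3] = S - sigma * np.eye(3)
    K[:3, 3] = Z
    K[3, :3] = Z
    K[3, 3] = sigma
    eigvals, eigvecs = np.linalg.eigh(K)
    q = eigvecs[:, np.argmax(eigvals)]   # ordered [Q; q0] to match K above
    return q / np.linalg.norm(q)
```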
Fig. 3.6 The diagram of the two-step optimal filter
Furthermore, we assume that the body remains static and that the motion of the arm and hand is generated by the rotation of the joints. Hence, the position of the fingertip $p_{m_3}^N$, expressed in the global frame as shown in Fig. 3.5, can be derived using forward kinematics:
$$\begin{bmatrix} p_{m_3}^N \\ 1 \end{bmatrix} = T^{Nb_1}\, T^{b_1 b_2}\, T^{b_2 b_3}\, T^{b_3 m_1}\, T^{m_1 m_2}\, T^{m_2 m_3} \begin{bmatrix} p_{m_3}^{m_3} \\ 1 \end{bmatrix} = T^{Nm_3} \begin{bmatrix} p_{m_3}^{m_3} \\ 1 \end{bmatrix} \tag{3.47}$$
where the transformations between consecutive bodies are expressed by $T^{Nb_1}, T^{b_1 b_2}, T^{b_2 b_3}, T^{b_3 m_1}, T^{m_1 m_2}, T^{m_2 m_3}$; the superscripts denote the two coordinate frames related by each transformation (upper arm $(b_1)$, forearm $(b_2)$, hand $(b_3)$, proximal finger segment $(m_1)$, medial finger segment $(m_2)$, and distal finger segment $(m_3)$). The total transformation $T^{Nm_3}$ is given by the product of the consecutive contributions:
$$T^{Nm_3} = \begin{bmatrix} R(q^{Nm_3}) & p_{m_3}^N \\ 0_3^T & 1 \end{bmatrix} \tag{3.48}$$
where $R(q^{Nm_3})$ is the orientation of the distal phalanx with respect to the body and $p_{m_3}^N$ is the position of the distal frame expressed in the global frame.
3.4 Experimental Results

The experiments concentrate on the developed low-cost data glove for arm and hand motion capture. The motion tracking scheme is evaluated in the following experiments.
3.4.1 Orientations Assessment

A simulation is implemented to investigate the accuracy and stability of the orientation estimation under various conditions. Three cases are considered: static, quasi-static, and ferromagnetically disturbed. Real-world sensor noise levels are used in the simulation; the parameters are listed in Table 3.2. After deriving all the parameters required to initialize the filters, they are implemented in MATLAB to test the performance and accuracy of the orientation estimates. Figure 3.7 shows the performance of the two-step filter and the QEKF on the simulated data: the graphs on the left show the orientation estimated by the two-step optimal filter, and the graphs on the right show the orientation estimated by the QEKF. In the static case, the proposed algorithm achieves higher precision and smaller variance than the traditional method. When the hand is dynamic during 1001–2000 s, or disturbed by an extra magnetic field during 2001–3000 s, the traditional algorithm shows noticeable orientation deviations, whereas the results of the two-step optimal filter are only slightly affected. The comparison shows that the proposed scheme is more robust and stable, while the efficiency of the two algorithms is almost the same. It can therefore be concluded from the simulation that the proposed algorithm is suitable for orientation estimation.

Table 3.2 The mean and standard variance of simulated noises (mean, std.)

Sensor | 0–1000 s    | 1001–2000 s  | 2001–3000 s
Gyro-x | 0, 0.06     | 0, 0.06      | 0, 0.06
Gyro-y | 0, 0.08     | 0, 0.08      | 0, 0.08
Gyro-z | 0, 0.09     | 0, 0.09      | 0, 0.09
Acc-x  | 0, 0.002    | 0.03, 0.002  | 0, 0.003
Acc-y  | 0, 0.003    | 0.02, 0.003  | 0, 0.08
Acc-z  | 9.8, 0.003  | 9.8, 0.003   | 9.8, 0.003
Mag-x  | 0.6, 0.006  | 0.6, 0.006   | 0.6, 0.006
Mag-y  | 0, 0.004    | 0, 0.004     | 0.01, 0.004
Mag-z  | 0.8, 0.005  | 0.8, 0.005   | 0.8, 0.005
Fig. 3.7 The estimated attitude results of two algorithms: (a) the two-step optimal filter; (b) the QEKF
Fig. 3.8 The motion capture demonstration
3.4.2 Motion Capture Experiments

To verify the effectiveness of the proposed data glove, real-time gesture motion capture experiments are carried out. As shown in Fig. 3.8, the author wears the data glove; the PC receives, via Bluetooth, the results computed by the MCU, and the gesture is determined in real time. The update rate reaches 50 Hz. Meanwhile, the virtual model of the gesture is shown in the interface, which is displayed on the television, and the trajectory of the fingertip is displayed as well.
Fig. 3.9 3D Orientations of the IMMUs
Fig. 3.10 Orientations of the gestures
A dynamic gesture capture experiment is then carried out and assessed. The body remains static while the right arm swings up and down, and the motions are captured by the data glove. To verify the effectiveness of the orientation estimation algorithm, the three orientation angles of the IMMUs are evaluated in different situations. The results of six IMMUs, located respectively on the upper arm, forearm, palm, and the proximal, medial, and distal segments of the middle finger, are shown in Fig. 3.9. The orientations of the gestures are shown in Fig. 3.10, and the positions of the hand are shown in Fig. 3.11. As can be seen in Fig. 3.10, in 0–4 s the arm and hand remain static, and in 4–12 s only the upper arm swings up and down.
Fig. 3.11 The positions of the fingertip

Table 3.3 The RMSE of the orientations of the arm-hand
Table 3.4 The RMSE of the positions of the fingertip
Upper-arm Forearm Palm Proximal Medial Distal
Positions of the palm
x (cm) 1.23
Angle (◦) 1.11 1.10 0.20 0.28 0.22 0.18 0.17 0.32 0.20 0.18 y (cm) 1.71
0.86
z (cm) 0.78
In 18–25 s, only the forearm swings right and left. In 30–39 s, only the palm swings up and down. In 43–67 s, only the fingers bend and open. The orientations of the arm, palm, and fingers can be determined in the different situations, and the movements can be easily distinguished. Meanwhile, the accuracy of the results is assessed statistically: the root mean square error (RMSE) of the orientations is less than 0.5° and the RMSE of the positions is less than 5 cm. The detailed results are listed in Tables 3.3 and 3.4. As shown in Fig. 3.12, the author wore the device; the PC received the estimated results via Bluetooth and showed the interface, which was also displayed on the television. Four gestures were then captured by the data glove and demonstrated by the virtual model on the television: the first gesture is bending the right fingers, the second is bending the arms, the third is raising the forearms, and the fourth is left thumb up. This shows very intuitively the real-time performance of arm and hand motion capture by the wearable device.
Fig. 3.12 The demonstrations of the motion capture by the wearable device. (a) Bending the right fingers. (b) Bending the arms. (c) Raising the forearms. (d) Left thumb up
3.5 Summary

We have presented the design and development of a novel data glove for motion gesture capture. Commercial data gloves usually use high-cost motion-sensing fibers to acquire hand motion data; we instead adopt low-cost inertial and magnetic sensors to reduce the cost. A low-cost, low-power, and lightweight IMMU is designed, which is even superior to some commercial IMMUs. The data glove is based on eighteen IMMUs, which cover all segments of the arm and hand. Both online and offline calibration methods are designed to improve the accuracy of the units. We also derived the 3D arm and hand motion estimation algorithms together with the proposed kinematic models of the arm, hand, and fingers, so that the attitude of the gesture and the positions of the fingertips can be determined. For real-time evaluation and demonstration, an interface with a virtual model is designed. Performance evaluations verify that the proposed data glove can accurately capture the motion of gestures. The system is designed so that all electronic components are integrated in the same scheme, and it is convenient to wear, which makes it more appealing for the user.
References

1. Berman S, Stern H (2012) Sensors for gesture recognition systems. IEEE Trans Syst Man Cybern Part C 42:277–290. https://doi.org/10.1109/TSMCC.2011.2161077
2. Regazzoni D, De Vecchi G, Rizzi C (2014) RGB cams vs RGB-D sensors: low cost motion capture technologies performances and limitations. J Manuf Syst 33:719–728. https://doi.org/10.1016/j.jmsy.2014.07.011
3. Dipietro L, Sabatini AM, Dario P (2008) A survey of glove-based systems and their applications. IEEE Trans Syst Man Cybern Part C Appl Rev 38:461–482. https://doi.org/10.1109/TSMCC.2008.923862
4. Takacs B (2008) How and why affordable virtual reality shapes the future of education. Int J Virtual Real 7:53–66
5. Saggio G (2014) A novel array of flex sensors for a goniometric glove. Sens Actuators A Phys 205:119–125. https://doi.org/10.1016/j.sna.2013.10.030
6. Lambrecht J, Kirsch R (2014) Miniature low-power inertial sensors: promising technology for implantable motion capture systems. IEEE Trans Neural Syst Rehabil Eng 22:1138–1147. https://doi.org/10.1109/TNSRE.2014.2324825
7. Lin B, Lee I-J, Hsiao P, Yang S, Chou W (2014) Data glove embedded with 6-DOF inertial sensors for hand rehabilitation. In: Tenth international conference on intelligent information hiding and multimedia signal processing (IIH-MSP), pp 25–28. https://doi.org/10.1109/IIH-MSP.2014.14
8. Cavallo F, Esposito D, Rovini E, Aquilano M, Carrozza MC, Dario P, Maremmani C, Bongioanni P (2013) Preliminary evaluation of SensHand V1 in assessing motor skills performance in Parkinson disease. In: IEEE international conference on rehabilitation robotics, pp 1–6. https://doi.org/10.1109/ICORR.2013.6650466
9. Kortier H, Sluiter V, Roetenberg D, Veltink P (2014) Assessment of hand kinematics using inertial and magnetic sensors. J Neuroeng Rehabil 11:70. https://doi.org/10.1186/1743-0003-11-70
10. Xu Y, Wang Y, Su Y, Zhu X (2016) Research on the calibration method of micro inertial measurement unit for engineering application. J Sens 2016:1–11. https://doi.org/10.1155/2016/9108197
11. Xsens Motion Technologies (2015). http://www.xsens.com/en/general/mti. Accessed 1 March 2016
12. Shimmer website (2015). http://www.shimmersensing.com/. Accessed 1 March 2016
13. InvenSense, MPU-9250 nine-axis MEMS MotionTracking device. http://www.invensense.com/products/motion-tracking/9-axis/mpu-9250/. Accessed 1 March 2016
14. Chen M, González-Valenzuela S, Vasilakos A, Cao H (2011) Body area networks: a survey. Mob Netw Appl 16(2):171–193. https://doi.org/10.1007/s11036-010-0260-8
15. MIT Media Lab, MIThril hardware platform (2015). http://www.media.mit.edu/wearables/mithril/hardware/index.html. Accessed 1 March 2016
16. Markley L, Mortari D (2000) Quaternion attitude estimation using vector observations. J Astronaut Sci 48(2):359–380
Chapter 4
Applications of Developed Wearable Devices
Abstract This chapter describes the applications of developed wearable devices. Gesture recognition provides an intelligent, natural, and convenient way for human– robot interaction. The gesture datasets are built by wearable device with the inertial sensors. The ELM and CNN approaches are applied to gesture recognition. Furthermore, the wearable device with the capacitive tactile sensor is introduced to apply to tactile interaction. Finally, the wearable device based on the piezo-resistive tactile sensor is developed for the perception of grasping.
4.1 Gesture Recognition

Gestures are expressive and meaningful body motions involving physical movements of the fingers, hands, and arms, with the intent to convey information or to communicate with the environment. With the rapid development of computer technology, various approaches to human–computer interaction have been proposed in recent years, and interaction with hand gestures plays a significant role among these modalities. Hand-gesture-based methods therefore stand out by providing a natural way of interaction and communication. Various gesture capturing and recognition technologies have been proposed, such as KNN [1], ANNs [2], and SVM [3], and they have been applied to hand gesture recognition. Nonetheless, all of them face challenging issues: (1) slow training speed, (2) tedious human intervention, (3) large computational cost, and (4) poor generalization ability [4]. Compared with those machine learning algorithms, ELM has better generalization performance at a much faster learning speed [5, 6], and ELM is insensitive to its parameters [7, 8]. According to our investigation, ELM has not previously been applied to gesture recognition based on a data glove. Hence, we use ELM to recognize the gestures captured by the proposed data glove. Meanwhile, CNN-based gesture recognition is also developed in this section.
Fig. 4.1 Architecture of the gestures recognition based on data glove using ELM
4.1.1 ELM-Based Gesture Recognition

In this section, the ELM-based gesture recognition method is proposed. The framework of gesture recognition using ELM is described, and the ELM-based recognition methods for static and dynamic gestures are presented.

(1) Architecture of the Gesture Recognition
The framework of gesture recognition based on the proposed data glove using ELM can be divided into three stages. The first stage is establishing a gesture database: the motion data of different gestures is collected with the proposed data glove, and a gesture dataset is built that includes static and dynamic gestures. The second stage is training classifiers: on the basis of the gesture dataset, we extract the 54-dimensional hand feature of each gesture. The third stage is experiment and analysis: we collect the motion data of various gestures from different participants and use the trained classifiers for gesture recognition. The architecture of the system is shown in Fig. 4.1.

(2) ELM-Based Static Gesture Recognition
ELM was first proposed by Huang et al. [5]; it randomly generates the input weights and hidden-layer biases of SLFNs and then determines the output weights analytically. ELM is biologically inspired and provides significant advantages such as fast learning speed, independence from the implementation, and minimal human intervention [9]. The model of ELM is shown in Fig. 4.2. If the input data is x, the output function of the L hidden-layer nodes is
$$f_L(x) = \sum_{i=1}^{L} \beta_i g_i(x) = \sum_{i=1}^{L} \beta_i G_i(w_i, b_i, x), \qquad w_i \in C^d,\ x_i \in C^d,\ \beta_i \in C \tag{4.1}$$
Fig. 4.2 The model of the ELM
where $\beta_i$ is the output weight vector of the i-th hidden node and $g_i(x)$ denotes the nonlinear piecewise continuous activation function of the hidden nodes. According to the theory of Bartlett [10], a least-norm approach is used to calculate the output weights, and ELM obtains the minimum-error solution through the minimum norm, which yields good generalization performance. Given N training samples $(x_i, t_i)$, the output of the L hidden-layer nodes is
$$f(x) = \sum_{i=1}^{L} \beta_i G(w_i, b_i, x) = \beta \cdot h(x) \tag{4.2}$$
where h(x) is the output vector of the hidden layer, the parameters of the hidden-layer nodes are randomly assigned, and $\beta_i$ is the weight vector connecting the i-th hidden neuron and the output neurons. The matrix expression of this linear system is
$$H \cdot \beta = T \tag{4.3}$$
$$H = \begin{bmatrix} G(\omega_1, b_1, x_1) & \cdots & G(\omega_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ G(\omega_1, b_1, x_N) & \cdots & G(\omega_L, b_L, x_N) \end{bmatrix} \tag{4.4}$$
$$\beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{bmatrix}_{L \times d}, \qquad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}_{N \times d} \tag{4.5}$$
Based on the inputs $x_i$, the matrix H collects the outputs of the hidden layer: its i-th row is the hidden-layer output vector for input $x_i$, and its i-th column is the output of the i-th hidden neuron over all inputs $(x_1, \ldots, x_N)$. The minimum-norm least-squares solution of the linear system is
$$\left\| H \cdot \hat{\beta} - T \right\| = \min_{\beta} \left\| H \cdot \beta - T \right\| \tag{4.6}$$
Therefore,
$$\hat{\beta} = H^{\dagger} T \tag{4.7}$$
where $H^{\dagger}$ is the Moore–Penrose generalized inverse of the matrix H. To give ELM better generalization than the plain least-squares solution, which requires randomly generated input weights, kernel methods are used in the design of ELM, and a positive value 1/C (where C is a user-defined parameter) is added in the calculation of the output weights, such that
$$\beta = H^T \left( \frac{I}{C} + H H^T \right)^{-1} T \tag{4.8}$$
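A minimal ELM sketch following Eqs. 4.2–4.8 is given below: random hidden-layer parameters, a sigmoid activation, and output weights from the regularized least-squares solution. The hidden-layer size and regularization constant are illustrative values.

```python
# Minimal ELM: random hidden layer + regularized least-squares output weights.
import numpy as np

class ELM:
    def __init__(self, n_hidden=200, C=1.0, seed=0):
        self.L, self.C = n_hidden, C
        self.rng = np.random.default_rng(seed)

    def _h(self, X):
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))   # G(w, b, x)

    def fit(self, X, T):
        d = X.shape[1]
        self.W = self.rng.standard_normal((d, self.L))
        self.b = self.rng.standard_normal(self.L)
        H = self._h(X)                                         # Eq. 4.4
        N = H.shape[0]
        # beta = H^T (I/C + H H^T)^{-1} T   (Eq. 4.8)
        self.beta = H.T @ np.linalg.solve(np.eye(N) / self.C + H @ H.T, T)
        return self

    def predict(self, X):
        return self._h(X) @ self.beta                          # Eq. 4.2
```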
Kernel-based ELM can then be represented as
$$K_{\mathrm{ELM}}(x_i, x_j) = h(x_i) \cdot h(x_j) = \left[ G(\omega_1, b_1, x_i), \ldots, G(\omega_L, b_L, x_i) \right] \cdot \left[ G(\omega_1, b_1, x_j), \ldots, G(\omega_L, b_L, x_j) \right] \tag{4.9}$$
Because the parameters (ω, b) are randomly assigned, h(·) is also randomly generated, and the dual kernel optimization problem is
$$\text{minimize:}\quad L_D = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} t_i t_j K_{\mathrm{ELM}}(x_i, x_j)\, \alpha_i \alpha_j - \sum_{i=1}^{N} \alpha_i$$
$$\text{subject to:}\quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, N \tag{4.10}$$
The kernel ELM is a combination of the optimization method of neural learning and standard optimal solution finding. Due to the relatively weak optimization constraints, the kernel ELM has better generalization ability. (3) ELM-Based Dynamic Gesture Recognition In order to cope with the dynamic gestures that are in form of time series, a dynamic time warping (DTW) algorithm is used. DTW is a nonlinear warping technique combined with time warping and distance calculating, which can complete the matching process with global or local extension, compression or deformation. DTW algorithm is to establish a scientific time alignment matching path between the
characteristics of the test pattern and the reference pattern. Suppose we have two time series Ti and Ri : the test pattern feature vector sequence is T = (t1 , t2 , . . . , ti )
(4.11)
and the reference pattern feature vector sequence is R = (r1 , r2 , . . . , rj )
(4.12)
where i, j are the time serial number. A warping path W defines a mapping between Ti and Rj , so we have W = {ω1 , ω2 , . . . , ωN }
(4.13)
where N is the length of the warping path and $\omega_n = (i, j)_n$ is the n-th mapping between the i-th test-pattern feature vector and the j-th reference-pattern feature vector. The minimum warping cost is
$$\mathrm{DTW}(T, R) = \min \sqrt{ \sum_{n=1}^{N} \omega_n } \tag{4.14}$$
This path can be found using dynamic programming to evaluate the following recurrence, which relates the cumulative distance $l(T_i, R_j)$ to the current cell distance $d(T_i, R_j)$:
$$l(T_i, R_j) = d(T_i, R_j) + \min\left\{ l(T_{i-1}, R_{j-1}),\ l(T_{i-1}, R_j),\ l(T_i, R_{j-1}) \right\} \tag{4.15}$$
The popular kernel functions for ELM include: (1) the polynomial kernel $K(x, x_i) = [a(x, x_i) + c]^q$; (2) the RBF kernel $K(x, x_i) = \exp(-\gamma \|x - x_i\|^2)$; and (3) the sigmoid kernel $K(x, x_i) = \tanh[a(x, x_i) + c]$. Here we combine DTW with the RBF kernel to recognize dynamic gestures based on time series:
$$K(R, T) = \exp\left( -\gamma\, \mathrm{DTW}(R, T)^2 \right) \tag{4.16}$$
where γ is a prescribed adjusting parameter.
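The DTW distance of Eqs. 4.14–4.15 and the DTW–RBF kernel of Eq. 4.16 can be sketched in plain Python as follows; each gesture is assumed to be a sequence of feature vectors.

```python
# DTW distance via dynamic programming, plus the DTW-RBF kernel (Eq. 4.16).
import numpy as np

def dtw_distance(T, R):
    n, m = len(T), len(R)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(T[i - 1]) - np.asarray(R[j - 1]))
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]

def dtw_rbf_kernel(T, R, gamma=0.1):
    return np.exp(-gamma * dtw_distance(T, R) ** 2)
```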
4.1.2 CNN-Based Sign Language Recognition

Sign language, which consists of complex gestures, can provide a natural interface for interaction between humans and robots. In this section, a system for dynamic sign
Fig. 4.3 Data collection
language recognition based on inertial sensors based data gloves and Kinect sensor is proposed. The multi-modal dataset including the sign language information of data gloves and skeletons is built. Then the CNN structure for sign language recognition named sign language recognition network (SLRNet) is designed. It mainly consists of convolutional layers, batch normalization layers, and fully connecting layers. (1) Sign Language Dataset Our dataset is built using the devices including inertial-based data gloves, Kinect sensor, and microphone. The human arm and finger’s movements are recorded by each 9-axis IMU in real time. Meanwhile we collect image data and skeleton data from Kinect. The image size is of 1920 × 1080 pixels, 5 frames per second. The skeleton information contains three-dimensional coordinate data of 17 keypoints of the upper limb. The iPhone and AirPods are used to collect voice corresponding to the sign language. The proposed system is shown in Fig. 4.3. The dataset contains a total of ten sentences. The sentences that express demands are “I want to drink a cup of warm water,” “I want to eat bread,” and so on. The sentences that express feelings are “I feel very cold,” “I feel very lonely,” and so on. There are relatively close sentences in the dataset, such as “I want to drink a cup of warm water” and “I want to drink a cup of cold water.” There are also significant differences between sentence lengths, such as “I want to drink a cup of warm water” and “I am very cold.” We invite two women and three men as volunteers. The minimum age is 19 years old and the maximum age is 28 years old. Each person gives five samples for each sentence.
Fig. 4.4 Keypoints of skeleton
(2) Data Processing
Before the above data can be used for sign language recognition, it must be preprocessed. To make the data more suitable for CNN computation, the angle data and skeleton keypoints are normalized. The angle data is normalized first, with the angle range limited to −180° to 180°. A simple normalization can be used: assuming the maximum angle is $\theta_{\max}$ and the minimum angle is $\theta_{\min}$, the normalization formula is
$$\theta^{*} = 2 \cdot \frac{\theta - \theta_{\min}}{\theta_{\max} - \theta_{\min}} - 1 \tag{4.17}$$
where $\theta^{*}$ is the normalized value of each joint angle. For the human skeleton data, only the keypoints of the upper-limb skeleton are collected; the skeleton points are shown in Fig. 4.4. The actual data are points in three dimensions, with little variation along the z axis, so the projection of the skeleton keypoints is drawn on the xOy plane. For the normalization of the skeleton keypoints, the maximum and minimum values of the selected keypoints in the x, y, and z directions are taken over the whole sequence of a sign language action. The formulas are similar in the three directions; taking the x direction as an example, and assuming that the minimum value of all keypoints p in the x direction is $x_{\min}$ and the maximum value is $x_{\max}$ in the data stream of each sentence, the normalized formula is
$$p_x^{*} = 2 \cdot \frac{p_x - x_{\min}}{x_{\max} - x_{\min}} - 1 \tag{4.18}$$
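The per-sequence normalization of Eqs. 4.17–4.18 amounts to the following helper, which maps a 1-D signal (joint angles or one coordinate of the skeleton keypoints) into [−1, 1].

```python
# Map a 1-D signal into [-1, 1] using its own min/max (Eqs. 4.17-4.18).
import numpy as np

def normalize_to_unit_range(x):
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:            # avoid division by zero for constant signals
        return np.zeros_like(x)
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0
```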
Fig. 4.5 The histogram of the data Glove’s data
The normalization methods on the other two axes are the same. After preprocessing, the dataset of sign language needs to be further processed into the samples for training, validation, and testing for the CNN. Because the length of sentences in sign language dataset is different and the completion speed of different participants in the same sentence is different, the sample length of sign language data is different. Generally, CNN require that input data have the same size, so sign language data should be further processed. In the data collection process, the sampling frequency of data gloves is higher than that of Kinect. The histogram distribution of data length of data gloves is shown in Fig. 4.5. From the data length histogram, we can see that the data are quite different. In order to be fed into the same CNN for training, the data need to be scaled. In this experiment, glove data is scaled to the length of 700 by interpolation method, and skeleton data is scaled to the length of 50. When the data collected by data gloves are fed into the neural network, only angular information is selected and other information such as acceleration is discarded, then the data length at one time is 54. The larger the input data is, the more information can be added, but the network also needs to be deepened. If the number of samples is not enough, CNN may be over-fitted. Even when the data is scaled down to 700, the data size of a sentence is still 700 × 54. In order to reduce the size of data and increase the number of samples at the same time, the 700 × 54 sign language collected by data gloves is sampled down to 10 samples, and the data length is 7 × 54 at this time, which is close to the keypoint data 50 × 1. It should be noted that 10 data glove samples are sampled under one sign language corresponding to one skeletal keypoint sample. After processing the data length, the data samples of data gloves are 2500, and the data samples of key bone points are 250. When training CNN, datasets need to be divided into training set, validation set, and test set. The training set is used to train CNN. The validation set reflects the
effect of training together with the results on the training set, but does not participate in the training of the CNN. The test set evaluates the sign language recognition performance of the CNN after training. Of the 2500 samples, 80% are used for training, 10% for validation, and the remaining 10% for testing. In general, the user's behavior is unknown when the sign language recognition system is used; therefore, the validation and test sets come from participant number 4, whose sign language data does not participate in training, which better matches actual use.

(3) Sign Language Recognition Method
Deep learning has made great progress in image recognition, speech recognition, and translation, and CNNs are widely used for image classification, object detection, and semantic segmentation. To solve complex tasks, CNN structures have become increasingly complex and deep. In the following we introduce the CNN components used in our network. The convolutional layer is the core component of a CNN; its main ideas are local connections and weight sharing. The convolutional layer effectively reduces the number of parameters of the neural network, making it possible to optimize the network with gradient descent. The calculation formula for the convolutional layer is
$$x_j^l = f\left( \sum_{i \in M_j} x_i^{l-1} * k_{ij}^l + b_j^l \right) \tag{4.19}$$
where $x_j^l$ is the output feature map, $x_i^{l-1}$ is the input feature map, $M_j$ is the selected region in layer l−1, $k_{ij}^l$ is the weight parameter, $b_j^l$ is the bias, and f is the activation function. The training difficulty and training time of a CNN increase as the network model becomes more complicated, so a batch normalization operation is generally added before the convolutional layer. Batch normalization changes the distribution of the data so that the output values of each layer of the CNN are reasonably distributed; it speeds up training and also helps prevent over-fitting. The calculation formula for batch normalization is
$$y_i = \gamma\, \frac{x_i - u_{\beta}}{\sqrt{\sigma_{\beta}^2 + \epsilon}} + \beta \tag{4.20}$$
where x is the output of the convolutional layer before activation, $u_{\beta}$ is the mean of x, $\sigma_{\beta}^2$ is the variance of x, ε is a small constant, and γ, β are learned parameters. When a CNN is used for classification, features are generally extracted by convolutional and pooling layers, and the extracted features are finally fed to a fully connected layer for classification. The fully connected layer is
similar to the traditional back-propagation network. The calculation formula for the fully connected layer is
$$y_j = f\left( \sum_{i=1}^{N} x_i \cdot w_{ij} + b_j \right) \tag{4.21}$$
where x is the input layer, N is number of input layer nodes, wij is the weight between the links xi and yj , bj is the bias, and f is activation function. In the multi-modal sign language recognition dataset, the data used in this experiment are time series signals. Among them, there are similar sign languages, such as “I want to drink a cup of warm water” and “I want to drink a cup of cold water.” If Long Short-Term Memory (LSTM) is used to do this, then the network may easily overlook the slight difference between the two sentences because of its forgetting mechanism. According to the requirements of the experimental task, we design the CNN structure for sign language recognition which is called SLRNet. The network mainly consists of a convolutional layer, a batch normalization (BN) layer, and a fully connected layer. The usual pooling layer is not used in the network because we use a convolutional layer with a step size greater than 1 and it can also reduce the amount of data. SLRNet is divided into two parts: feature extraction and classification parts. The feature extraction portion is composed of a convolution layer and a batch normalization layer. Each convolution layer is followed by a batch normalization layer. The classification part consists of a fully connected layer. We extract the features of the left and right gloves. The fused features are fed into the fully connected layer. If we want to add other modal information, we can also use the convolutional layer and the batch normalization layer for feature extraction, and then fuse it with the characteristics of data of the inertial-based data gloves. In order to achieve the multi-modal fusion and improve the effect of network model, we also use Kinect data as input. First, we extract feature from keypoints of skeleton with SLRNet based on the keypoint data. Then, we combine these signals as inputs to SLRNet to improve our model and the structure of SLRNet based on multi-modal data is shown in Fig. 4.6. In order to recognize the sign language, it is necessary to set appropriate network hyperparameters for SLRNet, including convolution channel number, convolution core size, and convolution step size. The SLRNet hyperparameters determined by experiments are shown in Table 4.1.
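A hedged PyTorch sketch of the glove branch of SLRNet with the hyperparameters of Table 4.1 is given below: six 1-D convolutions with 48 channels, kernel size 3, and alternating strides 1 and 2, each followed by batch normalization and ReLU, with the left- and right-hand features concatenated before the fully connected classifier. Details such as padding, the exact ordering of convolution and batch normalization, and the fusion of the skeleton branch are assumptions, not the authors' exact implementation.

```python
# Illustrative SLRNet glove branch; layer sizes follow Table 4.1 loosely.
import torch
import torch.nn as nn

class GloveBranch(nn.Module):
    def __init__(self, in_channels=54, channels=48):
        super().__init__()
        layers, strides = [], [1, 2, 1, 2, 1, 2]
        c_in = in_channels
        for s in strides:
            layers += [nn.Conv1d(c_in, channels, kernel_size=3,
                                 stride=s, padding=1),
                       nn.BatchNorm1d(channels), nn.ReLU()]
            c_in = channels
        self.features = nn.Sequential(*layers)

    def forward(self, x):              # x: (batch, 54, time)
        return self.features(x).flatten(1)

class SLRNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.left, self.right = GloveBranch(), GloveBranch()
        self.classifier = nn.LazyLinear(n_classes)   # infers fused feature dim

    def forward(self, x_left, x_right):
        fused = torch.cat([self.left(x_left), self.right(x_right)], dim=1)
        return self.classifier(fused)
```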
4.1.3 Experimental Results The experiments concentrate on the contribution of this section, i.e., the ELM-based gesture recognition and the CNN-based gesture recognition. Hence, the proposed gestures recognition algorithms are evaluated in the following experiments.
Fig. 4.6 SLRNet based on multi-modal data
(1) The ELM-Based Results The experiments of the static gestures and dynamic gestures are, respectively, implemented to verify the effectiveness of the ELM-based recognition method. The ten static gestures and sixteen dynamic gestures are captured for identification in the following experiments. As a basis for the experiments, we record a dataset containing 10 different classes of static gestures, which are the numbers from “one” to “ten.” The gestures are shown in Fig. 4.7. The data has been recorded by 2 participants. The 54dimension orientations of the data glove are used to express the gestures. Hence we use the 54-dimension feature for gesture recognition classification. Table 4.2 shows that ELM-kernel can achieve better performance. Moreover, we can see from Table 4.2 that the time it takes for ELM is far less than SVM. The average gesture recognition classification accuracy of ELM-Kernel is higher than that of ELM and SVM. Figure 4.8 shows the confusion matrix across all 10 classes. Most model confusions show that the ELM-kernel method has a better gesture recognition accuracy compared with the ELM and SVM. In the dynamic gesture experiments, we record a dataset containing 16 different classes of dynamic gestures. The directions of the gestures include up, down, left, right, up-left, up-right, down-left, down-right. And the eight gestures are waved only by hand. The gestures are shown in Fig. 4.9. The data has also been recorded with 2 participants. We use the DTW distance feature for gesture recognition classification. The ELM-DTW based method and SVM-DTW based method are implemented to
Table 4.1 Hyperparameters of SLRNet

Corresponding part                              | Layer number | Layer type            | Channels/nodes | Kernel size | Stride           | Dropout  | Activation    | Batch normalization
Left-handed and right-handed feature extraction | 1–6          | Convolutional layer   | 48             | 3 × 3       | 1, 2, 1, 2, 1, 2 | No       | ReLU          | Yes
Feature extraction of keypoints of skeleton     | 1–5          | Convolutional layer   | 48             | 3 × 3       | 1, 2, 1, 2, 2    | No       | ReLU          | Yes
Classifier                                      | 1–2          | Fully connected layer | 96, 10         | No          | No               | 0.4, 0.2 | ReLU, Softmax | No
Fig. 4.7 The static gestures of numbers
Table 4.2 Gesture recognition accuracy and train time for different approaches

Method     | Accuracy | Train time (s)
ELM        | 68.05%   | 1.6094
ELM-kernel | 89.59%   | 12.5456
SVM        | 83.65%   | 1886.5

Fig. 4.8 The confusion matrix of ELM, SVM, ELM-Kernel
Fig. 4.9 The dynamic gestures of directions Table 4.3 Gesture recognition accuracy and train time for different approaches Accuracy
ELM-DTW 82.5%
SVM-DTW 78.75%
compare the performance of dynamic gesture recognition. Table 4.3 summarizes the results of recognition accuracy, and ELM-DTW based gesture recognition method achieves better performance. Figures 4.10 and 4.11, respectively, show the confusion matrix across all 16 classes. By comparing the confusion matrices of the sixteen classifiers, it is found that directions of gestures can be easily classified. In addition, most model confusions show that the ELM-DTW method has a better gesture recognition accuracy compared with the SVM-DTW method. (2)The CNN-Based Results In our dataset, we evaluate methods based on both hand-crafted features and deep networks based on CNN or LSTM on dynamic gesture recognition. The glove data is high frequency data, so the number of samples can be amplified by data resizing and down-sampling. We obtained a total of 2500 samples through data enhancement. 80% of them are used as training sets, 10% as a verification set, and 10% as a test set. The subject IDs we use for testing are: 4. Before training, the data needs to be normalized. Each IMU in the inertial-based data gloves contains 12-dimensional information. In order to reduce the complexity of the model, we only use the angle information in the inertial sensor. When using SLRNet for sign language recognition, the glove feature extraction part uses 6 layers of convolution layers for both left and right hands, and the BN layer is behind each convolution layer. Finally, the obtained convolution features are fed to the fully connected layer. When only data gloves are used, the six
Fig. 4.10 The confusion matrix of ELM-DTW
Fig. 4.11 The confusion matrix of SVM-DTW
When only the data glove data are used, the six convolutional layers of the left and right hands each have 48 channels, the convolution kernels are all 3 × 3, the number of output nodes is 10 (the action categories), and the regularization coefficient of the fully connected layer is 5e−6. When only the skeleton keypoint data are used, the channels of the five convolutional layers are all 48 and the convolution kernels are all 3 × 3; the fully connected layer is the same as in the glove-only configuration. When the skeleton keypoint data and the data glove data are used at the same time, the parameters are the same as in the two cases above, except that the two sets of features are fed into the fully connected layer simultaneously.

For the LSTM-based approach, the feature-extraction portion of the CNN-based approach is replaced with LSTM layers: feature extraction uses two LSTM layers followed by a fully connected layer. When only the skeleton keypoint data are used, the number of nodes of the two LSTM cells is 48, the number of intermediate nodes of the fully connected part is 96, and the number of output nodes is 10; the dropout ratios of the two fully connected layers are 0.4 and 0.2, and the regularization coefficient is 1e−10. When only the data glove data are used, the number of nodes of the two LSTM cells is 24, the intermediate fully connected layer has 64 nodes, the output layer has 10 nodes, the dropout ratio is 0.2, and the regularization coefficient is 1e−10. When the keypoint data and the glove data are used together, the two-layer LSTM for skeleton-keypoint feature extraction has 36 nodes per layer, the two-layer LSTM for glove feature extraction has 24 nodes per layer, the fully connected layers have 64 and 10 nodes, and the dropout ratio is 0.2.

Considering that the signal of each joint is a time series, manual features such as the mean, variance, root mean square (RMS), and waveform length (WL) can be extracted for each dimension of the data, e.g., the pitch-angle time series of the little finger. Suppose that the signal of a certain dimension is expressed as x = (x1, x2, . . . , xn), n ∈ [1, N]. The mean is

x̄ = (1/N) Σ_{i=1}^{N} xi   (4.22)

the variance is

var = (1/N) Σ_{i=1}^{N} (xi − x̄)²   (4.23)

the signal energy is

RMS = √( (1/N) Σ_{i=1}^{N} xi² )   (4.24)

and the signal complexity (waveform length) is

WL = Σ_{i=1}^{N−1} |x_{i+1} − xi|   (4.25)

Table 4.4 Sign language recognition accuracy

Experiment method | Keypoint data | Gloves data | All data
PCA+SVM | 0.788 | 0.82 | 0.82
LSTM | 0.81 | 0.808 | 0.84
CNN | 0.99 | 0.996 | 1
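As a small illustration, the sketch below computes the manual features of Eqs. (4.22)-(4.25) for one signal dimension and stacks them across channels; the function names are chosen here for illustration only.

```python
import numpy as np

def handcrafted_features(x):
    """Compute the manual features of Eqs. (4.22)-(4.25) for one
    1-D time series x (e.g., the pitch angle of the little finger)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.sum() / n                      # Eq. (4.22)
    var = ((x - mean) ** 2).sum() / n       # Eq. (4.23)
    rms = np.sqrt((x ** 2).sum() / n)       # Eq. (4.24)
    wl = np.abs(np.diff(x)).sum()           # Eq. (4.25), waveform length
    return np.array([mean, var, rms, wl])

def feature_vector(signal):
    """Stack the four features of every channel; signal: (time, channels)."""
    return np.concatenate([handcrafted_features(c) for c in np.asarray(signal).T])
```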
After extracting these signal features, each sign language sample still has a large number of features. To reduce redundant features and improve the recognition performance of the classifier, the high-dimensional features are reduced in dimension before classification. A common dimensionality reduction method, Principal Component Analysis (PCA), can be used: the n-dimensional features are projected onto k (k < n) dimensions by a matrix transformation, and these k dimensions are completely new orthogonal features rather than simply a selection of k of the original n dimensions. After the dimension is reduced, an SVM is used for multi-class classification. The final test results of the three methods are shown in Table 4.4. As can be seen from the table, the CNN-based sign language recognition method achieves significantly higher accuracy than the LSTM-based and PCA+SVM methods for all data types. In our data collection, the Kinect-based data suffer from neither occlusion nor insufficient observation, so the accuracies of the keypoint data and the glove data are close. For all three methods, the accuracy of multi-modal fusion is higher than that of any single modality.
Fig. 4.12 The loss of multi-modal CNN-based methods
Fig. 4.13 The accuracy of multi-modal CNN-based methods
In the loss curves of Fig. 4.12, orange is the loss on the training set and blue is the loss on the validation set. Because small-batch training is used, the training loss jitters strongly while the validation loss is smoother. In the accuracy curves of Fig. 4.13, orange is likewise the accuracy on the training set and blue is the accuracy on the validation set, which is not used in training; since the network is fitted on the training set, it recognizes the training set better. As training progresses, the training error decreases and the training accuracy rises. For the multi-modal classification based on the CNN method, in addition to the accuracy of 99.2%, the confusion matrix in Fig. 4.14 gives more detailed results.
Fig. 4.14 The confusion matrix of multi-modal CNN-based methods
The CNN-based method achieves good recognition results on multi-modal data. There is a recognition error in only one sentence, "I am dizzy" (Class 8): a small number of its samples are identified as "I am very hot" (Class 6) or "I feel very lonely" (Class 9). This is reasonable, since the hand takes a similar position in the sign languages "I am dizzy" and "I am very hot."
4.2 Tactile Interaction

The proposed capacitive tactile sensor is worn on the human fingertip to sense motion states for human–computer interaction, as shown in Fig. 4.15. We define eight operation states: click, double click, left press, right press, upper press, lower press, positive rotation, and negative rotation, as shown in Fig. 4.16. To make the interaction fluent, the sampling circuit of the device works at 10 kHz. The sampled data are sent to the computer via a serial port, where they are decoded and saved. We associate the pressure value with color using the hue-saturation-value (HSV) color space: in general, the greater the pressure, the more the color shifts from green toward red. After a large number of experiments, the eight status graphs are obtained as shown in Fig. 4.17.
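The sketch below shows one way to render such a status graph: each sensing-point pressure is mapped to a hue between green (low) and red (high) in HSV space. The array size and normalization range are assumptions for illustration, not the actual firmware or visualization code.

```python
import colorsys
import numpy as np

def pressure_to_rgb(p, p_max=1.0):
    """Map a pressure value to an RGB color: green (low pressure) shifting
    to red (high pressure) through the HSV hue channel."""
    ratio = np.clip(p / p_max, 0.0, 1.0)
    hue = (1.0 - ratio) * 120.0 / 360.0   # 120 deg = green, 0 deg = red
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)

def frame_to_image(frame, p_max=1.0):
    """Convert one tactile frame (e.g., a 5 x 5 array) into an RGB image array."""
    frame = np.asarray(frame, dtype=float)
    return np.array([[pressure_to_rgb(v, p_max) for v in row] for row in frame])
```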
Fig. 4.15 The wearable device of fingertip
Fig. 4.16 Eight operation states of fingertip
Fig. 4.17 The color signal of the tactile sensor
In the four states of left press, right press, upper press, and lower press, the color in the corresponding direction clearly becomes redder. The sensing-point patterns of clockwise and counterclockwise rotation are mirrored. There is no discernible difference between the sensing-point distributions of the click and double-click processes, but the two can be distinguished by force and timing. Overall, the color graphs show obvious differences in distribution, so these images can directly distinguish one state from another.

After a large number of datasets are obtained, the classification algorithm is derived. It is assumed that the value of each sensing point is pi(t), where i is the index of the sensing point. The measurement of the sensor is then

s(t) = [p1(t), . . . , pi(t), . . . , p25(t)]   (4.26)

If each dataset is sampled N times, a 25-dimensional column vector is obtained at every sampling instant, and the whole pressing process can be represented by a matrix:

Rm = [sm(0), . . . , sm(t), . . . , sm(n)]^T   (4.27)

where Rm is the dataset of the m-th sample. Because the time duration differs from sample to sample, the data need to be processed appropriately. Since the data are relatively simple, PCA is used to reduce the dimensions of Rm; after the dimensionality reduction, the sampled data become a consistent dataset R'm. 90% of the consistent data are used as the training set for classic SVM, KNN, and decision-tree classifiers, while the rest of the consistent data are used as the test set, as shown in Table 4.5.
Table 4.5 Performance comparison

Classifier | Training set | Test set | Accuracy
SVM | 21,600 | 240 | 83%
KNN | 21,600 | 240 | 99%
Decision tree | 21,600 | 240 | 97%
The table shows that the classifiers achieve high accuracy. The experiments demonstrate that the petal array structure is well suited to fingertip sensing and tactile interaction.
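A compact sketch of this pipeline using scikit-learn is given below. It assumes the press sequences have already been resampled to a common length and flattened into equal-length feature vectors; the number of PCA components and the classifier settings are placeholders rather than the values used in the experiments.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

def classify_press_states(X, y, n_components=20):
    """X: flattened press sequences (samples x features), y: state labels 0-7."""
    # PCA maps the high-dimensional samples to a consistent low-dimensional
    # representation before classification.
    X_red = PCA(n_components=n_components).fit_transform(X)
    X_tr, X_te, y_tr, y_te = train_test_split(X_red, y, test_size=0.1,
                                              stratify=y, random_state=0)
    for name, clf in [("SVM", SVC()),
                      ("KNN", KNeighborsClassifier()),
                      ("Decision tree", DecisionTreeClassifier())]:
        clf.fit(X_tr, y_tr)
        print(name, "accuracy:", clf.score(X_te, y_te))
```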
4.3 Tactile Perception

In daily life, information such as touch, vision, and hearing is widely used to recognize an object when we grasp it. Research has shown that the human brain makes use of multisensory integration and cross-modal representations [11]. With this kind of shared model, humans can transfer information about an object from one sensory modality to another, and this shared information is particularly useful. Allen proposes an algorithm for scanning the surface of an object that combines visual and tactile information [12], and later extends it by building an assume-verify system composed of the visual and tactile modalities for recognizing objects [13]. Yamada et al. [14] then introduce an approach that uses the visual and tactile modalities to describe the shape features of an object. The approach combines visual and tactile sensors through internal models with global and local transformations: visual data are first used to build a rough expression of the object's shape, and the information obtained by the tactile sensors then describes the shape features by deforming the internal model locally. Prats et al. [15] present a framework that integrates visual, tactile, and force information to control a robot arm and demonstrate an experiment in which the arm opens a door: visual control is used to locate the handle, and tactile control ensures accurate alignment with the handle. The experimental environment consists of a mobile robot, an external camera, and a cabinet with a sliding door that can move from the right side to the left. Ilonen et al. [16] formalize the fusion as a state-estimation problem and solve it with an Iterated Extended Kalman Filter; because the two modalities are complementary, the obtained model is closer to the real shape of the object than one built from visual or tactile information alone. They gather the visual and tactile information on an experimental platform built from a Kuka robot arm and a Schunk dexterous hand. Kroemer et al. [17] study the use of vision to assist tactile feature extraction, which is useful for texture and material classification. They point out a crucial problem in visual-tactile fusion, namely that it is difficult to match each pair of visual images and tactile data. To solve this problem,
they propose a joint dimensionality reduction method based on weak pairing. In the real application, only the tactile data are used to analyze the material, and the visual data serve merely as an auxiliary. Bjorkman et al. [18] first use visual information to build a rough three-dimensional model of the object and then refine the model with the information obtained from multiple contacts. Bhattacharjee et al. [19] assume that regions of the environment with similar visual features also have similar tactile features, so that RGB-D images and sparsely labeled tactile signals can be used to produce a dense haptic map of the environment to guide the robot's operation.

Research on visual-tactile fusion object recognition is still very limited. In general, vision is suitable for perceiving color, shape, and the like, while touch is suitable for perceiving temperature, hardness, and so on. Both can describe surface material: vision is usually used for coarse materials and touch for fine ones. Newell et al. [20] discuss this question in detail. Recently, Gao et al. [21] investigate joint learning of visual images and tactile data using a CNN. Guler et al. [22] study the internal state of a container holding a deformable object with fused visual-tactile information: a Kinect camera in a fixed position detects the deformation of the object after squeezing, and with the data from the tactile sensors the internal state of the object can be classified. However, this approach needs a previously built 3D model and therefore faces serious limits in real applications.

In this section, a glove-based system is proposed for object recognition. The tactile glove jointly collects tactile data from the fingertips and the palm, and can reliably perform simultaneous tactile sensing in real time for collecting human hand data during fine manipulation actions. The algorithms for data representation and fusion classification are then derived. Finally, we build a visual-tactile dataset for experimental verification.
4.3.1 Tactile Glove Description

The tactile glove is developed to collect tactile data when a human grasps an object. As shown in Fig. 4.18, the tactile glove consists of six tactile sensors on the front and an MCU board on the back of the glove. Five sensors are worn on the fingertips and one on the palm of the hand. The MCU board collects and sends the tactile information, and an adapter board converts the serial port to a USB port. The sensors' measurements are transmitted to the computer. The demonstration is shown in Fig. 4.19.
Fig. 4.18 Proposed tactile glove
Fig. 4.19 The demonstration of the grasping
4.3.2 Visual Modality Representation

A major concern in visual modality representation is the lack of a competent similarity criterion that captures both statistical and spatial properties; most approaches depend on either color distributions or structural models. Here the covariance descriptor [23] is used for the visual modality representation. The covariance descriptor integrates a variety of feature channels, computes their correlations, and produces a low-dimensional description of the visual features. For an image of the target object, each pixel is extended into a five-dimensional feature vector

fi = [I, ∂Ix, ∂Iy, ∂²Ix, ∂²Iy]   (4.28)
where I is the gray-scale value; ∂Ix and ∂Iy are the partial derivatives of the gray scale along the X and Y axes; ∂²Ix and ∂²Iy are the second partial derivatives along the X and Y axes; and {fi}, i = 1, . . . , d, is the set of five-dimensional feature vectors of all pixels in the image. For each image, the 5 × 5 matrix

P = (1/(d − 1)) Σ_{i=1}^{d} (fi − μ)(fi − μ)^T   (4.29)

is called the covariance descriptor of the image, where d is the number of pixels in the image and μ is the mean of the five-dimensional feature vectors.

A covariance matrix is a positive definite matrix, and a key problem of studies based on positive definite matrices is their modeling and computation. Although the set of positive definite matrices is not a linear space, it forms a Riemannian manifold: each positive definite matrix can be regarded as a point on the manifold. A nonsingular covariance matrix is a symmetric positive definite matrix, so when covariance matrices are used to describe images, the similarity between two images can be transformed into the distance between two points on the manifold. The mathematical model in this space differs from the one in Euclidean space, so the Log-Euclidean distance is used to approximate the distance. Hence, the covariance distance between any pair of covariance matrices Pi and Pj is

d_CovD(Pi, Pj) = ||logm(Pi) − logm(Pj)||_F   (4.30)

where Pi and Pj are two symmetric positive definite matrices (the covariance descriptors of image i and image j); i, j = 1, . . . , N, where N is the number of images; logm is the matrix logarithm; and ||·||_F is the Frobenius norm. Figure 4.20 is a schematic diagram of the covariance matrix extracted from the original image.
Fig. 4.20 Schematic diagram of the covariance matrix extracted from original image
Each column of the feature matrix in Fig. 4.20 represents the 5D feature vector of one pixel, and d is the total number of pixels in the image. The feature matrix is then transformed into the 5 × 5 covariance descriptor.
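The following is a minimal NumPy/SciPy sketch of Eqs. (4.29)-(4.30). The gradient operators used to build the five channels are a simplifying assumption; any derivative filter could be substituted.

```python
import numpy as np
from scipy.linalg import logm

def covariance_descriptor(gray):
    """5 x 5 covariance descriptor of Eq. (4.29) from a grayscale image."""
    gray = np.asarray(gray, dtype=float)
    dy, dx = np.gradient(gray)                 # first derivatives along Y and X
    d2y, _ = np.gradient(dy)
    _, d2x = np.gradient(dx)
    F = np.stack([gray, dx, dy, d2x, d2y]).reshape(5, -1)   # 5 x d feature matrix
    return np.cov(F)                           # unbiased covariance, 1/(d-1) factor

def covariance_distance(P_i, P_j):
    """Log-Euclidean distance of Eq. (4.30); assumes nonsingular descriptors."""
    diff = np.real(logm(P_i)) - np.real(logm(P_j))
    return np.linalg.norm(diff, ord="fro")
```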
4.3.3 Tactile Modality Representation

Since grasping an object is a dynamic process and the tactile information gathered varies with time, the DTW method is used to process the tactile features. It is assumed that the feature vector sequence of the reference template is R = {r1, r2, . . . , ri} and the feature vector sequence to be recognized is T = {t1, t2, . . . , tj}, where in general i ≠ j. In this section, both R and T are tactile sequences of the target object. The primary goal of the DTW algorithm is to warp the two sequences in time so that their lengths coincide while the total matching distance between the sequence to be recognized and the reference template is minimized. A matching-pair sequence C = {c(1), c(2), . . . , c(N)} describes how the elements of the two feature vector sequences are matched, where N is the length of the matching path, c(n) = (i(n), j(n)), i(n) denotes the i(n)-th feature vector of the reference template, j(n) denotes the j(n)-th feature vector of the sequence to be recognized, and the local matching distance is d(r_{i(n)}, t_{j(n)}). The DTW algorithm uses dynamic programming to find the best matching path, along which the total matching distance between the two sequences reaches its minimum:

d(R, T) = min Σ_{n=1}^{N} d(r_{i(n)}, t_{j(n)})   (4.31)
Adding constraint conditions to the matching-pair sequence makes the path found in practice fit better. If Eq. 4.31 is used directly, the matching distance may well be produced by matching unrelated feature vectors, which makes the subsequent classification inaccurate. Hence, the matching-pair sequence must meet the following two requirements:

(1) Since the feature vectors occur in temporal order, the matching path must respect this order, i.e., the matching-pair sequence must satisfy the non-decreasing constraint:

c_{i+1} ≥ c_i   (4.32)

(2) Some feature vectors in the sequences have a large influence on classification. To ensure classification accuracy, the matching-pair sequence must not skip any sequence element.
Fig. 4.21 DTW finding path
Fig. 4.22 DTW locally constrained path
Figure 4.21 shows one specific matching of the matching-pair sequence: the X axis represents the feature vectors to be recognized, the Y axis represents the template feature vectors, and the broken line is the path accumulated from the local matching distances d(r_{i(n)}, t_{j(n)}). In order to satisfy the two constraints above, the local path needs to be limited; Fig. 4.22 shows a common constraint scheme. The DTW algorithm searches for the globally best path as follows: starting from the matching pair (1, 1) and following the path-search requirements of Fig. 4.22, the matching pair before (r_{i(n)}, t_{j(n)}) can only be one of (r_{i(n−1)}, t_{j(n)}), (r_{i(n−1)}, t_{j(n−1)}), and (r_{i(n)}, t_{j(n−1)}); to find the globally best path, (r_{i(n)}, t_{j(n)}) picks as its predecessor the one that minimizes the accumulated path length:

D(r_{i(n)}, t_{j(n)}) = d(r_{i(n)}, t_{j(n)}) + min( D(r_{i(n−1)}, t_{j(n)}), D(r_{i(n−1)}, t_{j(n−1)}), D(r_{i(n)}, t_{j(n−1)}) )   (4.33)

Starting from (1, 1) and searching according to Eq. 4.33 yields the globally best path and the DTW distance

d_DTW(R, T) = D(r_{i(N)}, t_{j(N)})   (4.34)
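A straightforward dynamic-programming sketch of this recursion is shown below; the Euclidean norm is used as the local distance, which is an assumption since the text does not fix a particular local metric.

```python
import numpy as np

def dtw_distance(R, T):
    """DTW distance of Eqs. (4.31)-(4.34) between two feature-vector
    sequences R (reference template) and T (sequence to be recognized)."""
    R, T = np.asarray(R, dtype=float), np.asarray(T, dtype=float)
    I, J = len(R), len(T)
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            cost = np.linalg.norm(R[i - 1] - T[j - 1])   # local distance d(r_i, t_j)
            # Eq. (4.33): extend the cheapest of the three allowed predecessors.
            D[i, j] = cost + min(D[i - 1, j], D[i - 1, j - 1], D[i, j - 1])
    return D[I, J]                                       # Eq. (4.34)
```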
4.3.4 Visual-Tactile Fusion Classification

In this section, we address the visual-tactile fusion classification problem with the kernel ELM method. The training and recognition processes of the kernel-ELM-based visual-tactile fusion algorithm for object recognition are as follows.

(1) Training process of the visual-tactile fusion object recognition algorithm

Input: the training set of visual-tactile pairs and the corresponding labels Y, the Gaussian variance γ, and the regularization coefficient C.
Output: W (the training set of visual-tactile pairs is saved at the same time).
Step 1: Extract the visual features: for each image of the visual training set, calculate the 5 × 5 covariance matrix; calculate the covariance distance d_CovD(Pi, Pj) between the covariance matrices of any two images; obtain the covariance distance matrix D_CovD.
Step 2: Extract the tactile features: calculate the DTW distance d_DTW(R, T) between each tactile sequence in the tactile training set and every training sample (including itself) according to Eq. 4.34, obtaining the DTW distance matrix D_DTW.
Step 3: Visual feature kernel: feed the distances into the Gaussian kernel function and get

K_CovD = exp(−γ D_CovD²)   (4.35)

Step 4: Tactile feature kernel: feed the distances into the Gaussian kernel function and get

K_DTW = exp(−γ D_DTW²)   (4.36)

Step 5: Visual-tactile fusion: we define the fusion kernel Kernel(K_CovD, K_DTW) as the product of K_CovD and K_DTW:

Kernel(K_CovD, K_DTW) = K_CovD ∗ K_DTW   (4.37)

Step 6: Train the kernel ELM classifier: put Kernel(K_CovD, K_DTW) into the formula W = (I/C + Ω_ELM)^{−1} Y.

(2) Recognition process of the visual-tactile fusion object recognition algorithm

Input: a testing sample of a visual-tactile pair, W, and the saved training set of visual-tactile pairs.
Output: the label label(x) of the testing sample.
Step 1: Extract the visual features: for the visual image of the testing sample and each image of the visual training set, calculate the 5 × 5 covariance matrix; according to Eq. 4.30, calculate the covariance distances d_CovD(Pi, Pj) and produce the 1 × N covariance distance vector D_CovD of the testing sample.
Step 2: Extract the tactile features: calculate the DTW distance between the tactile sequence of the testing sample and each training sample
according to Eq. 4.34, obtaining the 1 × N DTW distance vector D_DTW of the testing sample (N is the number of training samples).
Step 3: Visual feature kernel: the same as Eq. 4.35.
Step 4: Tactile feature kernel: the same as Eq. 4.36.
Step 5: Visual-tactile fusion: the same as Eq. 4.37.
Step 6: Classification: substitute Kernel(K_CovD, K_DTW) and W into f(x) = [K(x, x1), . . . , K(x, xN)]^T W to determine f(x); the label label(x) of the testing sample is then given by the largest component of f(x).
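A compact NumPy sketch of the training and recognition steps is given below, assuming the distance matrices of Steps 1-2 have already been computed and that Y is a one-hot label matrix; the gamma and C values are placeholders.

```python
import numpy as np

def train_kernel_elm(D_cov, D_dtw, Y, gamma=1.0, C=100.0):
    """Training steps 3-6: Gaussian kernels on the two distance matrices,
    product fusion, and W = (I/C + Omega)^(-1) Y."""
    K_cov = np.exp(-gamma * D_cov ** 2)        # Eq. (4.35)
    K_dtw = np.exp(-gamma * D_dtw ** 2)        # Eq. (4.36)
    omega = K_cov * K_dtw                      # Eq. (4.37), element-wise product
    n = omega.shape[0]
    return np.linalg.solve(np.eye(n) / C + omega, Y)

def predict_kernel_elm(d_cov, d_dtw, W, gamma=1.0):
    """Recognition step 6: d_cov, d_dtw are the 1 x N distance vectors of a
    test sample to all training samples; returns the predicted class index."""
    k = np.exp(-gamma * d_cov ** 2) * np.exp(-gamma * d_dtw ** 2)
    f = k @ W                                  # f(x) = [K(x, x_1), ..., K(x, x_N)] W
    return int(np.argmax(f))
```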
4.3.5 Experimental Results

In this section, the constructed dataset and the experimental validation results are introduced. We select fifteen common objects as experimental items. From left to right and from top to bottom, they are: a package of tea, a package of milk tea, a package of coffee, a Jierou paper towel, a tea box, an empty water bottle, a full water bottle, a cylindrical box, a rectangular biscuit box, a tiger doll, a human doll, a bottle, a Xinxiangyin paper towel, a tennis ball, and a blue ball, as shown in Fig. 4.23. We fix the camera in one position, place the object on an electronic turntable, and capture a video, as shown in Fig. 4.24. The video is then converted into pictures, which gives images of the object at different angles. We randomly extract one picture every 10–20°, so 30 pictures of each object are extracted as the visual dataset.
Fig. 4.23 Experiment objects
Fig. 4.24 Placing an object on an electronic turntable to collect images
Fig. 4.25 Accuracy under different training set/test set division ratios
In the tactile data collection process, in order to ensure the objectivity of the experiment, a total of 6 individuals are involved. Each person wears the tactile glove and grasps each experimental object 5 times, so each object has 30 tactile sequences. The training set and the testing set are randomly selected from these 30 sequences. We compare the classification performance of the ELM classification model under three modes: tactile, image, and image-tactile fusion. We list the accuracy at five different division ratios, namely training set/testing set = 5/5 (15:15), 6/4 (18:12), 7/3 (21:9), 8/2 (24:6), and 9/1 (27:3). The histogram is shown in Fig. 4.25: the dark blue bars represent the classification results based on the visual image information, the light blue bars the results based on the tactile information, and the green bars the results based on the visual-tactile fusion algorithm. The histogram shows that the fusion algorithm has higher recognition accuracy than either the tactile-only or the image-only algorithm.
Fig. 4.26 The confusion matrix of the kernel ELM classification results using image features with a 5:5 split of the training and test sets
Figure 4.26 shows the confusion matrix of the kernel ELM classification results using the image features alone with a 5:5 split between the training set and the testing set. As can be seen from the figure, the classification accuracy of the Jierou paper towel, blue ball, tennis ball, human doll, tiger doll, bottle, coffee, rectangular biscuit box, and cylindrical box is 100%. One of the main reasons is that the visual covariance descriptors of these nine objects differ clearly in their sub-characteristics, so they are easy to classify; objects can thus be classified well using image information. However, the accuracy for the empty bottle is only 40%: the appearances of the empty and full water bottles are very similar, so their covariance descriptors are also very similar, and 40% of the empty-bottle samples are misclassified as the panda toy. The classification accuracy of the package of milk tea is 0%, which is highly confused: 40% of its samples are misclassified as the package of tea, 40% as the Xinxiangyin paper towel, and the remaining 20% as the human doll.

Figure 4.27 shows the confusion matrix of the kernel ELM classification results using the tactile features alone with a 5:5 split between the training set and the testing set. From the confusion matrix, the accuracy of identifying the package of milk tea is 100%, indicating that milk tea is easily classified by touch. However, the Jierou paper towel and the Xinxiangyin paper towel are made of similarly soft plastic-wrapped material, so their tactile information is similar, as shown in Fig. 4.27, and it is easy to confuse the two through tactile information: the accuracy for the Jierou paper towel is 20%, with 80% of its samples misclassified as the Xinxiangyin paper towel.
Fig. 4.27 The confusion matrix of the kernel ELM classification results using tactile features with a 5:5 split of the training and test sets
The confusion matrix also shows that the classification accuracy of the green cylindrical box is only 40%, which causes serious confusion: 20% of its samples are misclassified as the tiger doll, 20% as the tea box, and the remaining 10% as the empty bottle. The reason is that the green cylindrical box is light, so the contact force between the object and the tactile sensors of the glove is relatively small and varies between grasps; the collected feature values are therefore small, and the tactile modality alone is easily misled. Fusion with the visual modality thus becomes a way to improve the classification accuracy. It is worth noting that the accuracy for the tennis ball in the tactile confusion matrix is only 40%, whereas its accuracy in the visual confusion matrix is 100%; conversely, the classification accuracy of the package of milk tea is 100% in the tactile confusion matrix but 0% in the visual confusion matrix. Each of the tactile and visual modalities therefore has its own advantages.

Figure 4.28 shows the confusion matrix of the fusion-modality kernel ELM with a 5:5 ratio of training set to testing set. It can clearly be seen that the classification results based on the fused visual-tactile information are better than the single-modality results of either the tactile or the visual images. The classification accuracy of the Jierou paper towel, bottle, coffee, milk tea, tea box, rectangular biscuit box, and cylindrical box reaches 100%. For the empty bottle, whose tactile-only accuracy is 60% and visual-only accuracy is 40%, the accuracy of the visual-tactile fusion classification reaches 80%. The combination of the visual and tactile modalities can therefore improve the single-modality classification accuracy.
Fig. 4.28 The confusion matrix of the fusion-modality kernel ELM classification results using visual-tactile information with a 5:5 split of the training and test sets
4.4 Summary

The applications of the developed wearable devices are introduced in this chapter. The inertial-sensor-based wearable device is applied to gesture recognition: ELM-based methods for static and dynamic gestures are proposed, and performance evaluations verify that the proposed data glove can accurately capture 3D gestures and that the ELM-based methods can accurately recognize them. Meanwhile, a CNN-based sign language recognition method is derived and a multi-modal dataset is built to prove its effectiveness. Furthermore, the wearable fingertip tactile device is introduced and eight operation states are recognized for interaction. Finally, the tactile glove is designed to conveniently collect tactile information during grasping, and a visual-tactile information fusion algorithm is proposed to establish an object recognition system: a multi-layer time-series model expresses the tactile sequences, covariance descriptors characterize the image features, and a kernel ELM classification algorithm fuses the two kinds of modal information to classify objects. A visual-tactile dataset consisting of 15 objects is established, and the experimental results show the superior performance of the system based on the proposed tactile glove.
References

1. Belgioioso G, Cenedese A, Cirillo GI, Fraccaroli F, Susto GA (2015) A machine learning based approach for gesture recognition from inertial measurements. In: Proceedings of the 53rd IEEE conference on decision and control 2015, pp 4899–4904. https://doi.org/10.1109/CDC.2014.7040154
2. Luzanin O, Plancak M (2014) Hand gesture recognition using low-budget data glove and cluster-trained probabilistic neural network. Assembly Autom 34. https://doi.org/10.1108/AA-03-2013-020
3. Jiang L, Han F (2010) A hand gesture recognition method based on SVM. Comput Aided Drafting Design and Manuf 20(2):85–91
4. Huang G-B, Wang DH, Lan Y (2011) Extreme learning machines: a survey. Int J Machine Learn Cyb 2(2):107–122
5. Huang G-B, Ding X, Zhou H (2010) Optimization method based extreme learning machine for classification. Neurocomputing 74:155–163. https://doi.org/10.1016/j.neucom.2010.02.019
6. Zhang L, Zhang D (2015) Evolutionary cost-sensitive extreme learning machine and subspace extension. IEEE Trans Neural Netw Learn Syst 28:3045–3060. https://doi.org/10.1109/TNNLS.2016.2607757
7. Huang G-B, Zhou H, Ding X, Zhang R (2012) Extreme learning machine for regression and multiclass classification. IEEE Trans Syst Man Cyb Part B Cyb 42(2):513–29
8. Zhang L, Zhang D (2016) Robust visual knowledge transfer via extreme learning machine based domain adaptation. IEEE Trans Image Proc 25:1–1. https://doi.org/10.1109/TIP.2016.2598679
9. Zhang L, Zhang D (2014) Domain adaptation extreme learning machines for drift compensation in E-Nose systems. IEEE Trans Instrum Meas 64:1790–801. https://doi.org/10.1109/TIM.2014.2367775
10. Bartlett P (1998) The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans Inf Theory 44:525–536. https://doi.org/10.1109/18.661502
11. Lacey S, Campbell C, Sathian K (2007) Vision and touch: multiple or multisensory representations of objects? Perception 36:1513–21. https://doi.org/10.1068/p5850
12. Allen P (1984) Surface descriptions from vision and touch. In: 1984 IEEE international conference on robotics and automation, vol 1, pp 394–397. https://doi.org/10.1109/ROBOT.1984.1087191
13. Allen P (1988) Integrating vision and touch for object recognition tasks. Int J Robot Res 7(6):15–33. https://doi.org/10.1177/027836498800700603
14. Yamada Y, Ishiguro A, Uchikawa Y (1993) A method of 3D object reconstruction by fusing vision with touch using internal models with global and local deformations. In: 1993 IEEE international conference on robotics and automation, pp 782–787. https://doi.org/10.1109/ROBOT.1993.291939
15. Prats M, Sanz P, del Pobil AP (2009) Vision-tactile-force integration and robot physical interaction. In: 2009 IEEE international conference on robotics and automation, pp 3975–3980. https://doi.org/10.1109/ROBOT.2009.5152515
16. Ilonen J, Bohg J, Kyrki V (2013) Fusing visual and tactile sensing for 3-D object reconstruction while grasping. In: Proceedings of the IEEE international conference on robotics and automation, pp 3547–3554. https://doi.org/10.1109/ICRA.2013.6631074
17. Kroemer O, Lampert C, Peters J (2011) Learning dynamic tactile sensing with robust vision-based training. IEEE Trans Robot 27:545–557. https://doi.org/10.1109/TRO.2011.2121130
18. Bjorkman M, Bekiroglu Y, Hogman V, Kragic D (2013) Enhancing visual perception of shape through tactile glances. In: IEEE international conference on intelligent robots and systems. https://doi.org/10.1109/IROS.2013.6696808
19. Bhattacharjee T, Shenoi A, Park D, Rehg J, Kemp C (2015) Combining tactile sensing and vision for rapid haptic mapping. In: 2015 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 1200–1207. https://doi.org/10.1109/IROS.2015.7353522
20. Woods A, Newell F (2004) Visual, haptic and cross-modal recognition of objects and scenes. J Physiol Paris 98:147–59. https://doi.org/10.1016/j.jphysparis.2004.03.006
21. Gao Y, Hendricks L, Kuchenbecker K, Darrell T (2016) Deep learning for tactile understanding from visual and haptic data. In: IEEE international conference on robotics and automation, pp 536–543. https://doi.org/10.1109/ICRA.2016.7487176
22. Guler P, Bekiroglu Y, Gratal X, Pauwels K, Kragic D (2014) What's in the container? Classifying object contents from vision and touch. In: IEEE international conference on intelligent robots and systems, pp 3961–3968. https://doi.org/10.1109/IROS.2014.6943119
23. Tuzel O, Porikli F, Meer P (2006) Region covariance: a fast descriptor for detection and classification. In: 2006 European conference on computer vision (ECCV), pp 589–600. https://doi.org/10.1007/11744047_45
Part III
Manipulation Learning from Demonstration
This part of the book focuses on the manipulation learning problems. It comprises three chapters. Chapter 5 tackles the problem of manipulation learning from teleoperation demonstration of wearable device using dynamical movement primitive. Chapter 6 addresses the problem of manipulation learning from visual-based teleoperation demonstration by developing deep neural networks methodology. Chapter 7 focuses on learning from wearable-based indirect demonstration.
Chapter 5
Learning from Wearable-Based Teleoperation Demonstration
Abstract In order to effectively transfer human manipulation skills to a robot, in this chapter we investigate skill-learning functions via our proposed wearable device. A robotic teleoperation system that is implemented by directly controlling the speeds of the motors is developed. Then a rotation-invariant dynamical-movement-primitive method is presented for learning interaction skills. Imitation learning experiments are designed and implemented, and the experimental results verify the effectiveness of the proposed method.
5.1 Introduction

Robotic teleoperation provides a way to employ human intelligence in the remote control of a robot. Human intelligence is used to make decisions and control the robot, especially in unstructured dynamic environments. Teleoperation is very useful in situations where a human cannot be present in the workspace, e.g., in nuclear power plants, in space, and at the bottom of the sea, and it is expected to play a more significant role in a broader range of areas. In robotic teleoperation, the robot receives instructions from a distant human operator through sensors or a control mechanism over a communication network; meanwhile, the robot sends its status and environment information back to the operator as feedback. Only efficient interaction between the robot and the human operator can make the teleoperation system perform well.

There are a variety of human–machine interfaces for robotic teleoperation. Joysticks, dials, and robot replicas are commonly used [1]. However, these traditional mechanical devices always require unnatural hand or arm motion to complete a teleoperation task. Recently, the rapid development of motion sensors and vision-based techniques has promised a more natural and intuitive alternative for robotic teleoperation. In vision-based teleoperation, it is easy for visual systems to capture the contours of the human, but the captured images cannot provide enough information to track hands in 3D space, because deriving spatial position information leads to multiple 2D-3D mapping solutions. The well-developed RGBD camera
has been widely used in robotic systems nowadays [2]. This camera consists of a laser projector and a CMOS sensor, which enables it to monitor 3D motions [3]. Du et al. [4] and Al-Faiz and Shanta [5] have successfully applied the Kinect sensor to teleoperate robotic manipulators. However, vision-based systems are often affected by varying light conditions, changing backgrounds, and large amounts of clutter, which makes extracting relevant information from visual signals difficult and computationally intensive. Furthermore, the occlusion of fingers results in a non-observability problem that leads to poor estimation of the hand pose [6].

Another popular way is to resort to devices based on motion sensors. Inertial sensors, electromagnetic tracking sensors, EMG sensors, Leap Motion (LM) devices, or glove instruments with angle sensors are used to track the operator's hand or arm motion for teleoperation [7–9]. In particular, inertial and magnetic sensors, which are low-cost, small, compact, and low in energy consumption, have been widely used. Miller et al. [10] develop a modular system for untethered real-time kinematic motion capture using IMUs, which has been successfully applied to the NASA Robonaut teleoperation system. Kobayashi et al. [11] use IMUs for motion capture of the human arm and a CyberGlove for hand motion capture; the hand/arm robot is teleoperated according to the motion of the operator's hand and arm. Du et al. [12] propose a system that uses a position sensor and an IMU to track the operator's hand and then controls the robot with the obtained orientation and position, which lets the operator focus on the manipulation task instead of thinking about which gesture to use. It can be seen that multi-modal sensing information is widely used in robotic teleoperation systems as sensing technology develops. However, an inertial and magnetic motion capture device that simultaneously captures the motion of the arm, the hand, and the fingers has not been presented in previous research.

Learning by demonstration has proved to be an efficient way for robots to learn new tasks. With a mapping approach, the robot needs no reprogramming to follow a person's behavior, which makes it the most intuitive method for capturing human movement trajectories and mapping them to the robot. However, human and robotic arms have different mechanical structures, so the kinematic model of a robot differs from that of a human, and the sensory information must be mapped to the movement space of the devices. A number of kinematic mapping methods have been proposed, including position mapping [13], joint angle mapping [14], and pose mapping [15]. The authors of [16] propose a mapping method that uses locality-preserving projections and KNN regression, which achieves relatively good results. Bosci et al. [17] use a Procrustes analysis algorithm to resolve linear mappings, and an efficient multi-class heterogeneous domain adaptation method has also been proposed [18]. However, the calculations of these methods are slow. Furthermore, position control in robotic movement planning takes a lot of time to compute the inverse kinematics solution and to track continuous points [19]. We therefore propose a method in which the robot's movements are directly controlled through the joint speeds, so that the robotic teleoperation system gains low latency.
It is important to recognize that skill learning [20, 21] is the key to the whole system. Numerous skill-learning methods have been proposed in recent years. Metzen et al. [22] develop hierarchical and transfer learning methods that enable robots to learn a repertoire of versatile skills. Their work provides a framework for robot learning of human behavior. Robot learning based on human demonstrations employs behavior segmentation methods. There are also methods that employ imitation learning [23, 24] and reinforcement learning [25]. In this chapter, to learn movement primitives for robotic interaction, we construct a model based on dynamic movement primitives (DMPs), which provides a generic framework for motor representation based on nonlinear dynamic systems. DMPs can model both discrete and rhythmic movements, and have been successfully applied to a wide range of tasks, including biped locomotion, drumming, and tennis swings. In [26], Herzog et al. presented the trajectory generation method, which used probabilistic theory to modify DMPs. Fanger et al. [27] propose a new approach to combine DMPs with Gaussian processes to enable robots to adapt their roles and cooperation behavior depending on their individual knowledge. In this chapter, any recorded movement trajectory in the system can be represented by a set of differential equations. Then, a trajectory of human motion can be expressed with fewer characteristic variables, thereby realizing the learning of motion characteristics. Finally, we provide a movement that has been learned within certain start and end points, and realize manipulation learning for human–robot interaction using a wearable device.
5.2 Teleoperation Demonstration

The proposed wearable device can capture more gesture information than traditional wearable devices. Hence, the motions of the human arm and hand can be fully utilized by robotic teleoperation systems; in particular, robotic arm-hand systems can imitate the motions of the human arm and hand. The proposed systems avoid complex motion planning while the robots perform complex tasks, which makes them more effective and more convenient for fulfilling such tasks.
5.2.1 Teleoperation Algorithm

The teleoperation of an anthropomorphic robotic system involves a large number of DOFs, e.g., up to 11 for a 7-DOF arm plus 4 DOFs of a hand with 3 fingers. Therefore, the most intuitive way to control these robots is to capture the movements of a human operator and map them to the robotic system. However, the kinematic structure of the robot differs from that of the operator, so the approach used to map the sensory information obtained from the operator's movements onto the movement space of the device is important. Three methods
have been commonly used for kinematic mapping: joint angle mapping (joint-to-joint mapping) [28], pose mapping [29], and position mapping (point-to-point mapping) [30].

In joint-to-joint mapping, each sensor measuring the operator's movements is directly associated with a joint of the robot. Joint mapping is used when the slave robot has kinematics similar to the human's. If the human and robot joints have a clear correspondence, the human joint angles can be imposed directly onto the robot joints with little or no transformation [31, 32]. This mapping is most useful for power grasps and is limited if the robot is non-anthropomorphic [33, 34]. Joint mapping is also named direct angle mapping, which retrieves the angle values of certain joints from the human posture and directly uses these joint values to control the robot joints [35, 36].

In pose mapping, each pose of the operator is associated with a predefined pose of the robot. Pose mapping attempts to replicate the pose of the human with the robot, which is appealing because, unlike joint mapping, it attempts to interpret the function of the human grasp rather than replicate hand position. Pao and Speeter define transformation matrices relating human and robot poses, using least-squared-error compensation when this transformation is not precise [37]. Others use neural networks to identify the human pose and map it to the robot either through another neural network [38] or through a pre-programmed joint-to-joint mapping [39]. Outside of a discrete set of known poses, pose mapping can lead to unpredictable motions, so it is usually used when only simple grasping poses are required. Furthermore, the above mappings that use neural networks require classification of the human pose before it is mapped to the robotic hand; if the classifier misidentifies the pose, the robot hand will move in undesirable ways. Our method also attempts to replicate human shape rather than joint positions, and we do not require discrete classification of the human pose before mapping.

In point-to-point mapping, the positions of particular points of the operator's arm or hand are replicated by predefined points on the robot. When precise positioning is required, position mapping is generally employed. The endpoint position is found from the joint angle readings and the forward kinematics of the human, and the corresponding endpoint positions of the robot are achieved by computing its inverse kinematics. This widely used forward-inverse kinematics approach, which maps the position of the human end effector into the workspace of the robot and then calculates the robot joint values via inverse kinematics, may have some negative aspects because of the heterogeneous configurations of the human and the robot [40, 41]: (1) the teleoperation is unintuitive, and extensive training is needed for the operator to operate the system correctly; (2) the operator may become error-prone under heavy workload, especially since the teleoperated robot is likely to work in an unstructured and constrained environment with many obstacles.
5.2.2 Demonstration

In this subsection, we address the problem of mapping from the wearable device to the Baxter robot. This robot has 7 DOFs and seven force-torque modules that can rotate around their axes. Although the human arm also has seven DOFs, the kinematic structure of the robot arm differs from that of a human, so it is important to construct a map between the movement information of the wearable device and that of the robot arm. Figure 5.1 shows the structure of the mapping system. The data collected by the wearable device use the Earth coordinate system, while the movement of the robotic arm uses its own base coordinate system. We utilize a joint-to-joint mapping method [42] for the upper arm. Table 5.1 shows the proposed mapping structure, which converts the data collected from the device to the 7-DOF robotic system.

The traditional robotic control system uses position mode. Although this control method is simple, it has a slow response time and other shortcomings: once the target point is determined, the robot can only run at a fixed safe speed to the target point by solving the inverse kinematics, which makes its actions slow and somewhat rigid. We propose a novel method in which we control the speed of each articulated motor of the robot. We use a proportional-integral-derivative (PID) controller [43] to improve the performance of the control system:

u(k) = Kp [ err(k) + (T/Ti) Σ_{n=0}^{k} err(n) + (Td/T)(err(k) − err(k − 1)) ]   (5.1)

Fig. 5.1 Teleoperation scheme of 7-DOF robotic system
Table 5.1 Mapping commands of the 7-DOF robotic teleoperation system

Human arm | Baxter robotic joint | Data glove signal
Upper-arm | 1 | Yaw
Upper-arm | 2 | Yaw
Upper-arm | 3 | Yaw
Forearm | 4 | Yaw of the upper-arm/forearm relative angle
Forearm | 5 | Roll of the forearm/upper-arm relative angle
Palm | 6 | Pitch of the palm/forearm relative angle
Palm | 7 | Roll of the palm/forearm relative angle
where Kp is the proportionality coefficient, T is the sampling interval, and err(k) is the deviation at step k, i.e., the difference between the current position and the target position. The integral term accumulates err(k) + err(k − 1) + . . . + err(0), and the differential term is (err(k) − err(k − 1))/T. Finally, we adjust the speed via speed feedback. In this way, our teleoperation system minimizes the delay of the human imitation.
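A minimal sketch of this discrete PID speed controller is shown below; the gains, sampling interval, and the single-joint usage example are placeholders chosen only for illustration.

```python
class PIDSpeedController:
    """Discrete PID controller of Eq. (5.1) producing a joint speed command
    from the position deviation err(k)."""

    def __init__(self, kp, ti, td, dt):
        self.kp, self.ti, self.td, self.dt = kp, ti, td, dt
        self.err_sum = 0.0      # accumulates err(0) + ... + err(k)
        self.err_prev = 0.0

    def update(self, target, current):
        err = target - current
        self.err_sum += err
        u = self.kp * (err
                       + (self.dt / self.ti) * self.err_sum
                       + (self.td / self.dt) * (err - self.err_prev))
        self.err_prev = err
        return u

# One controller per joint: the mapped glove angle is the target and the
# measured joint angle is the feedback.
controller = PIDSpeedController(kp=2.0, ti=5.0, td=0.05, dt=0.01)
speed_cmd = controller.update(target=0.8, current=0.75)
```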
5.3 Imitation Learning

Trajectory planning is one of the primary problems in the fine operation of a manipulator. Trajectory planning based on movement primitives requires complex calculations and accurate modeling. In this section, we combine human-teaching trajectory information with DMPs and propose a trajectory learning control method based on DMPs.
5.3.1 Dynamic Movement Primitives

DMPs have been often employed to solve robot learning problems because of their flexibility. In [44], DMPs are modified to allow a robot to learn cooperative table tennis from physical interaction with a human. Another study uses reinforcement learning to combine DMP sequences so that the robot could perform more complex tasks [45]. A style-adaptive trajectory generation approach is proposed based on DMPs, by which the style of the reproduced trajectories can change smoothly as the new goal changes [46]. As mentioned in [47], optimal demonstration is difficult to obtain and multiple demonstrations can encode the ideal trajectory implicitly.
Complex motion can be considered as consisting of primitive building blocks executed either in sequence or in parallel, and DMPs are mathematical formalizations of these primitive movements. The difference between a DMP and previously proposed building blocks is that each DMP represents a nonlinear dynamic system. The basic idea is to take a dynamic system with simple, stable behavior and add a second system that makes it follow a desired trajectory. DMPs can be divided into two types, discrete and rhythmic; here we use discrete DMPs. They must be expressed using a convenient and stable dynamic system that has a nonlinear control term. Here, we use one of the simplest dynamic systems, the damped spring model, in which a damped spring oscillator is used as a skeleton to fit any trajectory. The damped spring oscillator model is as follows:

m d²x/dt² = −kx − r dx/dt   (5.2)
To simplify the system, this formula is rewritten as follows:

τ ÿ = αz (βz (g − y) − ẏ)   (5.3)
where y is the state of our system, τ is a time constant, ẏ is the velocity of the joint trajectory, ÿ is the acceleration, g is the final point of the trajectory, and αz and βz are gain terms. Then, we add a forcing term that shapes the trajectory:

τ ÿ = αz (βz (g − y) − ẏ) + f   (5.4)
After transformation, we can build the conversion system:

τ ż = αz (βz (g − y) − z) + f,   τ ẏ = z   (5.5)
The key to the DMP framework is the use of an additional nonlinear system to define the change in the forcing function over time. Here, we import a canonical system, as follows:

ẋ = −αx x   (5.6)
The forcing function is defined as a canonical system function, which is similar to a radial basis function:

f(t) = ( Σ_{i=1}^{N} Ψi(x) wi ) / ( Σ_{i=1}^{N} Ψi(t) )   (5.7)
By importing the canonical system, the forcing function becomes

f(x) = ( Σ_{i=1}^{N} Ψi(x) wi ) / ( Σ_{i=1}^{N} Ψi(x) ) · x (g − y0)   (5.8)

where wi is the weight of each kernel function and Ψi(x) = exp( −(1/(2σi²)) (x − ci)² ).
In fitting the whole trajectory, g − y0 provides a scale attribute. The initial value of x is 1, and it decays to 0 as time goes to infinity, which means the forcing term vanishes as the system approaches the target point. Next, we obtain the target forcing term by solving the dynamic equation for the demonstrated trajectory:

f_target = τ² ÿ_demo − αz (βz (g − y_demo) − τ ẏ_demo)   (5.9)
Finally, we perform a local linear regression on the loss function:

Ji = Σ_{t=1}^{P} Ψi(t) ( f_target(t) − wi ξ(t) )²   (5.10)
where ξ(t) = x(t)(g − y0). By optimizing the loss function with locally weighted regression, we obtain the following:

wi = (s^T Γi f_target) / (s^T Γi s)   (5.11)
This process can fit any trajectory by using different weights for multiple kernels. If 20 kernel functions are used to represent a trajectory, the trajectory can be uniquely represented by a 20-dimensional weight vector.
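The sketch below implements a minimal discrete DMP for a single dimension along the lines of Eqs. (5.3)-(5.11): it fits the forcing term of one demonstrated trajectory with locally weighted regression and then reproduces a trajectory toward a new goal. The gain values, kernel placement, and integration scheme are assumptions, not the exact settings used in the experiments.

```python
import numpy as np

class DiscreteDMP:
    """Minimal discrete DMP: fit the forcing term of one demonstrated 1-D
    trajectory with N kernels, then reproduce it for new start/goal points."""

    def __init__(self, n_kernels=20, alpha_z=25.0, beta_z=6.25, alpha_x=1.0):
        self.N, self.az, self.bz, self.ax = n_kernels, alpha_z, beta_z, alpha_x
        self.c = np.exp(-alpha_x * np.linspace(0, 1, n_kernels))   # kernel centers
        self.h = 1.0 / np.diff(self.c, append=self.c[-1] * 0.5) ** 2
        self.w = np.zeros(n_kernels)

    def _psi(self, x):
        return np.exp(-self.h * (x - self.c) ** 2)

    def fit(self, y_demo, dt, tau=1.0):
        self.y0, self.g = y_demo[0], y_demo[-1]
        yd = np.gradient(y_demo, dt)
        ydd = np.gradient(yd, dt)
        x = np.exp(-self.ax * np.arange(len(y_demo)) * dt)          # Eq. (5.6)
        # Eq. (5.9): target forcing term from the demonstration.
        f_target = tau ** 2 * ydd - self.az * (self.bz * (self.g - y_demo) - tau * yd)
        xi = x * (self.g - self.y0)
        psi_all = np.exp(-self.h * (x[:, None] - self.c) ** 2)
        for i in range(self.N):                                     # Eq. (5.11)
            s = xi * psi_all[:, i]
            self.w[i] = (s @ f_target) / (s @ xi + 1e-10)
        return self

    def rollout(self, y0, g, dt, n_steps, tau=1.0):
        y, z, x = y0, 0.0, 1.0
        traj = []
        for _ in range(n_steps):
            psi = self._psi(x)
            f = (psi @ self.w) / (psi.sum() + 1e-10) * x * (g - y0)  # Eq. (5.8)
            z += dt / tau * (self.az * (self.bz * (g - y) - z) + f)  # Eq. (5.5)
            y += dt / tau * z
            x += dt * (-self.ax * x)                                 # Eq. (5.6)
            traj.append(y)
        return np.array(traj)

# Fit one demonstrated joint trajectory and reproduce it toward a new goal.
demo = np.sin(np.linspace(0, np.pi, 200))
dmp = DiscreteDMP().fit(demo, dt=0.005)
new_traj = dmp.rollout(y0=0.0, g=1.2, dt=0.005, n_steps=200)
```

With 20 kernels per dimension, each learned trajectory is indeed summarized by its 20-dimensional weight vector, which is what the imitation learning system stores and reuses.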
5.3.2 Imitation Learning Algorithm

With the proposed teleoperation system, an imitation learning system based on the DMP is designed. Firstly, we use teleoperation as the means of teaching. In this approach, unlike other teaching methods, the learned robotic actions retain human motion characteristics, which avoids rigid actions and makes the robot's behavior more consistent with human behavior. Then, we record the movement trajectory of the robotic arm through the teleoperation system. Combined with DMPs, we can extract and learn the trajectory-invariant features. The system can learn and reproduce any movement primitive by setting its start and end points. In addition, the DMP is preprocessed to achieve rotational invariance, even though it inherently converges to the
attractor. We record the start and end points of the demonstration. Then, the system calculates the rotation matrix between the directions of the reproducing and teaching operations:

P \cdot Q = |P||Q| \cos\theta, \quad \theta = \arccos\left(\frac{P \cdot Q}{|P||Q|}\right)    (5.12)

where P and Q are the direction vectors determined by the start and end points of the teaching and reproducing motions, respectively, and θ is the angle between them. Then, we obtain the rotation matrix:

R(\theta) = I + \tilde{\omega} \sin\theta + \tilde{\omega}^2 (1 - \cos\theta)    (5.13)

where I is the unit matrix and

\tilde{\omega} = \begin{bmatrix} 0 & -\omega_z & \omega_y \\ \omega_z & 0 & -\omega_x \\ -\omega_y & \omega_x & 0 \end{bmatrix}

Finally, the motion direction and rotation matrix are combined to realize rotation invariance of the DMP. Figure 5.2 shows a flowchart of the DMP-based skill learning and generation process.
Fig. 5.2 DMP-based skill learning and generation process
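As a quick illustration of the rotation-invariance preprocessing in (5.12)-(5.13), the sketch below computes the angle between the demonstrated and reproduced direction vectors and builds the corresponding rotation matrix with the Rodrigues formula. The function name and tolerance are illustrative, not part of the book's implementation.

import numpy as np

def rotation_between(p, q):
    """Rotation matrix that rotates direction vector p onto direction vector q."""
    p = p / np.linalg.norm(p)
    q = q / np.linalg.norm(q)
    theta = np.arccos(np.clip(p @ q, -1.0, 1.0))         # angle from (5.12)
    axis = np.cross(p, q)
    n = np.linalg.norm(axis)
    if n < 1e-8:                                          # parallel vectors: no rotation needed
        return np.eye(3)
    wx, wy, wz = axis / n
    w_tilde = np.array([[0.0, -wz,  wy],
                        [wz,  0.0, -wx],
                        [-wy, wx,  0.0]])                 # skew-symmetric matrix of the rotation axis
    # Rodrigues formula, matching (5.13)
    return np.eye(3) + np.sin(theta) * w_tilde + (1 - np.cos(theta)) * w_tilde @ w_tilde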
5.4 Experimental Results

Based on our previous work, we design the following experiments to test the skill-learning performance. For the experimental platform, we use the standard Robot Operating System (ROS). The system is organized as ROS nodes with the following roles: a node for wearable device acquisition and analysis, one for data receiving and mapping conversion, one for trajectory collection and execution, and one for trajectory learning and generation. We implement the main algorithm in C++, and the operation and execution of the robot in Python. A minimal sketch of one such node is given below.
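The following Python (rospy) fragment sketches the trajectory-collection node in such a layout: it subscribes to the robot's joint states during teleoperation and republishes the recorded trajectory. The topic names and message choices are illustrative assumptions, not the ones used in the book's system.

import rospy
from sensor_msgs.msg import JointState
from trajectory_msgs.msg import JointTrajectory, JointTrajectoryPoint

class TrajectoryCollector:
    """Record joint states during teleoperation and publish them as a trajectory."""
    def __init__(self):
        self.points = []
        rospy.Subscriber('/robot/joint_states', JointState, self.on_joint_state)
        self.pub = rospy.Publisher('/demo/trajectory', JointTrajectory, queue_size=1)

    def on_joint_state(self, msg):
        # store one trajectory point per received joint-state message
        self.points.append(JointTrajectoryPoint(positions=list(msg.position)))

    def publish_demo(self, joint_names):
        # publish the whole recorded demonstration for the learning node
        self.pub.publish(JointTrajectory(joint_names=joint_names, points=self.points))

if __name__ == '__main__':
    rospy.init_node('trajectory_collector')
    collector = TrajectoryCollector()
    rospy.spin()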
5.4.1 Robotic Teleoperation Demonstration

To test the performance of the teleoperation system, the Baxter robot is used. Figure 5.3 shows our experimental results, in which we can see that by utilizing the proposed speed control mode, the delay of the teleoperation system is nearly zero and the robotic trajectories are smoother than those when using the point control mode.
5.4.2 Imitation Learning Experiments

We design an experiment to verify the performance of the modified DMP. First, we teach the robot a movement and record the trajectory by teleoperation, as shown in Fig. 5.4a. The robot can then reproduce the movement trajectory according to the characteristics of the demonstrated trajectory, as shown in Fig. 5.4b. Figure 5.5 shows an analysis of the resulting data, in which the red curve indicates the original trajectory corresponding to Fig. 5.4a and the black curve indicates the reproduced trajectory corresponding to Fig. 5.4b. In the experiments, we used 20 kernel functions to express the trajectory of each dimension.
Fig. 5.3 Teleoperation system
Fig. 5.4 Circular movement process based on imitation learning. (a) Demonstrated teaching movement. (b) Generated movement after learning
Fig. 5.5 Analysis of results (3D plot of the original and reproduced trajectories; axes x, y, z)
To verify the performance of movement-characteristic learning, we design another experiment in which we change the initial state and reproduce new trajectories based on the DMP model. Figure 5.6a shows the original demonstrated movement. Figure 5.6b shows the generated movement using the learned skill with the same initial state as in Fig. 5.6a. Figure 5.6c shows the generated movement with a different initial state. Figure 5.7 shows the trajectories of the three situations. The results confirm the effectiveness of the proposed imitation learning method.
Fig. 5.6 Imitation learning results for a different original state. (a) Demonstrated teaching movement. (b) Generated movement after learning. (c) Generated movement with a different start point
Fig. 5.7 Analysis of results (3D plot of the trajectories for the three situations; axes x, y, z)
5.4.3 Skill-Primitive Library

In imitation learning, the characterization of movement primitives is the key to acquiring robotic skills. Any complicated task consists of smaller tasks, which can be split into many sub-actions. Therefore, in order to perform complex tasks, robots must learn many skills before they can intelligently select the required primitive actions. We therefore build a small motion primitive library of dynamic movement primitives, some of which are shown in Fig. 5.8. To assess the skill generation performance, we extract the movement primitives and find the generated and original trajectories to be essentially the same. The red line in the figure is the original trajectory, and the blue line is the generated trajectory after robotic learning. Thus, in our experiments, the robot learns skills from the teaching actions via the teleoperation system and converts them into its own skills. Using DMPs, our results confirm that complex skill actions can be expressed as multidimensional matrices with 20 parameters per dimension, and the learned skills can be reproduced in any situation, thus realizing robotic skill learning. In addition, after learning the skills, the robot can store them as experience.

In order to evaluate the overall performance, we build a verification and testing system on top of the completed robot learning system. The essence of this learning system is the study of movement primitives, which describe the position of the arm as a time series. Therefore, we use the DTW [48] algorithm to calculate the trajectory distance before and after robotic skill learning. The DTW algorithm finds the minimum overall matching distance between the measured and template feature vectors and then calculates the distance of the optimal matching path. Finally, we estimate the degree of similarity between two trajectories from the distance between them. Since the raw distance between two movement trajectories before and after learning does not intuitively express their similarity, we design the following method for evaluating trajectory similarity (a code sketch follows the algorithm):

Algorithm: Trajectory similarity evaluation method
Input: Template trajectory and test trajectory
Output: Similarity
Step 1: Sample the two trajectories by percentage, taking 50 samples from each.
Step 2: Calculate the relative distance between corresponding sampling points in each region and take the average of all distances, dis_m.
Step 3: Using the average relative distance, adopt an exponential distribution model to represent the similarity, obtaining the similarity evaluation model P = exp(−dis_m).
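The sketch below illustrates the evaluation pipeline just described: a textbook dynamic time warping distance and the exponential similarity model from the algorithm. It is a standard implementation written for illustration, not the exact code used in the book, and the resampling scheme is an assumption.

import numpy as np

def dtw_distance(a, b):
    """Classical dynamic time warping distance between two trajectories
    given as arrays of shape [T, D]."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def trajectory_similarity(template, test, n_samples=50):
    """Similarity model from the algorithm above: resample both trajectories,
    average the pointwise distances, and map the result into (0, 1]."""
    def resample(traj, n):
        idx = np.linspace(0, len(traj) - 1, n)
        return np.array([traj[int(round(i))] for i in idx])
    t1, t2 = resample(np.asarray(template), n_samples), resample(np.asarray(test), n_samples)
    mean_dist = np.mean(np.linalg.norm(t1 - t2, axis=1))
    return np.exp(-mean_dist)                              # P = exp(-dis_m)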
Using the above process, we can evaluate the learning results for each motion primitive obtained through skill learning and stored in the motion primitive
Fig. 5.8 Schematic of movement primitive library. (a) Write “a”. (b) Write “8”. (c) Draw a triangle. (d) Draw a rectangle. (e) Take something. (f) Draw a circle. (g) Greetings. (h) Knocking
Table 5.2 Experimental analysis of movement primitive library

Movement primitive   Distance (cm)   Similarity (%)
Write "a"            2.2674          98.65
Write "8"            2.2403          98.76
Draw a triangle      1.4883          98.85
Draw a rectangle     1.4765          98.90
Take something       2.2067          98.72
Draw a circle        2.6711          98.36
Greetings            7.3754          95.69
Knocking             3.4330          97.59
Average              2.8948          98.19
library. Table 5.2 shows the experimental evaluation results. The results demonstrate the effectiveness of the proposed trajectory learning system, which is built on the extraction of movement features. After mastering the skills in the primitive library, the robot can rebuild a trajectory on demand. The table shows that the overall average DTW distance between the original and generated movements is 2.8948 cm, and the average similarity of the skills reaches 98.19%. Therefore, the proposed learning system shows strong learning ability.
5.5 Summary

This chapter presents the design, implementation, and experimental results of a robotic teleoperation system using a wearable device that consists of accelerometers, angular rate sensors, and magnetometers. The proposed device covers all segments of the arm and hand. We derive the 3D arm and hand motion estimation algorithms from the proposed kinematic models of the arm, hand, and fingers, so that the gesture attitudes and positions can be determined. Unlike other teleoperation systems, it captures the motion of the hand and the arm simultaneously. With the proposed teleoperation system, we can transfer human skills to the robot efficiently. Additionally, we teleoperate the robot by using the wearable device to directly control the speed of the motors, which effectively reduces the teleoperation delay. We then propose an imitation learning system based on the rotation-invariant dynamic movement primitive method. We perform robotic teleoperation demonstrations and imitation learning experiments, and build a human–robot interaction system. Experimental results verify the effectiveness of the proposed method and illustrate that the operator can teleoperate the robots in a natural and intuitive manner.
References 1. Cho K-B, Lee B (2012) Intelligent lead: a novel HRI sensor for guide robots. Sensors 12:8301– 8318. https://doi.org/10.3390/s120608301 2. Guo D, Kong T, Sun F, Liu H (2016) Object discovery and grasp detection with a shared convolutional neural network. In: IEEE international conference on robotics and automation (ICRA), pp 2038–2043. https://doi.org/10.1109/ICRA.2016.7487351 3. Yao Y, Fu Y (2014) Contour model-based hand-gesture recognition using the Kinect sensor. IEEE Trans Circ Syst Video Technol 24:1935–1944. https://doi.org/10.1109/TCSVT.2014. 2302538 4. Du G, Zhang P, Mai J, Li Z (2012) Markerless Kinect-based hand tracking for robot teleoperation. Int J Adv Robot Syst 9:1. https://doi.org/10.5772/50093 5. Al-Faiz M, Shanta A (2015) Kinect-based humanoid robotic manipulator for human upper limbs movements tracking. Intell Control Autom 06:29–37. https://doi.org/10.4236/ica.2015. 61004 6. Erol A, Bebis G, Nicolescu M, Boyle R, Twombly X (2007) Vision-based hand pose estimation: a review. Comput Vis Image Underst 108:52–73. https://doi.org/10.1016/j.cviu. 2006.10.012 7. Kobayashi F, Hasegawa K, Nakamoto H, Kojima F (2014) Motion capture with inertial measurement units for hand/arm robot teleoperation. Int J Appl Electromagn Mech 45:931– 937. https://doi.org/10.3233/JAE-141927 8. Zhang P, Liu X, Du G, Liang B, Wang X (2015) A markerless human-manipulators interface using multi-sensors. Ind Robot 42:544–553. https://doi.org/10.1108/IR-03-2015-0057 9. Vogel J, Castellini C, van der Smagt P (2011) EMG-based teleoperation and manipulation with the DLR LWR-III. In: 2011 IEEE/RSJ international conference on intelligent robots and systems (IROS). https://doi.org/10.1109/IROS.2011.6094739 10. Miller N, Jenkins O, Kallmann M, Mataric M (2004). Motion capture from inertial sensing for untethered humanoid teleoperation. In: 2004 4th IEEE-RAS international conference on humanoid robots, vol 2, pp 547–565. https://doi.org/10.1109/ICHR.2004.1442670 11. Kobayashi F, Kitabayashi K, Nakamoto H, Kojima F (2013) Hand/arm robot teleoperation by inertial motion capture. In: Second international conference on robot, vision and signal. https:// doi.org/10.1109/RVSP.2013.60 12. Du G, Zhang P, Li D (2014). Human–manipulator interface based on multisensory process via Kalman filters. IEEE Trans Ind Electron 61(10):5411–5418. https://doi.org/10.1109/TIE.2014. 2301728 13. Peer A, Einenkel S, Buss M (2008) Multi-fingered telemanipulation – mapping of a human hand to a three finger gripper. In: Proceedings of the 17th IEEE international symposium on robot and human interactive communication, RO-MAN, pp 465–470. https://doi.org/10.1109/ ROMAN.2008.4600710 14. Rosell J, Suarez R, Rosales Gallegos C, Pérez Ruiz A (2011). Autonomous motion planning of a hand-arm robotic system based on captured human-like hand postures. Auton Robots 31:87– 102. https://doi.org/10.1007/s10514-011-9232-5 15. Pao L, Speeter T (1989) Transformation of human hand positions for robotic hand control. In: Proceedings of the IEEE international conference on robotics and automation, pp 1758–1763. https://doi.org/10.1109/ROBOT.1989.100229 16. Lin Y, Sun Y (2013) Grasp mapping using locality preserving projections and kNN regression. In: Proceedings – IEEE international conference on robotics and automation, pp 1076–1081. https://doi.org/10.1109/ICRA.2013.6630706 17. Bocsi B, Csató L, Peters J (2013). Alignment-based transfer learning for robot models. In: Proceedings of the international joint conference on neural networks, pp 1–7. 
https://doi.org/10.1109/IJCNN.2013.6706721
18. Zhou J, Tsang I (2014) Heterogeneous domain adaptation for multiple classes. In: Proceedings of the seventeenth international conference on artificial intelligence and statistics, PMLR, vol 33, pp 1095–1103 19. Schaal, Stefan. (2006). Dynamic movement primitives–a framework for motor control in humans and humanoid robotics. In: Adaptive motion of animals and machines. Springer, Tokyo. https://doi.org/10.1007/4-431-31381-8_23 20. Pastor P, Hoffmann H, Asfour T, Schaal S (2009) Learning and generalization of motor skills by learning from demonstration. In: International conference on robotics and automation (ICRA 2009), pp 763–768. https://doi.org/10.1109/ROBOT.2009.5152385 21. Ijspeert AJ, Nakanishi J, Hoffmann H, Pastor P, Schaal S (2012) Dynamical movement primitives: learning attractor models for motor behaviors. Neural Comput 25:328. https://doi. org/10.1162/NECO_a_00393 22. Metzen J, Fabisch A, Senger L, Fernundez J, Kirchner E (2013) Towards learning of generic skills for robotic manipulation. KI - Künstliche Intelligenz 28:15. https://doi.org/10.1007/ s13218-013-0280-1 23. Yu T, Finn C, Dasari S, Xie A, Zhang T, Abbeel P, Levine S (2018) One-shot imitation from observing humans via domain-adaptive meta-learning. https://arxiv.org/abs/1802.01557. https://doi.org/10.15607/RSS.2018.XIV.002 24. Hussein A, Gaber M, Elyan E, Jayne C (2017) Imitation learning: a survey of learning methods. ACM Comput Surv 50:1. https://doi.org/10.1145/3054912 25. Bagnell J (2014) Reinforcement learning in robotics: a survey. Int J Robot Res 32(11):1238– 1274. https://doi.org/10.1007/978-3-319-03194-1_2 26. Ijspeert AJ, Nakanishi J, Schaal S (2002) Movement imitation with nonlinear dynamical systems in humanoid robots. In: Proceedings – IEEE international conference on robotics and automation, vol 2, pp 1398–1403. https://doi.org/10.1109/ROBOT.2002.1014739 27. Herzog S, Worgotter F, Kulvicius T (2016) Optimal trajectory generation for generalization of discrete movements with boundary conditions. In: IEEE/RSJ international conference on intelligent robots and systems, pp 3143–3149. https://doi.org/10.1109/IROS.2016.7759486 28. Rosell J, Suarez R, Rosales Gallegos C, Pérez Ruiz A (2011) Autonomous motion planning of a hand-arm robotic system based on captured human-like hand postures. Auton Robots 31:87– 102. https://doi.org/10.1007/s10514-011-9232-5 29. Pao L, Speeter T (1989) Transformation of human hand positions for robotic hand control. In: IEEE international conference on robotics and automation, vol 3, pp 758–1763. https://doi.org/ 10.1109/ROBOT.1989.100229 30. Peer A, Einenkel S, Buss M (2008) Multi-fingered telemanipulation – mapping of a human hand to a three finger gripper. In: Proceedings of the 17th IEEE international symposium on robot and human interactive communication, RO-MAN, pp 465–470. https://doi.org/10.1109/ ROMAN.2008.4600710 31. Aracil R, Balaguer C, Buss M, Ferre M, Melchiorri C (2007) Book advances in telerobotics. Springer, Berlin. https://doi.org/10.1007/978-3-540-71364-7 32. Argall B, Chernova S, Veloso M, Browning B (2009) A survey of robot learning from demonstration. Robot Auton Syst 57:469–483. https://doi.org/10.1016/j.robot.2008.10.024 33. Liarokapis M, Artemiadis P, Bechlioulis C, Kyriakopoulos K (2013) Directions, methods and metrics for mapping human to robot motion with functional anthropomorphism: a review. School of Mechanical Engineering, National Technical University of Athens, Technical Report. https://doi.org/10.13140/RG.2.1.4075.2405 34. 
Hong JW, Tan XN (1989) Calibrating a VPL DataGlove for teleoperating the Utah/MIT hand. In: IEEE international conference on robotics and automation, vol 3, pp 1752–1757. https:// doi.org/10.1109/ROBOT.1989.100228 35. Kim D, Kim J, Lee K, Park C, Song J, Kang D (2009) Excavator tele-operation system using a human arm. Autom Constr 18:173–182. https://doi.org/10.1016/j.autcon.2008.07.002 36. Cerulo I, Ficuciello F, Lippiello V, Siciliano B (2017) Teleoperation of the SCHUNK S5FH under-actuated anthropomorphic hand using human hand motion tracking. Robot Auton Syst. https://doi.org/10.1016/j.robot.2016.12.004
37. Pao L, Speeter T (1989) Transformation of human hand positions for robotic hand control. In: IEEE international conference on robotics and automation. https://doi.org/10.1109/ROBOT. 1989.100229 38. Ekvall S, Kragic D (2004) Interactive grasp learning based on human demonstration. In: Proceedings – IEEE international conference on robotics and automation, vol 4, pp 3519–3524. https://doi.org/10.1109/ROBOT.2004.1308798 39. Wojtara T, Nonami K (2004) Hand posture detection by neural network and grasp mapping for a master slave hand system. In: IEEE/RSJ international conference on intelligent robots and systems. https://doi.org/10.1109/IROS.2004.1389461 40. Pierce R, Kuchenbecker K (2012) A data-driven method for determining natural human-robot motion mappings in teleoperation. In: Proceedings of the IEEE RAS and EMBS international conference on biomedical robotics and biomechatronics, pp 169–176. https://doi.org/10.1109/ BioRob.2012.6290927 41. Stoelen MF, Tejada VF, Huete AJ, Balaguer C, Bonsignorio F (2015) Distributed and adaptive shared control systems: methodology for the replication of experiments. IEEE Robot Autom Mag. https://doi.org/10.1109/MRA.2015.2460911 42. Fang B, Sun F, Liu H, Guo D, Chen W, Yao G (2017) Robotic teleoperation systems using a wearable multimodal fusion device. Int J Adv Robot Syst 14:1–11. https://doi.org/10.1177/ 1729881417717057 43. Ang K, Chong G, Li Y (2005) PID control system analysis, design, and technology. IEEE Trans Control Syst Technol 13:559–576. https://doi.org/10.1109/TCST.2005.847331 44. Mulling K, Kober J, Kroemer O, Peters J (2013) Learning to select and generalize striking movements in robot table tennis. Int J Robot Res 32:263–279. https://doi.org/10.1177/ 0278364912472380 45. Stulp F, Theodorou E, Schaal S (2012) Reinforcement learning with sequences of motion primitives for robust manipulation. IEEE Trans Robot 28:1360–1370. https://doi.org/10.1109/ TRO.2012.2210294 46. Zhao Y, Xiong R, Li F, Xiaohe D (2014) Generating a style-adaptive trajectory from multiple demonstrations. Int J Adv Robot Syst 11:1. https://doi.org/10.5772/58723 47. Coates A, Abbeel P, Ng A (2008) Learning for control from multiple demonstrations. IEEE international conference on machine learning, pp 144–151. https://doi.org/10.1145/1390156. 1390175 48. Vakanski A, Mantegh I, Irish A, Janabi-Sharifi F (2012) Trajectory learning for robot programming by demonstration using hidden Markov model and dynamic time warping. IEEE Trans Syst Man Cybern B 42. https://doi.org/10.1109/TSMCB.2012.2185694
Chapter 6
Learning from Visual-Based Teleoperation Demonstration
Abstract This chapter proposes deep neural networks to enhance visual teleoperation while maintaining human–robot posture consistency. Firstly, the teacher–student network (TeachNet), a novel neural network architecture for intuitive and markerless vision-based teleoperation of dexterous robotic hands, is introduced. It is combined with a consistency loss function, which handles the differences in appearance and anatomy between human and robotic hands. Then, a multi-stage visual teleoperation network is designed for the robotic arm. Finally, imitation experiments are carried out on the robots to demonstrate that the proposed visual-based posture-consistent teleoperation is effective and reliable.
6.1 Introduction

Robotic dexterous hands and arms provide promising alternatives for replacing human hands and arms in the execution of tedious and dangerous tasks. Teleoperation is superior to intelligent programming when it comes to fast decisions and corner cases. Unlike contact- or wearable-device-based teleoperation, markerless vision-based teleoperation [1] offers the advantages of natural human-limb motion and lower invasiveness. Analytical vision-based teleoperation falls into two categories: model-based and appearance-based approaches. Model-based approaches [2] provide continuous solutions but are computationally costly and typically depend on the availability of a multicamera system [3]. Conversely, appearance-based approaches [4] recognize a discrete number of hand poses that typically correspond to the method's training set, without high computational cost or hardware complexity. Recently, an increasing number of researchers have focused on data-driven vision-based teleoperation methods, which estimate the 3D hand pose or recognize the class of hand gestures using deep CNNs and then map the locations or the corresponding poses to the robot. However, these solutions not only depend strongly on the accuracy of the hand pose estimation or classification but also suffer from long post-processing times.
Compared to analytical methods, data-driven techniques based on deep neural networks pay more attention to object representation and perceptual processing such as pose estimation, object detection, and image classification [5–7]. The effect of convolutional networks has been investigated in the large-scale image recognition setting [8, 9]. In [10], deep CNN methods are proposed for 3D human pose estimation from monocular images. In [11], a deep CNN is trained with a spherical part model for hand pose estimation. Inspired by the competence of neural network methods in human posture estimation, an end-to-end deep neural network that receives the depth image of a human operator's arm and outputs the corresponding joint angles of a robot arm is proposed for visual teleoperation. One issue with neural network methods is that the direct mapping from a depth image to the joint angles of a robot arm is highly nonlinear, which causes difficulties in the learning procedure. Therefore, we propose a multi-stage neural network which accelerates the convergence of visual teleoperation network training. Besides the structure of the visual teleoperation network, training performance is also directly influenced by the quantity and quality of the dataset. A dataset named UTD-MHAD, which consists of RGB images, depth images, and human skeleton positions, is provided in [12]. In [13], 3.6 million accurate 3D human poses are provided for training realistic human sensing systems and evaluating human pose estimation algorithms. However, a human–robot posture-consistent tuple dataset, which contains a large number of pairs of consistent human and robot postures, needs to be established to train the visual teleoperation network. Therefore, it is meaningful to develop a mapping method that calculates consistent human arm and robot arm postures for the establishment of the training tuple database. Human teleoperation of robots has usually been implemented through contacting devices such as tracking sensors [14], gloves equipped with angle sensors [15], inertial sensors [16], and joysticks [17]. Stanton et al. [18] suggest end-to-end teleoperation of a 23-DOF robot by training a feed-forward neural network for each DOF of the robot to learn the mapping between sensor data from a motion capture suit and the angular position of the robot actuator. However, wearable devices are customized for a certain size of human hand or body, and contacting methods may hinder natural human-limb motion. Compared with these methods, markerless vision-based teleoperation is less invasive and allows natural and comfortable gestures. Visual model-based methods, such as [19], compute continuous 3D positions and orientations of the thumb and index finger from segmented images captured by a camera system and control a parallel-jaw gripper mounted on a six-axis robot arm. Romero [20] classifies human grasps into grasp classes and approach directions based on human hand images and then maps them to a discrete set of corresponding robot grasp classes following the external observation paradigm. Compared with analytical methods, data-driven techniques pay more attention to object representation and perceptual processing, e.g., feature extraction, object recognition or classification, and pose estimation. Michel et al. [21] provide a teleoperation method for a NAO humanoid robot that tracks human body motion from markerless visual observations and then solves the inverse kinematics.
But this method does not consider the physical constraints and joint limits of the
robots, so it easily generates poses that the robot cannot reach. Nevertheless, these methods strongly depend on the accuracy of the hand pose estimation or classification and require much time for post-processing. In this chapter, we aim to design an end-to-end vision-based CNN which generates continuous robot poses and provides a fast and intuitive teleoperation experience. 3D hand pose estimation is one of the essential research fields in vision-based teleoperation. Although the technology has developed rapidly, the estimation accuracy is still limited [22]. According to the representation of the output pose, 3D hand pose estimation methods consist of detection-based and regression-based methods. Detection-based methods [23] give a probability density map for each joint, while regression-based methods [24, 25] directly map the depth image to the joint locations or the joint angles of a hand model. Regardless of whom the output joint pose belongs to, the regression-based network is similar to our end-to-end network. We instead seek to take a noisy depth image of the human hand and arm as input and produce joint angles of the robot hand and arm as output by training a deep CNN. End-to-end vision-based teleoperation can be a natural and intuitive way to manipulate a remote robot and is friendly to novice teleoperators. Therefore, it is essential to design an efficient network which can learn the corresponding robot pose features in the human pose space. Since the end-to-end method depends on massive human–robot teleoperation pairings, we also aim to explore an efficient method which collects synchronized hand and arm data from both the robot and the human.
6.2 Manipulation Learning of Robotic Hand

In order to learn the pose features of the robot from images of the human hand, the problem of how to obtain a vast number of human–robot pairings must be considered. Prior works [26, 27] acquire master–slave pairings by asking a human operator to imitate the robot motion synchronously. Such pairing data is costly to collect and typically comes with noisy correspondences. Also, there is no exact correspondence between the human and the robot because physiological differences make the imitation non-trivial and subjective to the imitator. In fact, the robot state is more accessible and relatively stable compared to the human hand, and many human hand datasets already exist. Since training on real images may require a huge data collection effort, an alternative is to learn on simulated images and adapt the representation to real data [28]. We propose a novel way of generating human–robot pairings from these observations by using an existing dataset of labeled human hand depth images, manipulating the robot and recording the corresponding joint angles and images in simulation, and performing extensive evaluations on a physical robot. In this chapter, we present a novel scheme for teleoperating the Shadow dexterous hand based on a single depth image (see Fig. 6.1). Our primary contributions are as follows.
Fig. 6.1 Our vision-based teleoperation architecture. (Center) TeachNet is trained offline to predict robot joint angles from depth images of a human hand using our 400k pairwise human–robot hand dataset. (Left) Depth images of the operator's hand are captured by a depth camera and then fed to TeachNet. (Right) The joint angles produced by TeachNet are executed on the robot to imitate the operator's hand pose
(1) We propose an end-to-end TeachNet, which learns the kinematic mapping between the robot hand and the human hand. (2) We build a pairwise human–robot hand dataset that includes pairs of depth images as well as the corresponding joint angles of the robot hand. (3) We design an optimized mapping method that matches the Cartesian positions and link directions of the Shadow hand to the human hand pose and properly takes possible self-collisions into account. In the network evaluation, TeachNet achieves higher accuracy and lower error compared with other end-to-end baselines. As illustrated in our robotic experiments, our method allows the Shadow robot to imitate human gestures and to finish grasp tasks significantly faster than state-of-the-art data-driven vision-based teleoperation.

Solving joint regression problems directly from human images is quite challenging because the robot hand and the human hand occupy two different domains. Specifically, imagine that we have an image I_R of a robotic hand and an image I_H of a human hand, where the robotic hand in the image acts exactly the same as the human hand. The problem of mapping the human hand image to the corresponding robot joints can be formulated as:

f_{feat}: I_H \in \mathbb{R}^2 \rightarrow z_{pose}, \quad f_{regress}: z_{pose} \rightarrow \Theta    (6.1)
To better process the geometric information in the input depth image and the complex constraints on joint regression, we adopt an encoder–decoder style deep neural network. The upper branch in Fig. 6.2 illustrates the network architecture we use. However, the human hand and the Shadow hand come from different domains, so it could be difficult for f_feat to learn an appropriate latent feature z_pose in pose space. In contrast, the mapping from I_R to the joint target Θ is more natural, as it is exactly a well-defined hand pose estimation problem. Intuitively, we
Fig. 6.2 TeachNet architecture. Top: human branch, Bottom: robot branch. The input depth images I_H and I_R are fed to the corresponding branch, which predicts the robot joint angles Θ_H, Θ_R. The residual module is a convolutional neural network with an architecture similar to ResNet. FC denotes a fully connected layer, BN denotes a batch normalization layer, R denotes a rectified linear unit
believe that for a paired human and robot image, their latent pose features z_pose should be encouraged to be consistent, as they represent the same hand pose and will finally be mapped to the same joint target. Also, based on the observation that the mapping from I_R to Θ performs better than that from I_H (these preliminary results can be found in Fig. 6.9), the encoder f_feat of I_R can extract better pose features, which could significantly improve the regression results of the decoder. With these considerations, we propose a novel TeachNet to tackle the vision-based teleoperation problem (6.1) in an end-to-end fashion. TeachNet consists of two branches: the robot branch, which plays the role of a teacher, and the human branch, which acts as the student. Each branch is supervised with a mean squared error (MSE) loss L_ang:

L_{ang} = \|\Theta - J\|^2    (6.2)
where J is the groundtruth joint angle vector. Besides the encoder–decoder structure that maps the input depth image to the joint prediction, we define a consistency loss L_cons between the two latent features z_H and z_R to exploit the geometric resemblance between the human hand and the robotic hand. Therefore, L_cons forces the human branch to be supervised in a pose space shared with the robot branch. To explore the most effective aligning mechanism, we design two kinds of consistency losses and two different aligning positions. The
most intuitive mechanism for feature alignment is to provide an extra MSE loss over the latent features of the two branches:

L_{cons\_h} = \|z_H - z_R\|^2    (6.3)
Sometimes, (6.3) could distract the network from learning hand pose representations, especially in the early training stage. Inspired by Villegas et al. [29], we feed z_H and z_R into a discriminator network D [30] to compute a realism score for real and fake pose features. The soft consistency loss is basically the negative of this score:

L_{cons\_s} = \log(1 - D(z_H))    (6.4)
As for the aligning position, we propose early teaching and late teaching. For the former, we put the alignment layer after the encoding and embedding module, while for the latter the alignment layer is positioned at the second-to-last layer of the whole model (which means that the regression module contains only one layer). In the following, we refer to early teaching with L_cons_s as Teach Soft-Early, late teaching with L_cons_s as Teach Soft-Late, early teaching with L_cons_h as Teach Hard-Early, and late teaching with L_cons_h as Teach Hard-Late. We also introduce an auxiliary loss to further improve our teleoperation model: the physical loss L_phy, which enforces the physical constraints and joint limits, is defined by:

L_{phy}(\Theta) = \sum_i \left[ \max(0, \Theta_i - \theta_{max}) + \max(0, \theta_{min} - \Theta_i) \right]    (6.5)
Overall, the complete training objective for each branch is:

L_{teach}(\Theta) = L_{ang} + L_{phy}    (6.6)

L_{stud}(\Theta) = L_{ang} + \alpha \cdot L_{cons} + L_{phy}    (6.7)

where α = 1 for the hard consistency loss and α = 0.1 for the soft consistency loss.
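The following PyTorch-style sketch shows how the per-branch objectives (6.6)-(6.7) could be assembled from the individual loss terms, here using the hard consistency loss. The tensor names, joint-limit handling, and the detaching of the teacher features are illustrative choices rather than the authors' exact implementation.

import torch

def angle_loss(pred, target):
    # MSE joint-angle loss, Eq. (6.2), averaged over batch and joints
    return torch.mean((pred - target) ** 2)

def physical_loss(pred, theta_min, theta_max):
    # penalize joint angles outside the robot's limits, Eq. (6.5), averaged over batch and joints
    over = torch.clamp(pred - theta_max, min=0.0)
    under = torch.clamp(theta_min - pred, min=0.0)
    return torch.mean(over + under)

def hard_consistency_loss(z_h, z_r):
    # MSE alignment of the two latent pose features, Eq. (6.3)
    return torch.mean((z_h - z_r) ** 2)

def teacher_and_student_losses(theta_r, theta_h, z_r, z_h, target,
                               theta_min, theta_max, alpha=1.0):
    """Teacher (robot branch) and student (human branch) objectives, Eqs. (6.6)-(6.7)."""
    l_teach = angle_loss(theta_r, target) + physical_loss(theta_r, theta_min, theta_max)
    l_stud = (angle_loss(theta_h, target)
              + alpha * hard_consistency_loss(z_h, z_r.detach())  # teacher features kept fixed here
              + physical_loss(theta_h, theta_min, theta_max))
    return l_teach, l_stud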
Training the TeachNet, which learns the kinematic mapping between the human hand and the robot hand, strongly relies on a massive dataset of human–robot pairings. We achieve this by using the off-the-shelf BigHand2.2M human hand dataset [31] and an optimized mapping method following the pipeline of Fig. 6.3. With this pipeline, we collect a training dataset that contains 400K pairs of simulated robot depth images and human hand depth images, with the corresponding robot joint angles and poses. The Shadow Dexterous Hand [32] used in this work is motor-controlled and equipped with five fingers. Its kinematic chain is shown on the right side of Fig. 6.10. Each of these fingers has a BioTac tactile sensor attached, which replaces the last phalanx and removes the controllability of the last joint. Each finger has four joints, the distal, middle, proximal, and metacarpal joint, but the first joint of each finger is stiff.
Fig. 6.3 Pipeline for dataset generation. (Top left) The human hand model has 21 joints and moves with 31 degrees of freedom in the BigHand2.2M dataset. (Bottom left) A depth image example from the BigHand2.2M dataset. (Middle) Optimized mapping method from the human hand to the Shadow hand. (Top right) The Shadow hand with BioTac sensors has 24 joints and moves with 19 degrees of freedom. (Bottom right) The corresponding RGB and depth images of Shadow gestures obtained from Gazebo. The colored circles denote the joint keypoint positions on the hand, and the green triangles denote the common reference frame F
The little finger and the thumb have an extra joint for holding objects. To sum up, there are 19 DoFs after adding the two DoFs in the wrist. In contrast to the robot hand, the human hand model from the BigHand2.2M dataset has 21 joints and moves with 31 DoFs, as shown in Fig. 6.3. The main kinematic differences between the robot hand and the human hand are the limited angle ranges of the robot joints and the structure of the wrist joints. To reduce this dissimilarity, the two wrist joints of the Shadow hand are fixed at 0 rad, and only 15 joint keypoints of the Shadow hand, namely the TIP, PIP, and MCP of each finger, are considered. Effectively mapping the robot pose from the human hand pose plays a significant role in building our training dataset. In order to imitate the human hand pose, we propose an optimized mapping method integrating position mapping and orientation mapping while properly taking possible self-collisions into account. Firstly, we use a common reference frame F located at the human wrist joint and 34 mm above the robot wrist joint along its z-axis. Note that 34 mm is the height from the wrist joint to the base joint of the thumb. These locations are chosen because they lie at positions of high kinematic similarity. Secondly, we enforce position mapping on the fingertips with a strong weight ω_pf and on the PIP joints with a minor weight ω_pp. Thirdly, direction mapping with weight ω_d is applied to the five proximal phalanges and the distal phalanx of the thumb. In our dataset, we set {ω_pf, ω_pp, ω_d} = {1, 0.2, 0.2}.
Fig. 6.4 The Shadow depth images from nine viewpoints corresponding to one human gesture in our dataset
Using the BioIK solver to determine the robot joint angles Θ ∈ R^17, the robot executes the movements in Gazebo, and self-collisions are checked by MoveIt. In case BioIK produces a self-colliding output, we define a cost function F_cost which measures the distance between two links:

F_{cost} = \max(0, R_{col} - \|P_i - P_j\|_2)    (6.8)

where P_i and P_j denote the positions of link i and link j, respectively, and R_col is the minimum collision-free radius between two links. Considering that the BigHand2.2M dataset spans a wide range of viewpoints on the human hand, it is indispensable to increase the viewpoint diversity of the robot data. Thus, we collect visual samples of the robot through nine simulated depth cameras at different observing positions in Gazebo and record nine depth images for each pose simultaneously. As an example, Fig. 6.4 presents the depth images of the robot from nine viewpoints corresponding to the human hand pose at the bottom left of Fig. 6.3.
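A minimal sketch of the self-collision cost in (6.8), applied pairwise over a set of link positions. The link representation and the radius value are placeholders; in the actual pipeline the collision check itself is delegated to MoveIt.

import numpy as np

def self_collision_cost(link_positions, r_col=0.015):
    """Sum of pairwise penetration costs F_cost over all link pairs.
    link_positions: dict mapping link name -> 3-D position (numpy array).
    r_col: minimum collision-free radius between two links (placeholder value, meters)."""
    names = list(link_positions)
    cost = 0.0
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            dist = np.linalg.norm(link_positions[names[a]] - link_positions[names[b]])
            cost += max(0.0, r_col - dist)          # Eq. (6.8) for one link pair
    return cost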
6.3 Manipulation Learning of Robotic Arm

In this section, a visual teleoperation framework based on deep neural networks is proposed for human–robot posture-consistent teleoperation. A tuple database of
human–robot consistent postures is generated by a novel mapping method which maps the human arm posture to the corresponding joint angles of a robot arm. A multi-stage visual teleoperation network is trained on the human–robot posture-consistent tuple dataset and then used to teleoperate a robot arm. An illustrative experiment is conducted to verify the visual teleoperation scheme developed in this chapter. The main contributions are as follows:

• A visual teleoperation framework is proposed, which features a deep neural network structure and a posture mapping method.
• A human–robot posture-consistent dataset is established by a data generator, which calculates the corresponding robot arm joint angle data from human body data.
• A multi-stage network structure is proposed to increase flexibility in training and using the visual teleoperation network.

The procedure of training the visual teleoperation network is shown in Fig. 6.5.

Notation Throughout this chapter, R^n is the n-dimensional Euclidean space, and \|\cdot\|_2 represents the 2-norm of a vector. P_m^n = [x_m^n, y_m^n, z_m^n]^T is the coordinate of point m in frame n, and Q_m^n = [x_m^n, y_m^n, z_m^n, 1]^T is the homogeneous coordinate of point m in frame n.

Training a deep neural network to solve a robot joint angle regression problem from a human depth image is challenging because the regression is a highly nonlinear mapping, which causes difficulties in the learning procedure. To overcome these difficulties, a multi-stage visual teleoperation network which contains multi-stage losses is designed to generate robot joint angles from a human depth image. The proposed multi-stage visual teleoperation network consists of three stages: a human arm keypoint position estimation stage, a robot arm posture estimation stage, and a robot arm joint angle generation stage. The structure of the multi-stage visual teleoperation network is shown in Fig. 6.6. To estimate human arm keypoint positions from a human depth image, a pixel-to-pixel part and a pixel-to-point part are designed in the human arm keypoint position estimation stage. Three kinds of building blocks are used in the pixel-to-pixel part. The first one is a residual block which consists of convolution
Fig. 6.5 The procedure of training the visual teleoperation network
Fig. 6.6 The structure of the multi-stage visual teleoperation network
layers, batch normalization layers, and an activation function. The second block is a downsampling block, which is simply a max pooling layer. The last block is an upsampling block, which contains deconvolution layers, batch normalization layers, and an activation function. The kernel size of the residual blocks is 3 × 3, and that of the downsampling and upsampling layers is 2 × 2 with stride 2. Furthermore, max pooling layers, fully connected layers, batch normalization, and an activation function are used in the pixel-to-point part. In the robot arm posture estimation stage, robot arm directional angles are estimated from the human arm keypoint positions using max pooling layers, fully connected layers, batch normalization, and an activation function. In the robot arm joint angle generation stage, robot arm joint angles are generated from the robot arm directional angles by fully connected layers, a batch normalization layer, and an activation function. The overall loss function for training the multi-stage visual teleoperation network consists of a human arm keypoint position estimation loss, a robot arm posture estimation loss, a robot arm joint angle generation loss, and a physical constraint loss. An MSE function is adopted as the human arm keypoint position loss L_se:

L_{se} = \sum_{n=1}^{N} \|X_n - \hat{X}_n\|^2    (6.9)
where X_n = (x_n, y_n, z_n) ∈ R^3 and \hat{X}_n = (\hat{x}_n, \hat{y}_n, \hat{z}_n) ∈ R^3 denote the groundtruth coordinates and the estimated coordinates of the n-th keypoint of the human arm, respectively, and N denotes the number of keypoints selected on the human arm. An MSE function is used for the robot arm posture loss L_rp:

L_{rp} = \sum_{n=1}^{N} \|R_n - \hat{R}_n\|^2    (6.10)
where R_n = (r_{xn}, r_{yn}, r_{zn}) ∈ R^3 and \hat{R}_n = (\hat{r}_{xn}, \hat{r}_{yn}, \hat{r}_{zn}) ∈ R^3 are the groundtruth and estimated robot arm directional angles of the n-th robot keypoint, respectively, and N denotes the number of robot keypoints. The robot joint angle generation loss L_rg for the robot angle generation stage, supervised by an MSE function, is:

L_{rg} = \|\Theta - \hat{\Theta}\|^2    (6.11)
where Θ = (θ_1, ..., θ_n) ∈ R^n and \hat{\Theta} = (\hat{θ}_1, ..., \hat{θ}_n) ∈ R^n are the groundtruth and estimated joint angles of the robot arm, respectively, and n is the number of robot arm joints. The physical loss L_phy, which enforces the physical constraints and joint limits, is defined by:

L_{phy} = \max(0, \hat{\Theta} - \Theta_{max})^2 + \min(0, \hat{\Theta} - \Theta_{min})^2

where Θ_max and Θ_min are the joint limits of the robot joints. Finally, the overall loss function L for the multi-stage visual teleoperation network is defined as follows:

L = \lambda_{se} L_{se} + \lambda_{rp} L_{rp} + \lambda_{rg} L_{rg} + \lambda_{phy} L_{phy}    (6.12)
where λ_se, λ_rp, λ_rg, and λ_phy are weights to balance the skeleton point estimation loss, the robot arm posture loss, the robot joint angle generation loss, and the physical constraint loss, respectively. The weights of the skeleton point estimation loss L_se, the robot arm posture loss L_rp, and the robot joint angle generation loss L_rg are chosen according to the accuracy of the label data (human arm keypoint positions, robot arm postures, and robot arm joint angles). The weight λ_phy of the physical loss depends on λ_se, λ_rp, and λ_rg. When λ_phy is too large, the overall loss L decreases rapidly as soon as the network outputs joint angles within the constraints, which prevents the network from learning the label data. When λ_phy is too small, the network performance is also poor because the outputs may violate the constraints. Therefore, λ_phy is set as:

\lambda_{phy} = \frac{\lambda_{se} + \lambda_{rp} + \lambda_{rg}}{3}

Training the visual teleoperation network to learn a mapping from a human depth image to a set of robot joint angles relies on a human–robot posture-consistent tuple dataset. In this chapter, a novel posture-consistent mapping method is given as a generator for the human–robot tuple data. The human–robot tuple dataset is therefore generated by combining the UTD-MHAD dataset [12] with the human–robot tuple data generator. In this section, a Baxter robot arm is taken as an example to illustrate the human–robot posture-consistent mapping method. A transform frame (TF) model is shown in Fig. 6.7 to describe the arm posture of Baxter. Points S_r, E_r, W_r, and T_r are defined as the robot arm shoulder, elbow, wrist, and tip points, respectively. The parameters of the Baxter arm TF model are listed in Table 6.1. The transform matrix of Baxter from arm joint i to joint i + 1 is:

T_{i+1}^{i}(\theta_{i+1}) = R_x(\delta_{ix}) R_y(\delta_{iy}) R_z(\delta_{iz}) T_{xyz}(P_i) R_z(\theta_{i+1}), \quad i \in \{0, 1, \cdots, 6\}    (6.13)
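As an illustration of how (6.13) can be chained to compute keypoint positions on the robot arm, a minimal numpy sketch is given below. The fixed rotation offsets and displacement vectors passed to it are placeholders standing in for the Table 6.1 parameters, not values taken from the book.

import numpy as np

def rot(axis, angle):
    """Homogeneous rotation about the x, y, or z axis."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.eye(4)
    if axis == 'x':
        R[1:3, 1:3] = [[c, -s], [s, c]]
    elif axis == 'y':
        R[0, 0], R[0, 2], R[2, 0], R[2, 2] = c, s, -s, c
    else:
        R[0:2, 0:2] = [[c, -s], [s, c]]
    return R

def trans(p):
    """Homogeneous translation by displacement vector p."""
    T = np.eye(4)
    T[0:3, 3] = p
    return T

def link_transform(delta, p, theta):
    """T_{i+1}^{i} from Eq. (6.13): fixed rotations, fixed translation, then the joint rotation."""
    dx, dy, dz = delta
    return rot('x', dx) @ rot('y', dy) @ rot('z', dz) @ trans(p) @ rot('z', theta)

def chain_point(deltas, ps, thetas):
    """Position (in frame 0) of a point located at the origin of the last frame in the chain."""
    T = np.eye(4)
    for delta, p, theta in zip(deltas, ps, thetas):
        T = T @ link_transform(delta, p, theta)
    return (T @ np.array([0.0, 0.0, 0.0, 1.0]))[:3]

Chaining the first transforms of the arm in this way yields, for example, the elbow and wrist points used in Eqs. 6.17 and 6.20 below.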
Fig. 6.7 TF model of Baxter arm

Table 6.1 Parameters of Baxter arm TF model

Joint i   δix    δiy    δiz    pix     piy     piz
1         0      0      l0     0       0       0
2         l1     0      0      −π/2    0       0
3         0      0      l2     π/2     π/2     0
4         0      l3     0      −π/2    0       −π/2
5         0      0      l4     π/2     π/2     0
6         0      l5     0      −π/2    0       −π/2
7         0      0      l6     π/2     π/2     0
where R_a(δ_ia) is the rotation matrix around axis a, δ_ia is the angle of rotation around axis a, and T_xyz(P_i) is a translation matrix, where P_i = [p_ix, p_iy, p_iz] is a displacement vector. θ_i is the i-th joint angle of the Baxter arm. In Table 6.1, the units for δ_ix, δ_iy, δ_iz and p_ix, p_iy, p_iz are millimeters and radians, respectively. A direct mapping between human arm joint angles and robot arm joint angles cannot be created, since the kinematic models of the human arm and the robot arm are totally different. Nevertheless, a human–robot posture-consistent mapping is established by matching predefined keypoints (shoulder point, elbow point, wrist point, and handtip point) of the human arm and the robot arm. A set of robot arm joint angles is generated, which guarantees the consistency of posture between the human arm and the robot arm. Note that P_m^n = [x_m^n, y_m^n, z_m^n]^T is the coordinate of point m in frame n and Q_m^n = [x_m^n, y_m^n, z_m^n, 1]^T is the homogeneous coordinate of point m in frame n. Obviously,
Fig. 6.8 Keypoints of human in robot arm 0 coordinate
Q_m^n = [P_m^{n T}, 1]^T. As shown in Fig. 6.8, the human arm keypoint position data are transferred from camera homogeneous coordinates to robot frame 0 homogeneous coordinates through the coordinate transformation:

Q_m^0 = T_c^0 Q_m^c    (6.14)

where Q_m^0 = [Q_{sm}^{0 T}, Q_{em}^{0 T}, Q_{wm}^{0 T}, Q_{tm}^{0 T}]^T, with Q_{sm}^0, Q_{em}^0, Q_{wm}^0, and Q_{tm}^0 being the homogeneous coordinates of the human arm shoulder, elbow, wrist, and handtip points in robot frame 0. Similarly, Q_m^c collects the homogeneous coordinates of the human arm keypoints in the camera frame, and T_c^0 is the transform matrix from the camera frame to robot frame 0. The coordinates P_{sm}^0, P_{em}^0, P_{wm}^0, and P_{tm}^0 of the human arm shoulder, elbow, wrist, and handtip points in robot frame 0 are obtained from Q_{sm}^0, Q_{em}^0, Q_{wm}^0, and Q_{tm}^0, respectively. Baxter arm joint angles θ_1, θ_2 are calculated from the human directional vector P_{sm}^0 P_{em}^0. Moreover, Baxter arm joint angles θ_3, θ_4 and θ_5, θ_6 are calculated from the human directional vectors P_{em}^0 P_{wm}^0 and P_{wm}^0 P_{tm}^0, respectively. The distance d_{sem} between the human points P_{sm}^0 and P_{em}^0 is obtained as follows:

d_{sem} = \sqrt{(x_{em}^0 - x_{sm}^0)^2 + (y_{em}^0 - y_{sm}^0)^2 + (z_{em}^0 - z_{sm}^0)^2}    (6.15)
The directional angle set A_sem of the human direction vector P_{sm}^0 P_{em}^0 is:

A_{sem} = [\cos(\alpha_{sem}), \cos(\beta_{sem}), \cos(\gamma_{sem})]^T = \left[\frac{x_{em} - x_{sm}}{d_{sem}}, \frac{y_{em} - y_{sm}}{d_{sem}}, \frac{z_{em} - z_{sm}}{d_{sem}}\right]^T    (6.16)
Note that P_{sr}^0 = [0, 0, l_0]^T is the shoulder point of the Baxter arm. The elbow point P_{er}^0(θ_1, θ_2) of the Baxter arm is obtained as follows:

Q_{er}^0(\theta_1, \theta_2) = [P_{er}^0(\theta_1, \theta_2)^T, 1]^T = T_1^0(\theta_1) T_2^1(\theta_2) T_3^2(\theta_3) Q_{er}^3    (6.17)

where Q_{er}^3 = [0, 0, 0, 1]^T, and the transform matrices T_{i+1}^i, i ∈ {0, 1, 2}, are obtained from Eq. 6.13 and the parameters in Table 6.1. Furthermore, the directional angle set A_ser of the Baxter arm directional vector P_{sr}^0 P_{er}^0 is:

A_{ser}(\theta_1, \theta_2) = [\cos(\alpha_{ser}(\theta_1, \theta_2)), \cos(\beta_{ser}(\theta_1, \theta_2)), \cos(\gamma_{ser}(\theta_1, \theta_2))]^T    (6.18)

The Baxter arm joint angles \bar{θ}_1, \bar{θ}_2 are obtained by solving the constrained nonlinear matrix equality (6.19):

A_{ser}(\theta_1, \theta_2) = A_{sem} \quad \text{s.t.} \quad \theta_1 \in \Theta_1,\ \theta_2 \in \Theta_2    (6.19)
where Θ_1, Θ_2 are the constraint sets of Baxter arm joint angles θ_1, θ_2, respectively. The directional angle set A_ewm of the human directional vector P_{em}^0 P_{wm}^0 is obtained in the same way as A_sem. The wrist point P_{wr}^0(θ_3, θ_4) of the Baxter arm is given by:

Q_{wr}^0(\theta_3, \theta_4) = [P_{wr}^0(\theta_3, \theta_4)^T, 1]^T = T_1^0(\bar{\theta}_1) T_2^1(\bar{\theta}_2) T_3^2(\theta_3) T_4^3(\theta_4) T_5^4(\theta_5) Q_{wr}^5    (6.20)

where Q_{wr}^5 = [0, 0, 0, 1]^T, and the transform matrices T_{i+1}^i, i ∈ {0, 1, ..., 4}, are obtained from Eq. 6.13 with the solutions \bar{θ}_1, \bar{θ}_2 of Eq. 6.19 and the parameters in Table 6.1. The elbow point P_{er}^0(\bar{θ}_1, \bar{θ}_2) of the Baxter arm is given by Eq. 6.17 with the solutions \bar{θ}_1, \bar{θ}_2 of Eq. 6.19. Then, the directional angle set A_ewr(θ_3, θ_4) of the Baxter arm directional vector P_{er}^0 P_{wr}^0 is obtained from the elbow point P_{er}^0(\bar{θ}_1, \bar{θ}_2) and the wrist point Q_{wr}^0(θ_3, θ_4) of the Baxter arm. The Baxter arm joint angles \bar{θ}_3, \bar{θ}_4 are given by solving the constrained nonlinear matrix equality (6.21):

A_{ewr}(\theta_3, \theta_4) = A_{ewm} \quad \text{s.t.} \quad \theta_3 \in \Theta_3,\ \theta_4 \in \Theta_4    (6.21)
where Θ_3, Θ_4 are the constraint sets of Baxter arm joint angles θ_3, θ_4, respectively. The Baxter arm joint angles \bar{θ}_5, \bar{θ}_6 are obtained from A_wtr(θ_5, θ_6) = A_wtm in the same way as equalities (6.19) and (6.21). The following definitions are introduced to illustrate this mapping further. For different robot arms, we first define the shoulder, elbow, wrist, and handtip points. The part between the shoulder and elbow points is defined as robot sub-arm 1; likewise, the part between the elbow and wrist points is defined as robot sub-arm 2, and the part between the wrist and handtip points is defined as robot sub-arm 3. Joint angles θ_1 ... θ_i, θ_{i+1} ... θ_j, and θ_{j+1} ... θ_n belong to robot sub-arm 1, robot sub-arm 2, and robot sub-arm 3, respectively. Analogously, we define human sub-arm 1, human sub-arm 2, and human sub-arm 3.

1. Joint angles θ_1 ... θ_i of robot sub-arm 1 are obtained by inverse kinematics, matching the direction angles between the shoulder and elbow points of robot sub-arm 1 and human sub-arm 1. The elbow point of the robot arm is then fixed based on joint angles θ_1 ... θ_i.
2. Joint angles θ_{i+1} ... θ_j of robot sub-arm 2 are obtained by inverse kinematics, matching the direction angles between the elbow and wrist points of robot sub-arm 2 and human sub-arm 2. The wrist point of the robot arm is then fixed based on joint angles θ_1 ... θ_j.
3. Joint angles θ_{j+1} ... θ_n of robot sub-arm 3 are obtained by inverse kinematics, matching the direction angles between the wrist and handtip points of robot sub-arm 3 and human sub-arm 3.

The procedure of the human–robot posture-consistent mapping method is summarized in Algorithm 1, and a code sketch of the solver is given after it.

Algorithm 1
1. Calculate A_sem, A_ewm, and A_wtm of the human arm.
2. Calculate A_ser(θ_1 ... θ_i) of the robot arm. Solve \bar{θ}_1 ... \bar{θ}_i from A_ser(θ_1 ... θ_i) = A_sem.
3. Calculate A_ewr(θ_{i+1} ... θ_j) of the robot arm with \bar{θ}_1 ... \bar{θ}_i. Solve \bar{θ}_{i+1} ... \bar{θ}_j from A_ewr(θ_{i+1} ... θ_j) = A_ewm.
4. Calculate A_wtr(θ_{j+1} ... θ_n) of the robot arm with \bar{θ}_1 ... \bar{θ}_j. Solve \bar{θ}_{j+1} ... \bar{θ}_n from A_wtr(θ_{j+1} ... θ_n) = A_wtm.
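A minimal sketch of one step of Algorithm 1 in Python: the joint angles of a sub-arm are found by matching its segment's direction cosines with the human's under joint-limit bounds. The forward-kinematics callback, solver choice, and settings are illustrative assumptions; the book does not prescribe a specific numerical solver.

import numpy as np
from scipy.optimize import minimize

def direction_cosines(p_start, p_end):
    """Directional angle set (cos alpha, cos beta, cos gamma) of the vector p_start -> p_end."""
    v = np.asarray(p_end, dtype=float) - np.asarray(p_start, dtype=float)
    return v / np.linalg.norm(v)

def solve_subarm(fk_segment, fixed_angles, a_human, bounds, init):
    """Find sub-arm joint angles whose segment direction matches the human direction cosines.
    fk_segment(fixed_angles, free_angles) -> (segment start point, segment end point)."""
    def cost(free_angles):
        p0, p1 = fk_segment(fixed_angles, free_angles)
        return np.sum((direction_cosines(p0, p1) - a_human) ** 2)
    res = minimize(cost, init, bounds=bounds, method='L-BFGS-B')
    return res.x

Applying solve_subarm successively to sub-arm 1, sub-arm 2, and sub-arm 3, each time appending the solved angles to fixed_angles, reproduces the three solve steps of Algorithm 1.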
6.4 Experimental Results

6.4.1 Experimental Results of Robotic Hand

We examine whether TeachNet can learn visual representations that are more indicative of the kinematic structure of the human hand. The proposed TeachNet is evaluated on our paired dataset with the following experiments: (1) To explore
the appropriate position of the alignment layer and the proper alignment method, we compare the four proposed network structures: Teach Soft-Early, Teach Soft-Late, Teach Hard-Early, and Teach Hard-Late. (2) To validate the significance of the alignment layer, we perform an ablation analysis by removing the consistency loss L_cons and separately training the single human branch and the single robot branch. We refer to these two baselines as Single Human and Single Robot, respectively. (3) We compare our end-to-end method with a data-driven vision-based teleoperation method which maps the robot pose from the human joint locations obtained by 3D hand estimation. We refer to this baseline as the HandIK solution. Three evaluation metrics are used in this work (a code sketch of the first is given below): (1) the fraction of frames whose maximum/average joint angle errors are below a threshold; (2) the fraction of frames whose maximum/average joint distance errors are below a threshold; (3) the average angle error over all angles in Θ. The input depth images for all network evaluations are extracted from the raw depth image as a fixed-size cube around the hand and resized to 100 × 100. Note that although we have nine views of Shadow images corresponding to each human pose, during the training of TeachNet we randomly choose one view to feed into the robot branch. For the HandIK solution, we train the DeepPrior++ network on our dataset; we choose DeepPrior++ because its architecture is similar to a single branch of TeachNet. We obtain the 21 × 3 human joint locations from DeepPrior++ and then use the same mapping method as in Sect. 6.2 to acquire the joint angles of the Shadow hand. The comparative results, shown in Figs. 6.9 and 6.10, indicate that the Single Robot method is the best under all evaluation metrics and is therefore capable of acting as the training "supervisor." Meanwhile, the Teach Hard-Late method outperforms the other baselines, which verifies that the single human branch is enhanced through the additional consistency loss. In particular, under high-precision conditions, only the Teach Hard-Late approach shows an improvement, with an average accuracy below a maximum joint angle error threshold that is 3.63% higher than that of the Single Human method (Table 6.2). We infer that the later feature space of the depth images contains more useful information and that the MSE alignment provides stronger supervision in our case. The regression-based HandIK method shows the worst performance under all three metrics. The unsatisfying outcome of the HandIK
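The sketch below illustrates the first evaluation metric mentioned above, the fraction of frames whose maximum (or average) joint angle error stays below a threshold. The array shapes and example thresholds are illustrative.

import numpy as np

def fraction_below_threshold(pred, gt, thresholds, reduce='max'):
    """pred, gt: arrays of shape [n_frames, n_joints] with joint angles in radians.
    Returns, for each threshold, the fraction of frames whose per-frame error
    (max or mean over joints) is below that threshold."""
    err = np.abs(pred - gt)
    per_frame = err.max(axis=1) if reduce == 'max' else err.mean(axis=1)
    return np.array([(per_frame < t).mean() for t in thresholds])

# example: accuracy at the high-precision thresholds reported in Table 6.2
# acc = fraction_below_threshold(pred_angles, gt_angles, thresholds=[0.1, 0.15, 0.2])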
Fig. 6.9 The fraction of frames whose maximum/average joint angle/distance error is below a threshold for the Teach Hard-Late approach and different baselines on our test dataset. These results show that the Teach Hard-Late approach has the best accuracy over all evaluation metrics
Fig. 6.10 (Left) Comparison of average angle error on the individual joint between the Teach Hard-Late approach and different baselines on our test dataset. FF means the first finger, LF means the little finger, MF means the middle finger, RF means the ring finger, TH means the thumb. (Right) The kinematic chain of the Shadow hand. In this work, joint 1 of each finger is stiff

Table 6.2 Accuracy under high-precision conditions

Max err.           0.1 rad   0.15 rad   0.2 rad
Single human       21.24%    45.57%     69.08%
Teach soft-early   12.31%    38.06%     63.18%
Teach soft-late    12.77%    10.37%     26.16%
Teach hard-early    7.40%    24.67%     45.63%
Teach hard-late    24.63%    50.11%     72.04%
Hand IK             0.00%     0.14%      0.62%
solution is not only due to our network giving a better representation of the hand features but also due to the fact that this method does not consider the kinematic structure and the specific limitations of the robot. Furthermore, direct joint angle regression should have decent accuracy on angles since the angles are its learning objective, and the missing Lphy also contributes to the poor accuracy of HandIK. Moreover, Fig. 6.10 demonstrates that the second joint, the third joint, and the base joint of the thumb are harder to learn. These results arise mainly because (1) the fixed distal joints of the robot in our work affect the accuracy of its second and third joints; (2) these joints have a larger range than the other joints, especially the base joint of the thumb; and (3) there is a large discrepancy between the human thumb and the Shadow thumb. To verify the reliability and intuitiveness of our method, real-world experiments are performed with five adult subjects. The slave hand of our teleoperation system is the Shadow dexterous hand, whose first joint of each finger is fixed. The depth sensor is an Intel RealSense F200, which is suitable for close-range tracking. The poses of the teleoperators' right hand are limited to the viewpoint range of [70°, 120°] and the distance range of [15 cm, 40 cm] from the camera. Since vision-based teleoperation is susceptible to lighting conditions, all experiments are carried out under a uniform, bright light source as far as possible. The average computation time of the Teach Hard-Late method is 0.1051 s
(Alienware 15 with Intel Core i7-4720HQ CPU). Code and video are available at https://github.com/TAMS-Group/TeachNet_Teleoperation. The five novice teleoperators stand in front of the depth sensor, perform the digits 0–9 in American Sign Language as well as random common gestures in arbitrary order, and thereby teleoperate the simulated Shadow robot. The operators do not need to know the control mechanism of the robot and simply perform the gestures naturally. Qualitative results of teleoperation with the Teach Hard-Late method are illustrated in Fig. 6.11. We can see that the Shadow hand vividly imitates the gestures of human hands of different sizes. These experiments demonstrate that TeachNet enables a robot hand to perform continuous, online imitation of the human hand without explicitly specifying any joint-level correspondences. Because we fix the two wrist joints of the Shadow hand, it does not matter whether the depth sensor captures the teleoperator's wrist. However, visible errors occur mainly at the second and third joints of the fingers and the base joint of the thumb, which are probably caused by the special kinematic structure of the slave, occlusions, and uncertain lighting conditions. We compare the Teach Hard-Late method with the DeepPrior++-based HandIK method on a slave robot. To simplify our experiments, we set the control mode of the robot to trajectory control with a proper maximum force for each joint. We use an in-hand grasp and release task as a metric for usability. We place a series of objects, one at a time, in the open slave hand so that they can be grasped easily with the robotic fingers, and ask the subjects to grasp them and then release them. The objects used for the grasp and release tasks are a water bottle, a small mug, a plastic banana, a cylindrical can, and a plastic apple. We require the operators to use a power grasp for the water bottle and the mug, and a
Fig. 6.11 Teleoperation results using the Shadow hand on real-world data. (a) Successful teleoperation results. (b) Failed teleoperation results
Table 6.3 Average time (s) a novice took to grasp and release an object

Methods   Bottle   Mug     Banana   Can     Apple   Average
Hand IK   44.15    46.32   35.78    25.50   30.22   36.394
Ours      23.67    18.82   25.80    19.75   15.60   20.728
precision grasp for the other objects. If a user does not complete the task within 4 min, they are considered unable to grasp the object. Table 6.3 numerically shows the average time a novice takes to grasp and release an object using each of the control methods. We find that the low accuracy, especially for the thumb, and the post-processing of the HandIK solution result in a longer time to finish the task. With HandIK, the users need to open the thumb first and then perform a proper grasp action, so this solution performs worse on objects with a large diameter. In addition, grasping the banana takes the longest time with our method because this long, narrow object requires more precise fingertip positioning.
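The Average column of Table 6.3 is simply the mean of the five per-object times, which is easy to verify from the table values alone:

```python
times = {
    "Hand IK": [44.15, 46.32, 35.78, 25.50, 30.22],
    "Ours":    [23.67, 18.82, 25.80, 19.75, 15.60],
}
for method, t in times.items():
    print(method, round(sum(t) / len(t), 3))  # -> 36.394 and 20.728, as in Table 6.3
```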
6.4.2 Experimental Results of Robotic Arm

A visual teleoperation network evaluation experiment and a robotic experiment are provided in this section to demonstrate the effectiveness of the proposed visual teleoperation framework. In the network evaluation experiment, the human–robot posture-consistent dataset generated by the human–robot posture-consistent mapping method is used to evaluate the visual teleoperation network. Furthermore, a Baxter robot is used in the robotic experiment to verify the reliability and intuitiveness of the visual teleoperation framework. As shown in Fig. 6.12, the performance of the multi-stage visual teleoperation network mainly depends on two parts: the human body estimation part and the human–robot posture-consistent mapping part. The multi-stage network structure offers several advantages for visual teleoperation. Firstly, any of the three types of data (human depth image, human arm keypoint positions, or robot arm posture) can be given to the multi-stage visual teleoperation network as input to output robot arm joint angles. In other words, a human depth image sampled by a depth camera or a human arm
Fig. 6.12 Human body estimation part and human–robot posture-consistent mapping part
pose sampled by a wearable sensor can be used as the input of the multi-stage visual teleoperation network to output robot arm joint angles. Secondly, the human–robot posture-consistent mapping part of the multi-stage visual teleoperation network can be replaced to realize visual teleoperation of different robot arms. Moreover, the human body estimation part can be replaced by various pre-trained human body estimation networks to improve the accuracy of human body estimation. The generalization of the human body estimation part in the proposed multi-stage visual teleoperation network still needs to be enhanced because of insufficient data in the training set, so various human body estimation networks can be plugged in as the human body estimation part of the multi-stage visual teleoperation network. Two methods are proposed to improve the generalization of the human–robot posture-consistent mapping part of the multi-stage visual teleoperation network. (1) 9573 human arm keypoint position samples from the UTD-MHAD are selected as the dataset. The human arm keypoint positions are translated into robot arm postures, and the robot arm joint angles are then obtained from the robot arm postures by solving constrained nonlinear matrix functions. In this way, the data tuples (human arm keypoint position, robot arm posture, robot arm joint angle) are obtained with the proposed human–robot posture-consistent mapping method; it takes 7 days to generate this dataset using Matlab. (2) One million sets of robot arm joint angles are generated by randomly sampling the six robot arm joint angles within their joint constraints. Each set of joint angles is transformed into human arm keypoint positions and a robot arm posture by the robot forward kinematics. In this way, one million data tuples (human arm keypoint position, robot arm posture, robot arm joint angle) are obtained to supplement the dataset; it takes 2 days to generate this dataset using Python. The first method generates a small amount of data, but the data are more anthropomorphic; the second method generates a large amount of data, but some of the postures cannot be achieved by a human body. In this part, we analyze the regression prediction ability (i.e., the ability to establish a highly nonlinear mapping between a human depth image and robot arm joint angles) of the visual teleoperation network. The proposed visual teleoperation network is evaluated on the human–robot tuple dataset with the following experiments: (1) To exhibit the advantages and availability of the overall loss function, the human arm keypoint position loss and the robot arm posture loss are removed separately during the training of the visual teleoperation network. (2) To validate the significance of the skeleton point estimation stage, this stage is removed during the training of the visual teleoperation network. Two evaluation metrics are used in this chapter: (1) the proportion of the test tuple dataset whose maximum/average joint angle errors are below a threshold, for different loss functions; (2) the average joint angle error over all joint angles, for different overall loss functions. The UTD-MHAD dataset, which contains 27 actions performed by 8 subjects, is used to train the visual teleoperation network in this chapter. There are four human data modalities (depth image, RGB image, skeleton points, inertial data) in the UTD-MHAD dataset. The 3 × 4 human arm keypoint positions are obtained from the 3 × 25 skeleton points of the UTD-MHAD dataset. Then, the 3 × 3 robot arm
postures and the 1 × 6 robot joint angles are generated by the human–robot tuple dataset generator. Each 512 × 424 depth image in the UTD-MHAD dataset is cropped to 252 × 252. Finally, the training input data (a 252 × 252 depth image), the local label data (3 × 4 human arm keypoint positions and 3 × 3 robot arm postures), and the training label data (1 × 6 robot joint angles) are obtained for training the visual teleoperation network. In Figs. 6.13 and 6.14, L = λ_se L_se + λ_rp L_rp + λ_rg L_rg + λ_phy L_phy, L1 = λ_se L_se + λ_rg L_rg + λ_phy L_phy, L2 = λ_rp L_rp + λ_rg L_rg + λ_phy L_phy, L3 = λ_rg L_rg + λ_phy L_phy, and L4 = λ_se L_se + λ_rp L_rp + λ_rg L_rg. Figures 6.13 and 6.14 compare the visual teleoperation network trained with these different overall loss functions. The results verify that the regression prediction ability of the visual teleoperation network is enhanced by the multi-stage loss function. The proposed overall loss function L performs better than the others because the local label data supervised by the multi-stage loss function give the teleoperation network a correct direction of convergence. Furthermore, the comparative results shown in Figs. 6.13 and 6.14 indicate that the physical constraints loss L_phy is also helpful for training the visual teleoperation network. The skeleton point estimation stage is a key stage of the visual teleoperation network because of the highly nonlinear mapping between a 252 × 252 depth image and the 3 × 4 human arm keypoint positions. As shown in Fig. 6.15, a teleoperation network that contains only the robot arm posture estimation stage and the robot arm joint angle generation stage is trained to learn a mapping between the 3 × 4 human arm keypoint positions and the 1 × 6 robot joint angles. It is concluded that the error of
Fig. 6.13 The proportion of the test tuple dataset whose maximum joint angle errors are below a threshold with different training loss functions
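A minimal PyTorch-style sketch of composing the overall loss L = λ_se L_se + λ_rp L_rp + λ_rg L_rg + λ_phy L_phy is given below. The quadratic joint-limit penalty used here for L_phy and the weights are assumptions for illustration, not necessarily the exact forms used in this chapter.

```python
import torch
import torch.nn.functional as F

def multi_stage_loss(pred_keypoints, gt_keypoints,    # skeleton point stage   (L_se)
                     pred_posture, gt_posture,        # arm posture stage      (L_rp)
                     pred_angles, gt_angles,          # joint angle regression (L_rg)
                     joint_lower, joint_upper,        # 1 x 6 joint limits
                     lam=(1.0, 1.0, 1.0, 0.1)):       # illustrative weights
    l_se = F.mse_loss(pred_keypoints, gt_keypoints)
    l_rp = F.mse_loss(pred_posture, gt_posture)
    l_rg = F.mse_loss(pred_angles, gt_angles)
    # Physical-constraints term (assumed form): zero inside the joint limits,
    # quadratic penalty for predicted angles that leave the allowed range.
    l_phy = (torch.clamp(joint_lower - pred_angles, min=0) ** 2
             + torch.clamp(pred_angles - joint_upper, min=0) ** 2).mean()
    return lam[0] * l_se + lam[1] * l_rp + lam[2] * l_rg + lam[3] * l_phy
```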
Fig. 6.14 The proportion of the test tuple dataset whose average joint angle errors are below a threshold with different training loss functions
Fig. 6.15 The proportion of the test tuple dataset whose average joint angle errors are below a threshold by different networks
Fig. 6.16 Comparison of average angle error on the individual robot arm joints with different training loss functions
the visual teleoperation network is mainly caused by the skeleton point estimation stage. Figure 6.16 shows that the Baxter arm angles θ5 and θ6, which depend on the posture between the wrist and handtip points, are harder to learn. The robot arm joint errors in the proposed visual teleoperation are mainly caused by three factors. The first is measurement error in the sensor input. The second is error in the mapping between the sensor information and the robot arm joint angles introduced by the pre-trained multi-stage visual teleoperation network. The third is that the robot arm cannot perfectly track a desired angle, owing to the performance of the controller at the dynamic level. Here, we mainly analyze the errors caused by the multi-stage visual teleoperation network. (1) θ5 and θ6 are calculated from the direction angle between the wrist and handtip points of the human. The human arm keypoint positions are obtained from the human depth image by the SDK of the Kinect sensor, and there may be errors in the wrist and handtip points of the training dataset because the human hand occupies very few pixels in the human depth image. (2) The performance of the multi-stage visual teleoperation network mainly depends on two parts: the human body estimation part and the human–robot posture-consistent mapping part. The joint angle errors of θ5 and θ6 are mainly caused by the human body estimation part; the maximum errors of θ5 and θ6 are less than 0.05 rad when only the human–robot posture-consistent mapping part is learned. The average error is small, so the maximum error is caused by a small portion of the data in the test set. A mechanism is proposed to avoid the effect of this maximum error due to insufficient generalization of the multi-stage visual teleoperation network. The desired robot arm joint angles are output by the multi-stage visual teleoperation network every 0.15 s for continuous control of the robot arm. Let Δθ = θ_{t+1} − θ_t denote the difference between the desired robot arm joint angles at time t and time t + 1; θ_{t+1} is discarded when Δθ ≥ 0.2 rad (a minimal sketch of this filter is given below). A manipulation experiment is conducted to show the intuitiveness and effectiveness of the visual teleoperation framework. A 7-DoF arm of the Baxter robot is used in this part, with the 7th DoF fixed. Only the left arm of the Baxter robot is used because the subjects perform actions only with their left arms in the UTD-MHAD dataset. The human body data are obtained by an Azure Kinect sensor.
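A minimal sketch of the jump-rejection mechanism mentioned above is given here; the 0.2 rad step limit and the 0.15 s command period come from the text, while everything else is illustrative.

```python
import numpy as np

class JumpFilter:
    """Reject a commanded joint-angle vector if any joint would change by
    0.2 rad or more between consecutive 0.15 s commands."""
    def __init__(self, max_step=0.2):
        self.max_step = max_step
        self.last = None

    def __call__(self, theta_next):
        theta_next = np.asarray(theta_next, float)
        if self.last is None or np.all(np.abs(theta_next - self.last) < self.max_step):
            self.last = theta_next          # accept the new command
        return self.last                    # otherwise keep the previous one

filt = JumpFilter()
print(filt([0.10, 0.20, 0.0, 0.0, 0.0, 0.0]))  # accepted
print(filt([0.10, 0.50, 0.0, 0.0, 0.0, 0.0]))  # joint 2 jumps 0.3 rad: rejected
```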
Any of the three types of data (human depth image, human arm keypoint positions, or robot arm posture) can be given to the multi-stage visual teleoperation network as input to output robot arm joint angles. The robotic experiments are performed in two ways.
1. 80% of the dataset is used as the training set and 20% as the test set. The human depth images of the test set are continuously fed into the multi-stage visual teleoperation network to simulate the teleoperation process in Gazebo. Figure 6.17 shows qualitative results of this simulation experiment. The Baxter arm imitates different postures of the human arm, which demonstrates that the visual teleoperation framework enables a robot arm to perform continuous, online imitation of human arm postures. Furthermore, human depth images from different people are used in the simulation experiment, which shows the generalization ability and reliability of the visual teleoperation.
2. Human keypoint positions are obtained directly from the Azure Kinect DK, and the robot arm postures are then calculated from these keypoint positions. The robot arm posture data are continuously fed into the multi-stage visual teleoperation network to teleoperate the Baxter arm in Gazebo. To simplify the manipulation experiments, the control mode of the Baxter arm is set to joint angle control with a proper maximum force for each joint. A teleoperator performs a series of tasks (i.e., swipe left arm, basketball shoot, and draw circle), which are collected as human keypoint position data in real time. The human keypoint positions are mapped to Baxter arm joint angles by the visual teleoperation network. Timestamps are added at the beginning and end of the code to record the time required to complete an action in teleoperation: one action of the visual teleoperation takes less than 0.15 s with the multi-stage visual teleoperation network. The time-consuming part of visual teleoperation is translating human arm keypoint positions into robot arm joint angles. Without a neural network, it takes 30–60 s to translate human arm keypoint positions into robot arm joint angles by solving the constrained nonlinear matrix functions in Matlab, whereas the trained multi-stage visual teleoperation network does it in less than 0.1 s. Unlike the inverse calculation of solving constrained nonlinear matrix functions, the neural network method is a forward calculation (a minimal timing sketch is given after this list).
The teleoperation method proposed in this chapter considers the elbow and wrist points of the robot arm while the task is completed with the end point of the robot arm. When the end points of both arms are used to complete a task, the robot arms may collide if the elbows and wrists are not considered. Moreover, the proposed teleoperation method can prevent the elbow and wrist from hitting the box when the end point of the robot arm is used to complete a grasping task inside a narrow box.
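The latency figures quoted above are obtained simply by timestamping the mapping step. A minimal sketch follows, with a placeholder linear map standing in for the trained multi-stage network; the network itself is not reproduced here.

```python
import time
import numpy as np

def timed(fn, *args):
    t0 = time.perf_counter()
    out = fn(*args)
    return out, time.perf_counter() - t0

# Placeholder for the trained multi-stage mapping: a single linear map from a
# 3 x 3 robot arm posture to 1 x 6 joint angles (illustrative only).
W = np.random.randn(6, 9)
def network_forward(posture):
    return W @ posture.reshape(-1)

angles, dt = timed(network_forward, np.eye(3))
print(f"forward pass: {dt * 1e3:.3f} ms")  # far below the 0.15 s command period
```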
Fig. 6.17 Results of vision-based consistent posture teleoperation
6.5 Summary

This chapter presents a systematic vision-based method for finding kinematic mappings between an anthropomorphic robot and the human. We develop the end-to-end TeachNet and create a dataset containing 400K pairs of human hand depth images, simulated robot depth images in the same poses, and the corresponding robot joint angles. Through the network evaluation and the robotic experiments, we verify the applicability of the Teach Hard-Late method for modeling poses and the implicit correspondence between robot imitators and human demonstrators. The experimental results also show that our end-to-end teleoperation allows novice teleoperators to grasp the in-hand objects faster and more accurately than
the HandIK solution. Then, a visual teleoperation framework has been presented for a robot arm that considers human–robot posture unification. A multi-stage structure of the visual teleoperation network has been proposed to learn a highly nonlinear mapping between a human depth image and robot joint angles. A human–robot tuple data generator has been built based on a novel consistent posture mapping method, and a tuple dataset containing training input data, local label data, and training label data has been generated by this generator. Through the network evaluation and the robotic experiments, the applicability of the visual teleoperation framework has been demonstrated. Although the proposed visual teleoperation framework performs well in the experiments, it has some limitations. Firstly, as much tuple data as possible is required to improve the performance of the visual teleoperation network, but a sufficiently large and suitable tuple dataset for training it is not yet available; we would like to extend our dataset with a generative adversarial network in the future. Secondly, the error of the visual teleoperation network is mainly caused by the skeleton point estimation stage; therefore, various human pose estimation networks could be incorporated into this stage to further improve the performance of the visual teleoperation framework.
Chapter 7
Learning from Wearable-Based Indirect Demonstration
Abstract Compared with direct ways of learning human experience such as teleoperation, this chapter focuses on indirect manipulation demonstration using wearable devices. It is expected that robots can learn human manipulation experience from the collected human manipulation data. In this chapter, we first introduce three kinds of approaches for collecting human manipulation data with wearable devices; then, an effective method is illustrated for building the experience model of human manipulation.
7.1 Introduction

Grasp planning tries to solve the problem of finding a grasp configuration that ensures the stability of the grasped object during a manipulation task. Finding an appropriate grasp among an infinite set of candidates is still a challenging task, and several approaches have been presented for robotic grasping. Usually, grasping approaches are categorized into three groups: grasping known, unknown, and familiar objects. In approaches for known objects, each object has a database of grasp hypotheses from which the robot can find a proper grasp pose by estimating the object pose and then filtering the hypotheses by reachability. The major drawback of these approaches is that it is impossible to put the models of all objects into the robot's database, since the objects are only partially visible to the robot. On the other hand, the approaches used for grasping unknown objects usually rely on heuristics that relate the structures extracted from a partial view of the object to the grasp candidates. In a household environment, it is rather difficult for the robot to know all the different situations. Thus, the desired robot system should be capable of continuously learning about new objects and how to manipulate them while it is operating in the environment. This requires the ability to learn a model that can generalize from previous experiences to new objects. Grasping approaches for familiar objects address these scenarios. In this chapter, we aim to extract important grasp cues from humans' dextrous manipulation and then generalize these grasp experiences to new familiar objects. Mainly, the following
three questions are essential to be resolved: What kind of data should be extracted regarding one object and a grasp? What are the features that best encode object affordances? How can a grasp model be built that generalizes the grasp experience to new familiar objects? Several approaches have been developed for grasping familiar objects [1–5]. Let us see how they address the above three questions. Geng et al. [6] extract the objects' size and pose together with three synergies from human grasping experiments, where the recordings consist of the position and orientation of the wrist, the 3D positions of the five fingertips, and all 14 joints of the five fingers; these recorded data are input to a neural network that outputs the synergy coefficients for the hand approaching vector and the hand grasping posture; finally, a fingertip mapping model is built for transferring the human grasp experience to the robot hand. The disadvantage of this approach is that it lacks precise grasp point generalization. Diego et al. [7] extract relevant multisensor information from the environment and the hand and develop a probabilistic distribution approach to associate the grasps with the defined object primitives. Unfortunately, the object model is built with superquadrics, which are only suitable for objects with regular shapes. In [8], a method is proposed for transferring grasps between objects of the same functional category. The transfer is achieved by warping the surface geometry of the source object onto the target object, and along with it the contact points of a grasp; the warped contacts are then locally replanned. However, the method is only verified in simulations on complete mug models. Guido et al. [9] build a mapping function between the postural synergies of the human hand and synergistic control actions of robotic devices by using a virtual object method. This approach avoids the problem of dissimilar kinematics between human-like and robotic hands. Although many approaches have been presented for learning manipulation skills from demonstrations, some problems remain to be solved: the model learned from demonstrations should fit different robot hands, and it should be able to imitate human grasping postures while remaining suited to the kinematics of the robot hands. This chapter tries to solve those problems with the following approaches. Instead of transferring all the contact points from the human demonstration to the robot hand, we prefer to learn key cues for grasp experience generalization. It is obvious that the opposable thumb plays an important role in humans' dexterous manipulation skills. Cutkosky [10] developed a hierarchical tree of grasp types to facilitate the control of robotic hands for various manipulations. Almost all of the grasp types in this taxonomy are formed by the thumb and some other fingers; the only exception is the non-prehensile grasp, which does not require the thumb at all. In [11], both the grasp type and the thumb fingertip's position and orientation are extracted from human demonstrations for transferring grasp skills to the robot's grasp planning. The main advantage of this approach is that it imposes only a loose grasp constraint, keeping the flexibility of the robot hand while conveying important human intentions to the robotic grasping.
However, in [11], how to learn the recorded thumb contact points on objects of various shapes has not been considered, especially when the objects are only partially observable. On the other hand, the planned grasp postures
sometimes are not as natural as human grasps because of position variation of the wrist, particularly for precision grasps, which use only the fingertips (see the discussion in [11]). Given these advantages and disadvantages, this chapter proposes a grasp planning method that utilizes the positions and orientations of both the wrist and the thumb, learned from human experience, for hand configuration. Specifically, we first present a grasp point learning approach for recognizing the grasp points among familiar objects; then, a hand configuration method is built based on the grasp position of the thumb fingertip, the approaching direction, and the position of the wrist. As for the problem of grasp point generalization, many studies based on multiple kinds of sensing data have been presented. In [12], the grasp point is learned on 2D images by using neural networks. Redmon and Angelova [13] learn the grasp points by employing a deep learning method on an RGB-D dataset; in [13], a CNN is applied to extract the features of the grasp rectangle on 2D RGB images. This shows that CNNs are suitable for recognizing objects or grasp regions in 2D images, but it is difficult for them to directly extract effective features from 3D point clouds [14–16]. However, when humans grasp objects, the 3D geometry of the objects is one of the key cues for selecting the grasp point. Therefore, effectively applying the 3D point clouds of objects for learning the grasp region is an important trend in robot grasping, and many studies have been trying to solve this problem. Some of them employ shape primitives or superquadrics in 3D space to represent objects [17]; the shortcoming is that it is difficult to exactly describe the shapes of objects, especially objects with complex shapes. Recently, many 3D shape descriptors, e.g., the Signature of Histograms of OrienTations (SHOT), Fast Point Feature Histograms (FPFH), and Spin Images, have been proposed for object classification. Most of these descriptors can cope with changes of viewpoint, noise, and size. The comparative studies of these descriptors in [18] have shown that SHOT performs best in terms of both object recognition accuracy and computation time. Hence, in this chapter, the SHOT 3D descriptor is utilized for grasp point representation and classification among familiar objects. Additionally, in reality, the objects are only partially observable under a single-view camera arrangement because of self-occlusion. This is a problem for methods that extract the positions of all the contact points to construct the mapping between the human and the robot hands. Some of them reconstruct the objects [19–21], and others register the objects by using the ICP (Iterative Closest Point) method with the full 3D object models in the Columbia database [18, 22]. The shortcoming of the former approach is that it needs a shape symmetry hypothesis, and that of the latter is that similar objects must exist in the Columbia database. To avoid these disadvantages, we present an approach that only detects the thumb fingertip position on the partial point cloud by representing the 3D points with the SHOT descriptor. Thus, corresponding to the three questions at the end of the third paragraph, we extract the positions and orientations of the thumb fingertips and the wrist; then, we present a method for recognizing the grasp point given the objects' shape
affordance; finally, we propose an effective approach to grasp objects of various shapes by integrating the human experience of the thumb and the wrist. Note that the grasped objects are not very large or heavy, and precision grasps (performed with the fingertips) are mainly considered. The main contributions of this work include the following: (i) a grasp point learning method employing a 3D SHOT descriptor based on a sparse set of example objects, which achieves better grasp point generalization; (ii) a grasp planning approach that extracts key cues (the thumb fingertip and the wrist) from human demonstration. The latter has multiple advantages: (ii-1) it significantly decreases the DoFs for the multi-finger configuration; (ii-2) it reduces the search space of the wrist; (ii-3) it imposes only a loose grasp constraint, keeping the flexibility of the robot hand; and (ii-4) it needs no reconstruction or registration and can be performed directly on the partial point cloud of objects. Figure 7.1 demonstrates the pipeline of the proposed grasp planning method. The object recognition modules are responsible for recognizing the object's category and the grasp point. Several references [13, 23] have verified that the grasp points on objects have an important relation with object categories: familiar objects belonging to the same shape category have similar relative grasp positions. The flow of Fig. 7.1 is as follows: firstly, the point clouds of objects captured by a Microsoft Kinect serve as input to the object detection module; then, the feature extractor module represents the point clouds with SHOT shape descriptors; taking these descriptors and the object categorization model from the object database, the object recognition module categorizes the objects; afterwards, the grasp point recognition module detects the grasp point of the object under the object's category label; transferring the classification results to the modules of hand
Fig. 7.1 The pipeline of the proposed method
configuration, the optimum stable grasp configuration is achieved by minimizing an objective function built from the inverse kinematics of the "thumb" finger, the identified graspable point, and the wrist constraints associated with the object grasp database; finally, the grasp execution module carries out the experiments using a 7-DoF Schunk arm and a 3-finger Barrett Hand. This chapter is organized as follows. Section 7.3 illustrates the proposed grasp planning approach: Sect. 7.3.1 presents the "thumb" graspable point recognition and generalization approach based on a modified SHOT descriptor; Sect. 7.3.2 describes the grasp model based on "thumb" kinematics and wrist constraints; and Sect. 7.3.3 further discusses the crucial problems concerning the wrist constraints. In Sect. 7.4, experimental results verify the effectiveness of the proposed method, and finally Sect. 7.5 gives the conclusion and future work.
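The flow of Fig. 7.1 can be summarized as a small code skeleton. Every function below is a trivial stub standing in for the corresponding module; none of the names is an actual API from this chapter.

```python
import numpy as np

def extract_shot_features(cloud):                    # feature extractor module
    return np.zeros((len(cloud), 352))               # one 352-D SHOT descriptor per point

def classify_object(features, object_db):            # object recognition module
    return "mug"                                     # category label from the trained model

def recognize_grasp_point(features, category):       # grasp point recognition module
    return np.array([0.0, 0.0, 0.1])                 # identified "thumb" graspable point

def configure_hand(grasp_point, wrist_constraints):  # hand configuration module
    return {"thumb_target": grasp_point, "wrist": wrist_constraints}

def grasp_planning_pipeline(cloud, object_db, grasp_db):
    feats = extract_shot_features(cloud)
    category = classify_object(feats, object_db)
    point = recognize_grasp_point(feats, category)
    return configure_hand(point, grasp_db.get(category))

print(grasp_planning_pipeline(np.random.rand(2000, 3),
                              object_db={}, grasp_db={"mug": "top approach"}))
```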
7.2 Indirect Wearable Demonstration

In this section, we introduce three ways of collecting human manipulation data with wearable devices. The collected data are utilized to build models of human manipulation experience; hence, robot grasping operations based on these experience models are referred to as indirect wearable demonstration.
The First Way In robotics, the analysis of human movements has been applied in research areas related to task learning by imitation of human demonstrations. In [24], different types of data are captured from humans, and multiple data acquisition devices are installed in the experimental area. The acquired data are used to model and extract the relevant aspects of the human demonstration. The system records human gaze, the 6D pose of the hand and fingers, finger flexure, tactile forces distributed on the inside of the hand, color images, and a stereo depth map. Figure 7.2 shows the experimental area, the data acquisition devices (stereo and monocular cameras, magnetic tracker, hand flexure sensor, tactile sensor, eye tracker), and the available objects. In this approach, the hand trajectory data are acquired by the tracking devices, which provide 6D pose data at 30 Hz. The trackers are attached to the fingertips and the back of the hand. Then, in order to identify the different phases in the human manipulation demonstrations, these pose data are used to calculate the hand trajectory curvatures and the hand orientation along each phase for feature extraction. In detail, the trajectory directions are divided into 8 classes depending on the curvatures, i.e., C ∈ {up, down, left, right, up-left, up-right, down-left, down-right}, and the hand orientation is represented as O ∈ {top, side, hand-out}. In [24], it is held that in the different action phases (e.g., reach, load, lift, hold, replace, unload, release), different types of signals (position/orientation of the fingers, distal phalanges, and wrist; joint flexure level; tactile sensing) dynamically change their role and importance in the control of the object manipulation strategies.
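A minimal sketch of the 8-way trajectory-direction quantization described above is given below; the axis convention (x to the right, y upwards in the frontal plane) is an assumption, not a detail specified in [24].

```python
import numpy as np

DIRECTIONS = ["right", "up-right", "up", "up-left",
              "left", "down-left", "down", "down-right"]

def direction_label(p_prev, p_curr):
    """Quantize a hand displacement in the frontal plane into one of the
    eight trajectory-direction classes C used in [24]."""
    dx, dy = np.asarray(p_curr[:2], float) - np.asarray(p_prev[:2], float)
    angle = np.degrees(np.arctan2(dy, dx)) % 360.0
    return DIRECTIONS[int(((angle + 22.5) % 360.0) // 45.0)]

print(direction_label([0.0, 0.0], [0.1, 0.1]))  # -> "up-right"
```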
Fig. 7.2 The experimental area, data acquisition devices, and objects available
Fig. 7.3 (a) Representation of the 15 tactile regions defined in the human hand. (b) Six of the predefined grasp configurations used to estimate the static contact templates
Hence, it is possible to categorize the typical prehensile patterns of the hand that humans use to hold daily-life objects. In their work, the tactile intensity is considered one of the fundamental variables for distinguishing the different stages of a manipulation task. Therefore, by using tactile sensor information, they achieve a symbolic-level generalization of manipulation tasks from human demonstrations. As shown in Fig. 7.3a, the tactile sensing device consists of 360 sensing elements distributed over the palm and finger surfaces. The sensing elements are grouped into 15 regions corresponding to different areas of the hand. Tactile signatures of some manipulation primitives are used for the classification and identification of manipulation tasks. A variable Ti is assigned to each of these regions: T = {T1, T2, ..., T15}. The domain of each variable can be defined as Ti ∈ {NotActive, LowActive, HighActive}, which expresses the level of activation of that region during the in-hand manipulation task. The three levels of tactile contact are employed to distinguish the different stages of interaction with the object. For example, NotActive: pre-grasp,
Fig. 7.4 Conditional probability density distribution of the primitives. (a) primitive 1; (b) primitive 2; (c) primitive 3; (d) primitive 4; (e) primitive 5; (f) primitive 6; and (g) primitive 7
transition between successive re-grasps; LowActive: initial contact with the object, hand region partially involved in this stage of the task; HighActive: hand region highly involved in this stage of the task. Seven tactile primitives are selected as a subset that can be used to demonstrate the proposed concept for this type of manipulation task. Note that during the experiments, the tactile sensing device was permanently attached to the glove, so all subjects wore the tactile sensors in the same positions. The contact-state template primitives are estimated from seven different grasp configurations; six of them are shown in Fig. 7.3b, and the remaining one corresponds to no contact between the hand and the object. To estimate the parameters of the templates, several human demonstrations of the different static contact configurations are performed by a subject (see Fig. 7.4). Segmenting a task into action phases is useful for characterizing each movement of the task as well as for understanding the behavior of the hand in each phase. In [24], it is determined that simple tasks (e.g., object displacement) can be composed of the following action phases: reach, load, lift, hold/transport, unload, and release. Figure 7.5 illustrates an example of action phases in a simple homogeneous manipulation task, where the in-hand manipulation is replaced by a fixed-grasp transport phase. By observing the multi-modal data, some assumptions can be made to find these phases during a task. For example, in the reaching phase there is no object movement, the load phase is active when there is tactile information, and the transport phase is active when the object is moving. By using the timestamps, the multi-modal data are analyzed to determine the state of each sensor at a specific time.
The Second Way In [6], the aim is to transfer human grasping skills to a robot hand–arm system via the use of synergies. The transfer involves two stages. Firstly, they extract the synergies from human grasping data and train a neural network
Fig. 7.5 Example of action phases in a simple homogeneous manipulation task, where the same grasp is employed during the manipulation
Fig. 7.6 Grasp example
with the data. Secondly, they design a mapping method that maps the fingertip positions of the human hand to those of the robot hand. For extracting synergies from human grasping data, the subject uses two or three fingers (thumb, index, middle) to make 60 grasps of the objects. The positions of the hand joints in the grasping postures are recorded with a Shapehand data glove. The position and orientation of the wrist in a fixed world frame are recorded with a Polhemus Patriot magnetic sensor, which includes a magnetic sensor fixed on the wrist of the subject and a source fixed at the origin of the world frame on a table. The system calculates online the 3D position and orientation of the magnetic sensor by measuring the relative magnetic flux of three orthogonal coils on both the sensor and the source. The sensor output has 6 DoFs: the X, Y, and Z Cartesian coordinates and the orientation (azimuth, elevation, and roll) (Fig. 7.6).
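A minimal sketch of turning one such 6-DoF tracker sample (x, y, z, azimuth, elevation, roll) into a 4 × 4 wrist pose is given below; the intrinsic Z-Y-X Euler convention used here is an assumption, not a specification from [6].

```python
import numpy as np
from scipy.spatial.transform import Rotation

def wrist_pose(x, y, z, azimuth, elevation, roll):
    """Homogeneous wrist pose from the tracker's 6-DoF output (angles in degrees)."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("ZYX", [azimuth, elevation, roll],
                                    degrees=True).as_matrix()
    T[:3, 3] = [x, y, z]
    return T

print(wrist_pose(0.25, 0.10, 0.05, azimuth=30.0, elevation=10.0, roll=-5.0))
```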
The data are recorded during the whole process (from the start of reaching until the completion of grasping, including reaching, pre-shaping, and grasping) of each grasping trial on various objects. At the same time, the following object features are recorded: the position of the center, the size, and the pose. The Shapehand data glove records the 3D positions of 19 points of the hand: the five fingertips and all 14 joints of the five fingers. Using these positions, the angular positions of all the joints of the hand are calculated. In the grasping experiments, the human subject grasps the objects using three fingers (thumb, index, middle) with the data glove, since these three fingers are mapped to the robot hand. The total number of DoFs of these fingers is 12. PCA is employed to extract synergies of human grasp postures. PCA is a common technique for finding patterns in high-dimensional data: it transforms a number of correlated variables into a small number of uncorrelated variables called principal components. The data from the grasp postures and the hand approaching movement are processed with PCA in [6]. Correspondingly, two groups of synergies are extracted from these two datasets: one for the hand grasping postures and the other for the hand approaching (i.e., an approaching vector at the wrist). The synergies extracted from the wrist position and orientation data are used directly on the robot hand for the approach control. Then, a feed-forward neural network is trained with the grasping data using a batch gradient descent algorithm. The inputs of the neural network are the features of the objects used in the human grasping experiments. The outputs are the coefficients of the first three principal components extracted from the grasping posture data and the hand approaching data, which can reconstruct the reaching movements and the grasping postures of the 60 recorded human grasps. In [6], the trained neural network and the synergies are used to represent the correlation between the object features and the grasp synthesis (reaching and grasping) in human grasping.
The Third Way In this chapter, we aim to extract important grasp cues from humans' dextrous manipulation. Different from the above two approaches, here we extract the human grasp point, the approaching direction, and the thumb angle by using a Kinect camera and a data glove (see Fig. 7.7). The data glove [25] is built from IMMUs, which consist of three-axis gyroscopes, three-axis accelerometers, and three-axis magnetometers. There are three IMMUs on each finger, and the orientation of each finger (including the thumb) can be estimated by the data glove. The following section gives the specific learning algorithm for building the model of the human manipulation experience by using the grasp points, the approaching direction, and the thumb angle.
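For the second way described above, the synergy extraction step amounts to a PCA over the recorded 12-DoF grasp postures, keeping the first three principal components. A minimal sketch follows; the data array is random and only stands in for the Shapehand glove recordings.

```python
import numpy as np
from sklearn.decomposition import PCA

postures = np.random.rand(60, 12)       # 60 grasps x 12 joint angles (placeholder data)
pca = PCA(n_components=3).fit(postures)

synergies = pca.components_             # (3, 12) postural synergies
coeffs = pca.transform(postures)        # (60, 3) synergy coefficients per grasp

# A grasp posture is approximately reconstructed from its three coefficients:
reconstructed = pca.inverse_transform(coeffs[0])
print(np.linalg.norm(reconstructed - postures[0]))  # residual of the rank-3 approximation
```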
Fig. 7.7 Extracting important grasp cues from humans’ dextrous manipulation
7.3 Learning Algorithm

7.3.1 Grasp Point Generalization on Incomplete Point Cloud

Based on the assumption that grasp points on familiar objects have similar relative positions when the same tasks are executed, in this subsection we aim to identify the grasp points on 3D point clouds that are only partially observable. Denote the point cloud set of the object by PC and randomly select n points from PC. These n points define the interest point set IP = {Ip_i ∈ R^3}, where i = 1, 2, ..., n and IP ⊂ PC. We describe each of these interest points with a 3D SHOT shape descriptor and utilize these shape features to discriminate the grasp points. The SHOT shape descriptor (see Fig. 7.8) is built by creating a robust local reference frame at each interest point Ip_i based on the surface normal and two directions that define the tangent plane. Under each local reference frame, an isotropic spherical grid centered on the sampled point divides its neighborhood space into partitions along the radial, azimuth, and elevation axes. For one interest point, the SHOT descriptor is a 352-dimensional vector (m = 352 in Fig. 7.8) that represents the 11-bin histograms of cos(θ) in each partition, where θ is the angle between the surface normals at the interest point and at another point in its neighborhood. The comparison of several 3D shape descriptors in [18, 26] shows that the SHOT descriptor has better shape-discriminating ability than many other 3D descriptors. Therefore, we choose the SHOT descriptor to represent the geometry of objects. The number of interest points is 2000, which are randomly selected
Fig. 7.8 Signature of histograms of OrienTations (SHOT)
Fig. 7.9 Learning the grasp point from the demonstration
from the point cloud. The size d of a given point cloud is defined as the maximum distance between two points in the cloud. Each local reference frame is constructed using points that are within 0.05d of the corresponding interest point. To construct the SHOT descriptor, the neighborhood range is set to within 0.5d of each interest point, which precisely describes where this interest point lies relative to the entire object and its surface variations. Figure 7.9 shows the recorded grasp points/grasp regions. The blue points represent the 3D point cloud directly captured by the RGB-D Kinect sensor. The red points are the nearest 30–50 neighbors of the recorded grasp points (the centers of the red regions). The green points represent the grasp regions used for building a repeatable reference frame on the objects (see Sect. 7.3.3). In the process of transferring the grasp experience from the human to the robot, we
represent all the grasp information in the object reference frame. For regular objects, their coordinate frames are established directly on the point cloud, while for a complex object composed of multiple components, we select a region of the object's point cloud with a rectangular box and build the object coordinate frame from the point cloud inside this box (the green region in Fig. 7.9). Note that the grasps happen in the green regions. For grasp point generalization (learning of the red grasp points), we represent each point of the incomplete point clouds with the SHOT descriptor and accumulate these descriptors within an object category as the training features. Then, by labeling the points in the grasp region as 1 and the others as 0, an SVM classifier is learned directly; it takes the SHOT features and returns a binary decision on whether a point of the visible point cloud lies in the graspable region (the region of the object that is suitable for grasping). Thus, based on the objects' geometric features, grasp point identification is generalized among familiar objects of the same category. The approach for grasp region generalization (learning of the green regions) is the same as that for the grasp points (learning of the red grasp points); the only difference is that the points in the green region are labeled as 1 and the others as 0. In order to eliminate isolated misclassifications, a spatial median filter is used to smooth the labels over the partial clouds. Specifically, for each point p with predicted label l_p in the point cloud, we find the set of neighboring points N_p within a small radius r_smf of p and modify the class label using

F = (1/|N_p|) Σ_{q ∈ N_p} l_q,   l_p = 1{F > 0.5}          (7.1)
where 1{·} is the indicator function, which equals 1 if its argument is true and 0 otherwise. Here, F > 0.5 means that, for each point p, if more than half of its neighboring points have label 1, the label of p is set to 1; otherwise it is set to 0. The idea of this filter comes from the assumption that graspable points are closely packed together. The number of neighboring points N_p affects the efficiency of the filter; therefore, it is set as a variable that changes depending on the number of predicted candidate graspable points. Then, denoting the number of filtered candidate grasp points by C_p and a number threshold by C_th: if C_p > C_th, the final grasp point G_p is recognized as the average position of the filtered candidate grasp points; otherwise, G_p is determined by the center of the bounding box C_box together with the filtered candidate grasp points. In the proposed approach, the partial point clouds are first categorized, and the category label then leads to the corresponding grasp point/region identifier. In [27], we propose an effective object classification method based on the following three steps: (1) collecting the SHOT features (3D shape descriptor), the
LR features (local frame features), and RGB features of interest points selected from the partial cloud; (2) reducing the feature dimensionality along the point dimension by calculating some simple statistics, e.g., the range, the mean, the standard deviation, and entropies; (3) feeding the features into the ELM for object classification. It has been verified that the classification rate can reach 97% when the training dataset contains 8–10 classes of objects; for each category, 4 instances are used for training and the 2 remaining ones for testing. In this chapter, the objects tested in the experiments have been correctly classified by this object categorization method.
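A minimal sketch of the grasp-point generalization step of this subsection is given below: an SVM is trained on 352-D SHOT descriptors with binary graspable labels, and the predicted labels are smoothed with the spatial median filter of Eq. (7.1). The descriptors and point coordinates are random placeholders, not real SHOT features.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.svm import SVC

# Training: SHOT descriptors of interest points from the demonstration objects,
# labelled 1 inside the demonstrated grasp region and 0 elsewhere.
X_train = np.random.rand(2000, 352)
y_train = (np.random.rand(2000) < 0.1).astype(int)
clf = SVC(kernel="rbf").fit(X_train, y_train)

# Test object: a partial point cloud and the SHOT descriptor of each point.
points = np.random.rand(1500, 3)
labels = clf.predict(np.random.rand(1500, 352))

# Spatial median filter, Eq. (7.1): a point keeps label 1 only if more than half
# of its neighbours within radius r_smf (the point itself included here) have label 1.
r_smf = 0.05
tree = cKDTree(points)
filtered = labels.copy()
for i, p in enumerate(points):
    neighbours = tree.query_ball_point(p, r_smf)
    filtered[i] = int(labels[neighbours].mean() > 0.5)
print(int(filtered.sum()), "candidate graspable points after filtering")
```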
7.3.2 Grasp Model Built by "Thumb" Finger

The Barrett Hand has three fingers F1, F2, and F3, as shown in Fig. 7.10. Using the joint parameters of the real Barrett Hand, the hand model in Fig. 7.10 is constructed by modifying the three-finger hand model available in the SynGrasp package. Hence, the kinematic relation between the wrist and the fingertip is the same for the model in Fig. 7.10 and for the real Barrett Hand. In this subsection, we present an approach to map the position and orientation of the human thumb tip to finger F3 of the Barrett Hand; finger F3 is thus defined as the "thumb" of the robot hand. In Fig. 7.10, the coordinate frames of the object and the wrist of the robotic hand are denoted by O and f_w, respectively. On finger F3, the frames f31, f32, f33, and f3T denote the origins of each joint and the end effector, respectively.
Fig. 7.10 Kinematics of Barrett hand
Fig. 7.11 Coordinate system of the human hand
Here, z3i and θ3i are defined as the rotational axis and the corresponding rotational angle of joint i, where i = 1, 2, 3. Figure 7.11 illustrates the status of the human thumb tip u_h = [p_h, T_h] ∈ R6 relative to the object coordinate frame O, where p_h ∈ R3 is the position of the contact point and T_h = (Thα, Thβ, Thγ) ∈ R3 gives the orientation of the human thumb tip in Euler angles. Correspondingly, let u_r = [p_r, T_r] ∈ R6 be the status of the robot's "thumb" fingertip relative to the object coordinate frame O (see Fig. 7.10), where p_r ∈ R3 and T_r = (Trα, Trβ, Trγ) ∈ R3 denote the position and the orientation, respectively. How to transfer u_h (the human thumb experience) to u_r is a key problem in the proposed grasp planning model.

For new familiar objects, the grasp points p_r can be recognized by the trained model of Sect. 7.3.1. The rotation angle Trα (see Fig. 7.10) is set to 0 because, for stable grasping, the normal pressure exerted by the human thumb tip is always perpendicular to the contacted surface; this has also been verified in our human demonstrations with a data glove, which show that Thα = 0. Trβ describes the rotation angle when the thumb rotates around the y-axis of the object coordinate frame O. Instead of using Thβ directly as Trβ, we quantize Thβ into two categories (see Eq. 7.2), given the different positions and orientations of the bases of the human arm and the robot arm:

T_{r\beta} = \begin{cases} 0, & \text{if } 0 \le T_{h\beta} < 45 \\ 90, & \text{if } 45 \le T_{h\beta} \le 90 \end{cases}    (7.2)
where the unit of Trβ is degrees. As for Trγ, it is not reasonable to use the human demonstration data Thγ directly because of the kinematic differences between robot and human hands; we therefore leave Trγ as a free variable for determining u_r. Hence, u_r can be written as a function of the single variable Trγ. Representing u_r by a homogeneous matrix U_r,

U_r = \begin{bmatrix} {}^{U}T_r(T_{r\gamma}) & p_r^{T} \\ 0 & 1 \end{bmatrix}    (7.3)
where {}^{U}T_r(T_{r\gamma}) is a 3 × 3 rotation matrix with the variable Trγ, computed from the YXZ Euler angles. For the Barrett Hand model, let {}^{i-1}T_i be the homogeneous transformation matrix between frames i and i − 1, i = 1, 2, 3. Then the forward kinematics are determined by

{}^{O}T_T = {}^{O}T_w \, {}^{w}T_1(\theta_{31}) \, {}^{1}T_2(\theta_{32}) \, {}^{2}T_3(\theta_{33}) \, {}^{3}T_T(\theta_{3T})    (7.4)

where {}^{O}T_T ≡ U_r is the homogeneous transformation matrix between the thumb fingertip frame f3T and the object frame O, and {}^{O}T_w is the homogeneous transformation matrix between the wrist frame f_w and the object frame O. Therefore,

{}^{O}T_w = U_r \left( {}^{w}T_1 \, {}^{1}T_2 \, {}^{2}T_3 \, {}^{3}T_T \right)^{-1}.    (7.5)
According to the Denavit–Hartenberg (D–H) parameters of the forward kinematics, θ31 is a fixed value and θ3T equals 0°. The joint angles θ32 and θ33 are driven by the same motor, so that θ33 = 2θ32. Thus, in Eqs. 7.4 and 7.5, the status of the wrist {}^{O}T_w can be represented as a function of θ32 and Trγ, written as {}^{O}T_w(θ32, Trγ). Supposing that the status of the wrist and the adduction/abduction angle between fingers F1 and F2 have been determined by a group (θ32, Trγ, θ4), the robot closes the three fingers of the hand and stops when the fingers contact the object with a certain grasp force. The joint angles (θ1, θ2) of fingers F1 and F2 are then determined by this closing action. Thus, the three variables (θ32, Trγ, θ4) are enough to determine a hand configuration, and its quality function can be written as

G = f(\theta_{32}, T_{r\gamma}, \theta_4).    (7.6)
The optimum values of (θ32, Trγ) are obtained by solving an optimization problem associated with the wrist status, which is illustrated in the next subsection. After determining θ32 and Trγ, we search for the most suitable θ4 based on the approximated surface normals at the contact points and the grasp shape [28].
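The mapping of Eqs. (7.3)–(7.5) can be sketched in a few lines of Python. The D–H transforms of finger F3 are abstracted into a caller-supplied dh_transform function (the real Barrett Hand D–H parameters are not reproduced here), the assignment of Trβ, Trα, and Trγ to the Y, X, and Z axes of the YXZ Euler convention is an assumption, and the fixed value of θ31 is a placeholder.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def thumb_tip_pose(p_r, T_ralpha, T_rbeta, T_rgamma):
    """Eq. (7.3): homogeneous pose U_r of the robot 'thumb' tip in the
    object frame, with the rotation built from YXZ Euler angles (degrees)."""
    U = np.eye(4)
    U[:3, :3] = R.from_euler("YXZ", [T_rbeta, T_ralpha, T_rgamma],
                             degrees=True).as_matrix()
    U[:3, 3] = p_r
    return U

def finger_chain(theta_32, dh_transform, theta_31_fixed=0.0):
    """Kinematic chain wT1 * 1T2 * 2T3 * 3TT of Eq. (7.4).
    dh_transform(i, theta) is assumed to return the 4x4 D-H transform of
    joint i; theta_33 = 2*theta_32 and theta_3T = 0 as stated in the text."""
    T = np.eye(4)
    for i, theta in enumerate([theta_31_fixed, theta_32, 2.0 * theta_32, 0.0],
                              start=1):
        T = T @ dh_transform(i, theta)
    return T

def wrist_pose(p_r, T_rbeta, T_rgamma, theta_32, dh_transform):
    """Eq. (7.5): O_T_w = U_r (wT1 1T2 2T3 3TT)^(-1), i.e. the wrist pose in
    the object frame as a function of (theta_32, T_rgamma) only, since
    T_ralpha = 0 and T_rbeta is fixed by Eq. (7.2)."""
    U_r = thumb_tip_pose(p_r, 0.0, T_rbeta, T_rgamma)
    return U_r @ np.linalg.inv(finger_chain(theta_32, dh_transform))
```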
7.3.3 Wrist Constraints Estimation

As shown in the previous subsection, only two variables (θ32, Trγ) are needed to decide the wrist position and orientation {}^{O}T_w (see Eq. 7.5), which is a typical stable-grasping optimization problem. Normally, the optimization process consists of three steps: (1) generating multiple candidate groups of (θ32, Trγ) to establish the wrist status; (2) closing the fingers in simulation until the fingertips contact the object's surface in order to evaluate the grasp stability; (3) deciding the optimum
group (θ32*, Trγ*), which yields the best grasp stability. However, this approach has several disadvantages: (1) the searching space for (θ32, Trγ) is large, which reduces the success rate and is time-consuming; (2) the finally established wrist position and orientation are sometimes unlike usual human grasps; (3) for a single-view camera setting, it is difficult to close the fingers and find the correct contact points for stability evaluation because the object models are only partial.

Fig. 7.12 The searching space of the Barrett hand with thumb constraint

We believe that although there are kinematic differences between the human hand and the robot hand, some key grasp cues can be extracted and transferred to the robot for stable grasping. One important cue (see Fig. 7.14) is the approaching direction from the hand to the object. Further, when self-occlusion happens, it is difficult to decide the precise wrist position along the visual occlusion direction because there is no information about the backside of the object. Inspired by the human prediction mechanism of controlled scene continuation [29], we approximate a most likely wrist position along the visual occlusion direction. Building the object frame as in the right part of Fig. 7.12, the y-axis is the visual occlusion direction. Supposing that a possible wrist position along the y-axis, p*_yw, and a desired approaching direction of the wrist, d*_w ∈ R3, have been determined, the optimum solution of (θ32, Trγ) can be obtained by minimizing the following objective function:

\min_{\theta_{32}, T_{r\gamma}} \; \|p_{yw} - p_{yw}^{*}\| + m \arccos(d_w \cdot d_w^{*})    (7.7)
where m is a weight, and d_w ∈ R3 and p_yw are the estimated approaching direction and the wrist position along the y-axis, respectively, both expressed in the object reference frame. In the wrist position p_w = (p_xw, p_yw, p_zw) ∈ R3, p_xw and p_zw are left as free variables to compensate for the kinematic difference between the human hand and the robot hand. To avoid local minima, the objective function is minimized with the global search method "patternsearch," which can be called directly in Matlab, to obtain the global optimum of (θ32, Trγ).
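The following sketch shows how the objective of Eq. (7.7) can be minimized. The original implementation calls Matlab's patternsearch; here SciPy's differential_evolution is used as a stand-in global optimizer, the wrist_pose_fn argument is assumed to wrap Eq. (7.5) (for instance the wrist_pose function sketched in Sect. 7.3.2), and the bounds, the weight m, and the choice of the wrist frame's z-axis as the approaching direction are all assumptions.

```python
import numpy as np
from scipy.optimize import differential_evolution

def wrist_objective(x, wrist_pose_fn, p_yw_star, d_w_star, m=1.0):
    """Eq. (7.7): penalize the deviation of the wrist y-position from p*_yw
    and the angle between the approaching direction d_w and d*_w."""
    theta_32, T_rgamma = x
    O_T_w = wrist_pose_fn(theta_32, T_rgamma)   # 4x4 wrist pose in the object frame
    p_yw = O_T_w[1, 3]                          # wrist position along the y-axis
    d_w = O_T_w[:3, 2]                          # assumed approach axis of the wrist frame
    cos_ang = np.clip(d_w @ np.asarray(d_w_star), -1.0, 1.0)
    return abs(p_yw - p_yw_star) + m * np.arccos(cos_ang)

def optimize_wrist(wrist_pose_fn, p_yw_star, d_w_star,
                   bounds=((0.0, 2.4), (-np.pi, np.pi))):  # ranges are assumptions
    """Global search over (theta_32, T_rgamma), standing in for patternsearch."""
    result = differential_evolution(wrist_objective, bounds,
                                    args=(wrist_pose_fn, p_yw_star, d_w_star))
    return result.x, result.fun                 # optimal (theta_32, T_rgamma) and its cost
```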
The above approach for hand configuration has the following obvious advantages. Firstly, the searching space of the wrist position becomes much smaller than that in [11], which improves the efficiency of successful hand configuration. Figure 7.12 shows the searching space of the Barrett Hand (the black line): the wrist position p_w (the red point) varies along the approaching direction d_w (the green direction), where d_w = [0, 0, −1]. Secondly, the optimum hand configuration keeps the human grasp intention through key cues including the grasp point, the approaching direction, and the estimated position along the visual occlusion direction, while at the same time resolving the problem of the different kinematics of the human hand and the robot hand. Thirdly, the proposed approach copes with the stability evaluation problem for partial object models by integrating the human's stable grasp experience.

There are two issues when executing the optimization process with Eq. 7.7. The first one is defining a repeatable object reference frame on the partial object model. In this chapter, we suppose that familiar objects have similar but different shapes and thus can be grasped in similar ways. Then, when learning the grasp experience in the training stage, especially the grasp points and the approaching directions, all the training data are recorded under the object reference frames. Likewise, when testing the proposed approach on new familiar objects, the grasp planning is executed under the object reference frames as well. Therefore, the object reference frames built for training and testing should have relatively similar directions with regard to the objects' geometry; that is, the built object reference frames need to be repeatable on the tested new familiar objects. In this chapter, the PCA (Principal Component Analysis) method is first utilized to obtain the initial coordinate frame of the object. Then, a minimum bounding box is searched based on this initial coordinate frame to enclose the partial point cloud. For complex objects composed of two or more components, the minimum bounding box is constructed on the graspable component to build the object's grasp reference frame (the green point cloud of the brush in Fig. 7.13); the graspable component identification has been introduced in Sect. 7.3.1. The X, Y, and Z axes of the object frame are shown in Fig. 7.13 as the red, yellow, and black axes, respectively.
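A simplified sketch of this frame construction is given below: PCA of the (possibly component-segmented) partial cloud provides the initial axes, and a bounding box is then fitted in that frame. The real implementation searches for a minimum bounding box; fitting an axis-aligned box in the PCA frame, as done here, is a simplification, and the axis-ordering and handedness conventions are assumptions.

```python
import numpy as np

def pca_object_frame(points):
    """Initial object frame from PCA: axes are the principal directions of
    the partial point cloud, origin at its centroid."""
    centroid = points.mean(axis=0)
    cov = np.cov((points - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    axes = eigvecs[:, np.argsort(eigvals)[::-1]]   # columns: X, Y, Z by decreasing variance
    if np.linalg.det(axes) < 0:                    # keep a right-handed frame
        axes[:, 2] *= -1.0
    return centroid, axes

def bounding_box_in_frame(points, centroid, axes):
    """Bounding box of the cloud expressed in the PCA frame; returns the box
    center (in the original frame) and its extents along the frame axes."""
    local = (points - centroid) @ axes
    mins, maxs = local.min(axis=0), local.max(axis=0)
    center_local = (mins + maxs) / 2.0
    return centroid + axes @ center_local, maxs - mins
```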
Fig. 7.13 Grasp component identification and constraints estimation
Fig. 7.14 Learning grasps from human experience
Fig. 7.15 The approaching direction dw

The second issue is determining the desired wrist constraints p*_yw and d*_w. Figure 7.14 shows some examples of humans grasping objects, in which the red points represent the recorded contact points and the green arrows express the orientation of the human thumb. We learn the grasp points by training on the shape features of these recorded contact points (see Sect. 7.3.1). On the other hand, we have found that the orientation of the thumb is closely related to the approaching direction d_w of the robot hand; based on the Barrett Hand kinematics, Trβ is related to the approaching direction d_w. Figure 7.15 shows two examples: in the left figure, the purple arrow gives d*_w = [1, 0, 0] and Trβ = 0°, while in the right figure, d*_w = [0, 0, −1] and Trβ = 90° under the object coordinate frame. According to Eq. 7.2, we estimate d*_w by approximating the directions of the green arrows: reasonable direction vectors close to the green arrows are used to represent d*_w, taking into account the relative pose difference between the human upper limb and the robot manipulator.

For a single-view camera arrangement, the point cloud feedback is incomplete because of self-occlusion. This incompleteness makes the estimation of p*_yw
difficult along the visual occlusion direction. Here, we decide it in an empirical way, employing the center C_y of the minimum bounding box along the y-axis to estimate p*_yw; whether this approach works well therefore depends on the bounding box that has been built. The core idea is that if the camera feeds back the whole extent of the object along the y-axis, then C_y = (Pp,ymin + Pp,ymax)/2; otherwise, it is inferred that the visible object size along the y-axis is half of the whole size because of occlusion, and C_y = Pp,ymax, where (Pp,ymin, Pp,ymax) denote the minimum and maximum positions along the y-axis under the reference frame of the bounding box. Utilizing the surface normals in the visible point cloud of the object is one way to judge whether the full extent along the y-axis has been observed.
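This occlusion-aware rule can be written as a tiny helper; how the "full extent visible" test is derived from the surface normals is not detailed in the text, so it is passed in here as a boolean flag.

```python
def estimate_p_yw_star(pp_ymin, pp_ymax, full_extent_visible):
    """Empirical estimate of p*_yw from the bounding box along the y-axis:
    the box center if the camera sees the whole object extent in y,
    otherwise the maximum face, assuming half of the object is occluded."""
    if full_extent_visible:
        return (pp_ymin + pp_ymax) / 2.0
    return pp_ymax
```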
7.4 Experimental Results

7.4.1 Object Classification Based on Shape Descriptors

In our previous papers [28, 30], an object classification method was presented which categorizes objects by a shape feature modified from the SHOT descriptor. The shape feature first represents each point on the object by the SHOT descriptor and then reduces the dimensionality of these SHOT descriptors along the point direction. This algorithm has been tested independently on two publicly available datasets. The first contains complete point cloud data synthesized from the Princeton Shape Benchmark (PSB); the number of objects ranges from 6 to 21 over ten chosen categories. The second database is from the Large RGB-D dataset and contains partial point clouds recorded with a Kinect camera from objects on turntables. Out of the original 51 categories (object names), ten with dissimilar overall geometries are chosen (ball, coffee mug, food bag, food box, food jar, glue stick, plate, pliers, stapler, and water bottle). Six different objects are selected from each category; training point clouds are taken from four randomly selected objects, and the point clouds from the other two objects form the test set. The classification rates are 95.2% and 90.2% for the L-RGBD and PSB databases, respectively [28, 30].

This chapter employs this approach for classifying the objects displayed in Fig. 7.16, which cover seven shape categories: carrot, cup, bottle, brush, mug, cube box, and facial cleanser. Their 2D and 3D data are captured directly with a Kinect sensor. Within a category, the instances have similar but different shapes; five objects are used for training and two for testing. A comparison of different shape descriptors has been carried out in [28]. In these experiments, we build the shape features proposed in [28] from two SHOT descriptors, SHOT-1 and SHOT-2, whose dimensions (azimuth, elevation, radial divisions, and 11 bins in each division) are 594 (6, 3, 3, 11) and 352 (8, 2, 2, 11), respectively. Figures 7.17 and 7.18 display the confusion matrices for each. It can be seen that SHOT-1 achieves better classification results than SHOT-2: only the tested carrot is misclassified as a bottle.
Fig. 7.16 Objects used in the experiments
7.4.2 Comparisons of Shape Descriptors for Grasp Region Detection

Supposing the category label of the object has been correctly estimated, the corresponding grasp point identifier can be found from the label. This subsection evaluates the effectiveness of the local SHOT shape descriptor for discriminating between graspable and non-graspable regions. We first compare the
Fig. 7.17 Confusion matrix of the category test for SHOT-1
Fig. 7.18 Confusion matrix of the category test for SHOT-2
SHOT-1 descriptor with another popular shape descriptor, FPFH (Fast Point Feature Histograms), for grasp region recognition. The FPFH descriptor first computes the histogram of the three angles between a point p and its k-nearest neighbors, called the Simplified Point Feature Histogram (SPFH); then, for each point p, the SPFH values of its k neighbors are weighted by their distance to p to produce the FPFH at p [18]. The partial point clouds of the objects in Fig. 7.16 are used for training and testing the grasp regions. Figure 7.19 shows examples of the training data: the blue points are the objects' incomplete point clouds, each with fewer than 2000 points randomly selected from the original dataset, and the red points are the recorded grasp region for the thumb finger, whose centers are taken as the grasp points. For each category, three instances are used for training and one for testing. Labeling the recorded points (the red points in Fig. 7.19) as 1 and the other points as 0, the shape descriptors of those interest points are fed into the SVM for training. Table 7.1 reports the classification accuracies of the tenfold cross-validation using the SHOT and FPFH descriptors. The SHOT descriptor achieves obviously better results than FPFH except for the carrot category. The classification accuracy of FPFH (96.07%) is similar to that of SHOT
Fig. 7.19 Examples of the training data

Table 7.1 Comparison results between descriptors of SHOT and FPFH (Unit, %)

Database  Cup    Carrot  Cleanser  Box    Brush  Bottle  Mug
SHOT      97.18  96.48   96.46     96.82  96.36  97.56   98.32
FPFH      80.46  96.07   91.39     80.17  75.68  86.46   87.99
(96.48%) for the carrot category, since the geometric shapes of the carrots' point clouds are simpler than those of the other object categories. The representation dimensions of one point are 594 for SHOT and 33 for FPFH. The superior differentiation ability of the SHOT descriptor can therefore be seen most clearly on objects with complex shapes.

In order to clarify the effect of different SHOT descriptor dimensions, we compare four descriptors, SHOT-1, SHOT-2, SHOT-3, and SHOT-4, whose dimensions (azimuth, elevation, radial divisions, and 11 bins in each division) are 594 (6, 3, 3, 11), 352 (8, 2, 2, 11), 176 (4, 2, 2, 11), and 88 (2, 2, 2, 11), respectively. Five categories of objects (cup, carrot, cleanser, box, and bottle) are chosen for this experiment, and each category has 4 instances. Selecting each instance in turn as the testing data and the other three as the training data, four models are trained for each shape descriptor. Table 7.2 displays the experimental results, where the data are the mean Euclidean distances and standard deviations, over 10 test runs, between the estimated grasp points and the recorded grasp points.
Table 7.2 Comparison results of SHOT descriptors (Unit, mm)

Database  Cup          Carrot        Cleanser      Box           Bottle
SHOT-1    3.34 ± 1.61  21.50 ± 0.26  3.47 ± 0.80   36.86 ± 4.63  11.68 ± 4.39
          4.43 ± 0.25  18.72 ± 0.21  7.40 ± 1.10   9.65 ± 4.37   9.08 ± 3.28
          2.51 ± 0.28  7.97 ± 0.08   2.49 ± 1.06   6.98 ± 1.27   8.57 ± 1.70
          2.15 ± 0.46  15.38 ± 0.18  13.10 ± 1.92  6.36 ± 2.71   8.71 ± 1.04
SHOT-2    4.07 ± 1.48  3.86 ± 0.13   5.53 ± 1.57   33.76 ± 4.73  13.79 ± 6.46
          6.69 ± 0.88  7.13 ± 0.07   6.98 ± 0.64   9.80 ± 0.00   6.04 ± 1.63
          2.96 ± 0.52  7.02 ± 0.11   0.00 ± 0.00   12.70 ± 0.82  8.13 ± 1.70
          4.21 ± 0.50  13.00 ± 0.00  6.06 ± 0.75   6.40 ± 1.10   6.66 ± 2.29
SHOT-3    4.89 ± 1.34  2.05 ± 1.07   12.55 ± 5.10  40.59 ± 3.95  12.86 ± 3.81
          5.24 ± 0.11  13.71 ± 0.87  12.49 ± 7.13  9.80 ± 0.00   3.64 ± 1.31
          2.46 ± 0.49  11.03 ± 4.70  6.43 ± 2.87   7.56 ± 0.62   9.23 ± 4.02
          4.64 ± 0.38  13.30 ± 0.84  6.41 ± 1.02   6.20 ± 5.00   6.26 ± 1.92
SHOT-4    3.51 ± 1.76  10.13 ± 5.80  10.03 ± 7.12  43.97 ± 4.64  8.20 ± 2.93
          6.34 ± 0.53  8.28 ± 0.81   6.51 ± 0.52   9.80 ± 0.00   9.30 ± 2.18
          5.25 ± 2.48  11.80 ± 5.42  5.78 ± 0.70   10.15 ± 0.91  8.26 ± 1.61
          4.85 ± 0.24  15.23 ± 1.65  6.11 ± 1.58   24.78 ± 4.16  6.08 ± 1.94

Bold values show the best estimation results for each category
We can see from Table 7.2 that the best estimation results for each category (the boldface data) are not limited to one descriptor dimension: for the categories from left to right, the most suitable descriptors are SHOT-1, SHOT-2, SHOT-2, SHOT-1, and SHOT-2, respectively. In summary, for objects with different shapes, the SHOT descriptors with dimensions of 352 or more achieve higher success rates for grasp point identification than those with smaller dimensions, because more subdivisions around each interest point give a stronger shape description ability. Occasionally the estimation fails, for example, in the first line of SHOT-1 for the box category; this may be caused by the human labeling or by the small number of object instances used for training. Figure 7.20 shows the testing results using SHOT-1, in which the red triangles are the decided grasp points and the green points represent the estimated grasp regions. Since the labeled grasp points are point clusters (the red points in Fig. 7.19), most of the estimated green points (see Fig. 7.20) clump into clusters on the different objects. For the carrot category, however, some estimates are dispersed over the blue point cloud, which may be caused by slight differences in the relative position of the human labeling on the training objects. Another important way to evaluate an estimated grasp point is to verify whether it is effective for stable grasp planning, which is carried out in Sect. 7.4.3.
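The entries of Table 7.2 are mean Euclidean distances and standard deviations between estimated and recorded grasp points over repeated test runs; such a score can be computed as below (data loading and the repetition loop are omitted, and the argument shapes are assumptions).

```python
import numpy as np

def grasp_point_error(estimated_points, recorded_point):
    """Mean and standard deviation of the Euclidean distance between the
    grasp points estimated over several runs (n_runs x 3) and the single
    recorded grasp point (3,)."""
    d = np.linalg.norm(np.asarray(estimated_points) - np.asarray(recorded_point), axis=1)
    return d.mean(), d.std()
```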
Fig. 7.20 Testing results of different objects
Fig. 7.21 Various objects for grasping
7.4.3 Grasp Planning

Using the approach proposed in Sect. 7.3.2, this subsection performs hand configuration through simulations and experiments. Figure 7.21 shows the various objects used for grasping. The robot consists of a dexterous Barrett Hand BH8-280 and a 7-DOF SCHUNK arm. Note that the experiments are executed under the following two assumptions: (1) the table area where the objects are placed lies within the workspace of the robot arm; (2) the objects are not so large that the learned grasp points cannot be reached by the robot hand.
Fig. 7.22 Grasp planning by using the previous method
The hand configurations are compared with our previously developed grasp planning method in [27]. Figure 7.22 shows the results of the previous method for grasping a bottle and a gun; the point clouds in red are the recognized graspable components. The method in [27] extracts the desired three contact points from the graspable component such that they satisfy a set of stability and feasibility criteria; the optima of 10 variables (the wrist position and orientation and the joints of each finger) are then obtained by optimizing the inverse kinematics. This approach [27] has the disadvantages that the optimal hand configuration is often not as natural as a human grasp and that the optimization process is time-consuming and easily falls into a local minimum.

Compared with the previous method [27], the relations between the thumb direction and the wrist constraints are integrated into the grasp planning. Hence, the proposed method reduces the number of variables from 10 to 3, which greatly shortens the number of iterations and improves the success rate of the optimization. The hand configuration takes less than 1 s with Matlab code on a PC with an Intel(R) Core(TM) i5-3470 CPU and 4.00 GB RAM. The computation times of the proposed object classification and grasp point estimation methods depend on the number of 3D points of the objects; normally, if the point number is smaller than 2000, the computation time is less than 3 s on the same PC. Since these learning/recognition and planning processes are performed before the robot executes the grasp, the robot system is implementable in practice.

Figures 7.23 and 7.24 show the grasp planning results obtained with the approach proposed in this chapter, which builds the object coordinate frames on the incomplete point cloud and on one component of the incomplete point cloud, respectively. All of the partial point clouds of the objects are captured directly from a Kinect camera. Supposing that the objects have been classified into the correct categories, the category labels lead to the corresponding identifiers for recognizing the grasp points. From left to right, Figs. 7.23 and 7.24 display the RGB images, the identification results of the grasp points, the minimum bounding
Fig. 7.23 Grasp planning results by building the bounding boxes on the simple objects
boxes of the objects, and the hand configuration results. In the identification results of the grasp points (the second column), the green points denote the graspable regions identified after applying the spatial median filter, and the red points are the final recognized grasp points. They appear reasonable, and their usefulness is examined in the following grasp planning experiments. The third columns of Figs. 7.23 and 7.24 show the repeatable reference frames built on the incomplete point clouds using the minimum bounding boxes; the red, yellow, and black axes denote the X, Y, and Z coordinate axes, respectively. Notice that for the complex objects (composed of several components) in Fig. 7.24, the component identifiers (see Sect. 7.3.1)
Fig. 7.24 Grasp planning results by building the bounding boxes on the complex objects
first recognize the component for grasp planning by classifying the object's point cloud with SHOT descriptors; the object coordinates are then built on the identified components by using the minimum bounding boxes. Since the approximate approaching directions and the wrist constraints are all described under the object coordinate frame, the third columns in Figs. 7.23 and 7.24 are very important steps for successful grasps: if the object coordinates are not repeatable, the grasp planning cannot obtain acceptable results. Finally, the hand configuration is accomplished as shown in the fourth columns of Figs. 7.23 and 7.24 with the predicted "thumb" grasp points, the approximate approaching directions, and the wrist constraints. In the experiments, it is found that the hand configuration converges quickly to the optimum solution by applying a global search method (Matlab function: patternsearch, https://cn.mathworks.com/help/gads/patternsearch.html).

An evaluation approach of grasp stability has been presented in our previous paper [27], which estimates the stability by three grasp properties: (a) force closure FC, (b) grasp shape C, and (c) area of the grasp polygon A. A grasp is defined as acceptable if the grasp quality G is smaller than 2, and unacceptable otherwise. Define a triplet PN as the set of three
candidate contact points for the three fingers of the Barrett Hand together with the corresponding normal vectors at the three contact points on the object's surface. The grasp evaluation function is formulated as G = FC(PN) + wC(PN) + A(PN).
(a) If the grasp constructed by the three contact points is not a force-closure grasp (the details can be found in [27]), FC(PN) returns a large value greater than 10; otherwise, FC(PN) returns 1.
(b) The grasp shape depends on the number N of friction cone pairs that counter-overlap. If N = 1 or 2, C(PN) = 0; if N = 0, C(PN) = min(An), where An is the set of the angles of the triangle formed by the three points. It has been shown that the grasp shape is more likely to be stable when the number of counter-overlapping friction cone pairs is 1 or 2.
(c) A(PN) expresses that the force-closure triplet must be reachable by the fingertips. If Area(PN) lies between the minimum and maximum triangular areas, A(PN) returns 0; otherwise, A(PN) returns a large value greater than 10.
Therefore, G = FC(PN) + wC(PN) + A(PN) = 1 means FC(PN) = 1, C(PN) = 0, and A(PN) = 0, which is a reasonable grasp. Some results are larger than 1.0 since C(PN) is larger than 0 when N = 0.

In Fig. 7.23, some point clouds are reconstructed (the yellow point clouds) for the stability evaluation. Assuming these grasps adhere to the hard-finger contact model, Table 7.3 reports the grasp stability evaluation results obtained with the approach in [27]. "Bottle1," "Bottle2," "Brush1," and "Brush2" denote, from top to bottom, the first bottle, the second bottle, the first brush, and the second brush in Fig. 7.24, respectively. All the results are smaller than 2, which means the grasp planning results in Figs. 7.23 and 7.24 are stable grasps.

Table 7.3 Grasp stability evaluation

Objects in Fig. 7.23  Results    Objects in Fig. 7.24  Results
Carrot                1.0        Bottle1               1.0
Cup                   1.0        Mug                   1.0
Facial cleanser       1.0        Brush1                1.1
Box                   1.0        Bottle2               1.0
Pen                   1.0        Brush2                1.2

Fig. 7.25 Executing grasps by using the real robot

Figure 7.25 shows the experiments executed with the real robot, which consists of a BH8-280 Barrett Hand and a 7-DOF SCHUNK arm. In the real-time experiments, the robot first plans the wrist position and orientation with the proposed approach; the end effector of the manipulator then moves to the planned wrist posture; finally, the dexterous hand closes until it contacts the object and the grasp torque increases to a predetermined value. Most of the object categories in Figs. 7.23 and 7.24 have been tested for real-robot grasping. For each object, the grasping experiment is carried out 5 times, and a trial is counted as a successful grasp if the robot grasps the object stably, i.e., the robot picks up the object with a certain grasp force and the grasped object keeps still (no rotation, dropping, or slipping). The results show that 90% of the tested experiments produce successful grasps. Grasping the orange pen (see Fig. 7.23) fails due to inadequate calibration accuracy and the limited degrees of freedom of the Barrett Hand; a pinch grasp is more suitable for objects with a very thin width, such as pens. In a word, the proposed approach is effective for grasp planning by learning human experience and utilizing the objects' shape affordance.
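The stability score used above, G = FC(PN) + wC(PN) + A(PN), can be sketched as follows. The force-closure test and the counting of counter-overlapping friction-cone pairs depend on the contact model detailed in [27] and are therefore left as caller-supplied placeholder functions; the triangle-shape and area terms are implemented directly, and the penalty value and weight w are illustrative.

```python
import numpy as np

def triangle_angles(p1, p2, p3):
    """Interior angles (radians) of the triangle formed by three contact points."""
    a = np.linalg.norm(p2 - p3)
    b = np.linalg.norm(p1 - p3)
    c = np.linalg.norm(p1 - p2)
    A = np.arccos(np.clip((b**2 + c**2 - a**2) / (2 * b * c), -1.0, 1.0))
    B = np.arccos(np.clip((a**2 + c**2 - b**2) / (2 * a * c), -1.0, 1.0))
    return np.array([A, B, np.pi - A - B])

def grasp_quality(contacts, normals, is_force_closure, count_cone_pairs,
                  area_min, area_max, w=1.0, penalty=10.0):
    """G = FC(PN) + w*C(PN) + A(PN); a grasp is accepted when G < 2."""
    p1, p2, p3 = contacts
    # (a) force closure: 1 if the triplet is force closure, otherwise a large penalty
    FC = 1.0 if is_force_closure(contacts, normals) else penalty
    # (b) grasp shape: 0 if 1 or 2 friction-cone pairs counter-overlap,
    #     otherwise the smallest angle of the contact triangle
    n_pairs = count_cone_pairs(contacts, normals)
    C = 0.0 if n_pairs in (1, 2) else float(triangle_angles(p1, p2, p3).min())
    # (c) reachability: 0 if the triangle area lies in the feasible range
    area = 0.5 * np.linalg.norm(np.cross(p2 - p1, p3 - p1))
    A = 0.0 if area_min <= area <= area_max else penalty
    return FC + w * C + A
```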
7.5 Summary

Transferring grasp experience among objects of the same category is a reasonable way for robots to autonomously perform grasps on objects of various shapes. In this chapter, an effective method is presented for grasp planning on familiar objects with incomplete point clouds. Firstly, we present a grasp point learning method employing the 3D SHOT descriptor and a sparse set of example objects. Under the assumption that objects with similar shapes have similar relative grasp positions, this approach identifies the grasp points of objects with multiple kinds of shapes and achieves grasp point generalization among familiar objects. Then, by integrating key grasp strategies from human experience, a hand configuration method is developed by optimizing the inverse kinematics of the "thumb" finger; here, the key grasp experience consists of the position and orientation of the human thumb and the wrist. This approach has several advantages, including (1) greatly reducing the search space of the grasp configuration, which improves the success rate of hand configuration and saves computation time, and (2) building a mapping from human hands to robot hands with a loose grasp constraint that accounts for the kinematic difference between humans and robots. Last
but not least, the presented method can be used for grasping objects represented only by partial point clouds. In our future work, the manipulative requirements of the tasks will be considered for selecting grasps intelligently. Moreover, the manipulation motion [31] is also quite important for planning a robotic grasp: the grasp should enable the manipulator to carry out the task efficiently with little motion effort. Additionally, more modalities, e.g., tactile sensing, will be integrated for object categorization.
References

1. Sun Y, Ren S, Lin Y (2013) Object-object interaction affordance learning. Robot Auton Syst 62:487–496. https://doi.org/10.1016/j.robot.2013.12.005
2. Detry R, Piater J (2013) Unsupervised learning of predictive parts for cross-object grasp transfer. In: Proceedings of the IEEE/RSJ international conference on intelligent robots and systems, pp 1720–1727. https://doi.org/10.1109/IROS.2013.6696581
3. Li M, Bekiroglu Y, Kragic D, Billard A (2014) Learning of grasp adaptation through experience and tactile sensing. In: IEEE international conference on intelligent robots and systems, pp 3339–3346. https://doi.org/10.1109/IROS.2014.6943027
4. Steffen J, Haschke R, Ritter H (2007) Experience-based and tactile-driven dynamic grasp control. In: IEEE international conference on intelligent robots and systems, pp 2938–2943. https://doi.org/10.1109/IROS.2007.4398960
5. Welschehold T, Dornhege C, Burgard W (2016) Learning manipulation actions from human demonstrations. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 3772–3777. https://doi.org/10.1109/IROS.2016.7759555
6. Geng T, Lee M, Huelse M (2011) Transferring human grasping synergies to a robot. Mechatronics 21:272–284. https://doi.org/10.1016/j.mechatronics.2010.11.003
7. Faria D, Trindade P, Lobo J, Dias J (2014) Knowledge-based reasoning from human grasp demonstrations for robot grasp synthesis. Robot Auton Syst 62:794–817. https://doi.org/10.1016/j.robot.2014.02.003
8. Hillenbrand U, Roa M (2012) Transferring functional grasps through contact warping and local replanning. In: IEEE international conference on intelligent robots and systems. https://doi.org/10.1109/IROS.2012.6385989
9. Gioioso G, Salvietti G, Malvezzi M, Prattichizzo D (2013) Mapping synergies from human to robotic hands with dissimilar kinematics: an approach in the object domain. IEEE Trans Robot 29:825–837. https://doi.org/10.1109/TRO.2013.2252251
10. Cutkosky M (1989) On grasp choice, grasp models, and the design of hands for manufacturing tasks. IEEE Trans Robot Autom 5:269–279. https://doi.org/10.1109/70.34763
11. Lin Y, Sun Y (2015) Robot grasp planning based on demonstrated grasp strategies. Int J Robot Res 34:26–42. https://doi.org/10.1177/0278364914555544
12. Saxena A, Driemeyer J, Ng A (2008) Robotic grasping of novel objects using vision. Int J Robot Res 27:157–173. https://doi.org/10.1177/0278364907087172
13. Redmon J, Angelova A (2014) Real-time grasp detection using convolutional neural networks. In: 2015 IEEE international conference on robotics and automation. https://doi.org/10.1109/ICRA.2015.7139361
14. Gualtieri M, Pas A, Saenko K, Platt R (2016) High precision grasp pose detection in dense clutter. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 598–605. https://doi.org/10.1109/IROS.2016.7759114
15. Sundberg M, Litwinczyk W, Grimm C, Balasubramanian R (2016) Visual cues used to evaluate grasps from images. In: IEEE international conference on robotics and automation (ICRA), pp 1965–1971. https://doi.org/10.1109/ICRA.2016.7487343
16. Pinto L, Gupta A (2016) Supersizing self-supervision: learning to grasp from 50K tries and 700 robot hours. In: IEEE international conference on robotics and automation (ICRA), pp 3406–3413. https://doi.org/10.1109/ICRA.2016.7487517
17. El-Khoury S, Sahbani A (2010) A new strategy combining empirical and analytical approaches for grasping unknown 3D objects. Robot Auton Syst 58:497–507. https://doi.org/10.1016/j.robot.2010.01.008
18. Aldoma A, Marton Z, Tombari F, Wohlkinger W, Potthast C, Zeisl B, Rusu R, Gedikli S, Vincze M (2012) Tutorial: point cloud library: three-dimensional object recognition and 6 DOF pose estimation. IEEE Robot Autom Mag 19:80–91. https://doi.org/10.1109/MRA.2012.2206675
19. Bohg J, Johnson-Roberson M, León B, Felip J, Gratal X, Bergström N, Kragic D, Morales A (2011) Mind the gap - robotic grasping under incomplete observation. In: IEEE international conference on robotics and automation, pp 686–693. https://doi.org/10.1109/ICRA.2011.5980354
20. Kroemer O, Ben Amor H, Ewerton M, Peters J (2012) Point cloud completion using extrusions. In: IEEE-RAS international conference on humanoid robots, pp 680–685. https://doi.org/10.1109/HUMANOIDS.2012.6651593
21. Ben Amor H, Kroemer O, Hillenbrand U, Neumann G, Peters J (2012) Generalization of human grasping for multi-fingered robot hands. In: IEEE/RSJ international conference on intelligent robots and systems. https://doi.org/10.1109/IROS.2012.6386072
22. Nikandrova E, Kyrki V (2015) Category-based task specific grasping. Robot Auton Syst 30:25–35. https://doi.org/10.1016/j.robot.2015.04.002
23. Varley J, Weisz J, Weiss J, Allen P (2015) Generating multi-fingered robotic grasps via deep learning. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 4415–4420. https://doi.org/10.1109/IROS.2015.7354004
24. Faria D, Martins R, Lobo J, Dias J (2012) Extracting data from human manipulation of objects towards improving autonomous robotic grasping. Robot Auton Syst 60:396–410. https://doi.org/10.1016/j.robot.2011.07.020
25. Fang B, Guo D, Sun F, Liu H, Wu Y (2015) A robotic hand-arm teleoperation system using human arm/hand with a novel data glove. In: IEEE international conference on robotics and biomimetics (ROBIO), pp 2483–2488. https://doi.org/10.1109/ROBIO.2015.7419712
26. Alexandre L (2012) 3D descriptors for object and category recognition: a comparative evaluation. In: IEEE/RSJ international conference on robotics and automation
27. Liu C, Li W, Sun F, Zhang J (2015) Grasp planning by human experience on a variety of objects with complex geometry. In: IEEE/RSJ international conference on intelligent robots and systems (IROS), pp 511–517. https://doi.org/10.1109/IROS.2015.7353420
28. Liu C, Sun F, Ban X (2016) An effective method for grasp planning on objects with complex geometry combining human experience and analytical approach. Sci China Inf Sci 59:112212. https://doi.org/10.1007/s11432-015-0463-9
29. Breckon T, Fisher R (2005) Amodal volume completion: 3D visual completion. Comput Vis Image Und 99:499–526. https://doi.org/10.1016/j.cviu.2005.05.002
30. Sun F, Liu C, Huang W, Zhang J (2016) Object classification and grasp planning using visual and tactile sensing. IEEE Trans Syst Man Cyb Syst 46:1–11. https://doi.org/10.1109/TSMC.2016.2524059
31. Lin Y, Sun Y (2015) Task-based grasp quality measures for grasp synthesis. In: IEEE/RSJ international conference on intelligent robots and systems. https://doi.org/10.1109/IROS.2015.7353416
Part IV
Conclusions
This part of the book comprises one chapter, which summarizes the main work of this book and presents some important future directions.
Chapter 8
Conclusions
This book comprehensively introduces the developed wearable devices and robotic manipulation learning from their demonstrations. Under the developed unified framework of wearable technologies, the following wearable demonstration and manipulation learning problems are systematically addressed.

1. Motion Capture. Wearable devices often need multiple sensors to capture human operation information, which brings challenges such as wearable device design, sensor calibration, and fusion of information from different sensors. In Chaps. 2 and 3, inertial sensors are utilized to tackle this problem. The developed wearable devices are elegant, flexible, and effective in addressing the hand-arm motion capture problem, whereas previous work usually considers hand motion capture only.

2. Gesture Recognition. Gesture recognition provides an intelligent, natural, and convenient way for human–robot interaction. In Chap. 4, a multi-modal dataset of gestures is built and machine learning methods are developed, showing that the wearable device is suitable for recognizing fine gestures.

3. Robotic Manipulation Learning. With the capability of capturing human manipulation information and providing high quality demonstrations for robotic imitation, wearable devices are applied to robots to acquire manipulation skills. Robotic manipulation learning methods from teleoperation demonstrations based on wearable devices and on vision are developed in Chaps. 5 and 6, respectively. Furthermore, a learning method from indirect demonstrations is presented in Chap. 7.

In summary, robotic manipulation learning has become a key technology in the field of robotics and artificial intelligence. However, the current research in this area still faces many challenges, which are summarized as follows:

(1) Demonstration: Although great progress has been made in teaching via wearable devices, most teleoperation teaching only considers the position and posture, and most of it addresses the mechanical arms or the end effectors, lacking information about the overall operation of the hand-arm system. For a
multi-DoF humanoid manipulator, it is necessary to consider the operational configuration, position, attitude, and the dexterous hand's operating force, which leads to tactile teaching. How to integrate multi-modal teaching methods to achieve high quality demonstrations remains a challenging issue.

(2) Representation: Using the demonstration to characterize the learning state and operational intent of the teacher is an important step in imitation learning. Most current research focuses on trajectory or visual representations. Although the visual and tactile representations in our teaching samples can provide more information for imitation learning, how to exploit the correlation of visual and tactile information with robotic operations and how to learn a representation of multi-modal information are very important issues in practical applications. This is not only the cornerstone of operational representation but also an important direction for future multi-modal teaching information representation.

(3) Learning: Existing imitative operation learning has a low utilization rate of teaching samples and cannot achieve efficient strategy learning. At the same time, imitation learning algorithms are sensitive to multi-modal characteristics, locality of the operational space, and small samples, which poses a great challenge to the generalization of imitation operations. How to design an efficient robotic imitation learning framework is still an open problem in robot learning.

In general, wearable technology provides a more efficient and higher quality way for robotic manipulation and learning. There are still many challenging academic problems in this field, and it is necessary to carry out in-depth exploration and analysis from the perspectives of signal processing, machine learning, and robot learning theory.