Intelligent Systems Reference Library 200
Md Atiqur Rahman Ahad Upal Mahbub Tauhidur Rahman Editors
Contactless Human Activity Analysis
Intelligent Systems Reference Library Volume 200
Series Editors Janusz Kacprzyk, Polish Academy of Sciences, Warsaw, Poland Lakhmi C. Jain, KES International, Shoreham-by-Sea, UK
The aim of this series is to publish a Reference Library, including novel advances and developments in all aspects of Intelligent Systems in an easily accessible and well structured form. The series includes reference works, handbooks, compendia, textbooks, well-structured monographs, dictionaries, and encyclopedias. It contains well integrated knowledge and current information in the field of Intelligent Systems. The series covers the theory, applications, and design methods of Intelligent Systems. Virtually all disciplines such as engineering, computer science, avionics, business, e-commerce, environment, healthcare, physics and life science are included. The list of topics spans all the areas of modern intelligent systems such as: Ambient intelligence, Computational intelligence, Social intelligence, Computational neuroscience, Artificial life, Virtual society, Cognitive systems, DNA and immunity-based systems, e-Learning and teaching, Human-centred computing and Machine ethics, Intelligent control, Intelligent data analysis, Knowledge-based paradigms, Knowledge management, Intelligent agents, Intelligent decision making, Intelligent network security, Interactive entertainment, Learning paradigms, Recommender systems, Robotics and Mechatronics including human-machine teaming, Self-organizing and adaptive systems, Soft computing including Neural systems, Fuzzy systems, Evolutionary computing and the Fusion of these paradigms, Perception and Vision, Web intelligence and Multimedia. Indexed by SCOPUS, DBLP, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/8578
Md Atiqur Rahman Ahad · Upal Mahbub · Tauhidur Rahman Editors
Contactless Human Activity Analysis
Editors Md Atiqur Rahman Ahad Department of Electrical and Electronic Engineering University of Dhaka Dhaka, Bangladesh
Upal Mahbub Multimedia R&D and Standards Qualcomm Technologies Inc. San Diego, CA, USA
Department of Media Intelligent Osaka University Osaka, Japan Tauhidur Rahman College of Information and Computer Sciences University of Massachusetts Amherst Amherst, MA, USA
ISSN 1868-4394 ISSN 1868-4408 (electronic) Intelligent Systems Reference Library ISBN 978-3-030-68589-8 ISBN 978-3-030-68590-4 (eBook) https://doi.org/10.1007/978-3-030-68590-4 © Springer Nature Switzerland AG 2021 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Foreword by Matthew Turk
In the midst of a global pandemic in late 2020, it is clearer than ever that systems based on contactless sensing are becoming increasingly important. This volume, edited by Md Atiqur Rahman Ahad, Upal Mahbub, and Tauhidur Rahman, provides a timely overview of state-of-the-art research in the area, focusing on challenges and advances in sensing, signal processing, computer vision, and mathematical modeling, and addressing a number of interesting application areas, including health analysis. The chapters provide both breadth and technical depth that can be of great interest and use to students, researchers, engineers, and managers wanting to better understand what is possible—currently and in the near future. Along with technical advances and opportunities come questions of privacy and various ethical issues associated with sensing and analyzing human activity. It is encouraging that several of the papers raise these kinds of questions to consider, as questions of when, how, where, and if to deploy such technologies will be important. The authors are well qualified to provide this instruction and insight, and they also display the international nature of efforts in this area that has made much progress over the years. The readers will have much to gain by understanding the methods and approaches taken, the experiments reported, and the conclusions drawn. Chicago, IL, USA November 2020
Matthew Turk
Comments from Experts
The book, Contactless Human Activity Analysis, edited by Prof. Md Atiqur Rahman Ahad, Dr. Upal Mahbub and Prof. Tauhidur Rahman, provides a comprehensive overview of recent research and developments in an important field. As methods from artificial intelligence are becoming pervasive, this book, which covers machine learning, computer vision and deep learning methods based on multi-modal sensing for multi-domain applications, including surveillance and elderly care, will be an asset to entry-level and practicing engineers and scientists. The editors and authors of all the chapters have done a terrific job of covering the vast amount of material in a lucid style.
Rama Chellappa, IEEE Fellow, IAPR Fellow, OSA Fellow, AAAS Fellow, ACM Fellow, AAAI Fellow; Golden Core Member, IEEE Computer Society; Bloomberg Distinguished Professor, Johns Hopkins University, USA

An exciting array of multimodal sensor-based approaches to human behavior understanding and intervention. State of the art and current challenges are well represented and will be of keen interest to researchers at all levels.
Jeffrey Cohn, AAAC Fellow; Chair, Steering Committee for the IEEE Conf. on Automatic Face and Gesture Recognition; University of Pittsburgh, USA

As editors, Ahad, Mahbub and Rahman have assembled in this volume a timely assessment of state-of-the-art contactless sensing in biomedical and healthcare applications. In many ways, the COVID-19 pandemic has provided incentives to adapt to new technological norms as far as different aspects of human activity analysis and desired outcomes (including emphasis on both privacy and ethical issues) are concerned. The chapters in this volume cover sufficient grounds and depth in related challenges and advances in sensing, signal processing, computer vision, and mathematical modeling.
Mohammad Karim, IEEE Fellow, OSA Fellow, SPIE Fellow, IPIET Fellow; University of Massachusetts Dartmouth, USA
A truly comprehensive, timely and very much needed treatise on the conceptualization, analysis, and design of contactless human activity analysis. The reader can enjoy a carefully organized and authoritative exposure to the wealth of concepts and algorithms deeply rooted in the recent advancements of intelligent technologies, advanced sensors, and vision and signal processing techniques.
Witold Pedrycz, IEEE Fellow; Canada Research Chair, University of Alberta, Canada

It is great to see the enormous range of techniques and applications that now exist for human activity analysis. In the early days, it was difficult to imagine that we could reach this position, and with continuing advances and technological development we are now in a position where this technology can greatly assist lifestyle, activity, and the analysis of health.
Mark Nixon, IET Fellow, IAPR Fellow, BMVA Distinguished Fellow 2015; University of Southampton, UK

A great collection of chapters that report state-of-the-art research in contactless human activity recognition with literature reviews and discussion of challenges. From an interaction design perspective, the book provides views and methods that allow for more safe, trustworthy, efficient, and more natural interaction with technology that will be embedded in our daily living environments.
Anton Nijholt, Specialty Chief Editor, Frontiers in Human-Media Interaction; University of Twente, The Netherlands

This book delves into an area that combines human activity recognition research with diverse sensors embedded into the user's surroundings, rather than on the body: an exciting and fast-growing field!
Kristof Van Laerhoven, General Chair, ACM Int. Joint Conf. on Pervasive and Ubiquitous Computing (UbiComp) 2020; University of Siegen, Germany

In a world driven by touch interaction, finding a book focused on contactless systems and interaction is a breath of fresh air. Highly recommended to design toward a future beyond the screens that dominate our lives.
Denzil S. T. Ferreira, Associate Editor, PACM IMWUT; Associate Editor, Frontiers in Human-Media Interaction; University of Oulu, Finland

Contactless monitoring is critical to ensure more transparent, healthy and convenient human-computer interactions. There are many challenges to face in the next decade, including technical, ethical, and legal concerns. However, the advances in this field will provide key enabling technologies to change the way we monitor and understand our society. This book analyses some of the most important applications of contactless technologies, including activity recognition, behavioural analysis, and affective interfaces, among others. The book presents the latest advances as well as fundamentals, challenges, and future directions in these important topics.
Aythami Morales Moreno, Universidad Autónoma de Madrid, Spain
A very detailed look into multi-modal activity recognition. This book gives a history of the field from different perspectives while also giving ample attention to the state of the art. The emphasis on assistive technology is refreshing. The book highlights the ways in which activity recognition can be used in smart and connected homes to help the elderly and those with dementia. Emily M. Hand, University of Nevada, Reno, USA
Preface
In recent years, contactless sensors have gained popularity for physiological signal detection, leading to novel innovations in human activity and behavior analysis, authentication, scene analysis, anomaly detection, and many other related tasks. The tremendous advancements in high-performance sensors and the rapid growth in computation power have made it possible to acquire and analyze contactless sensor data for real-time applications. The field has also flourished with a multitude of techniques and methods for improved efficiency, higher accuracy, and faster computation in recent times. This book provides a comprehensive overview of the past and current research works associated with different contactless sensors focusing on human activity analysis, including contemporary techniques. It also sheds light on advanced application areas and futuristic research topics such as contactless activity analysis in health care and activity recognition in various devices. It is expected that the systematic study of the state-of-the-art techniques covered in this book will help to further our understanding of the current state and future potential of contactless human activity research.

In this book, we have covered 12 chapters from 15 organizations. The authors are from Australia, Bangladesh, China, Japan, the UK, and the USA. The multi-stage review process spanned over six months to ensure the quality of the chapters. The chapters cover both imaging and non-imaging sensor-based approaches, skeleton-based action modeling, computational cameras, vital sign issues, fall detection, emotion, dementia, and other issues. The book concludes with a chapter covering the most notable challenges and future scopes related to these issues.

We are thankful to the reviewers for their excellent cooperation. Our sincere gratitude goes to the authors, who did a tremendous job preparing and submitting manuscripts for this book. We are glad that the call for book chapters was well received by the research community and we had a good number of proposals. Like any other publication, we had to reject some promising chapters to ensure better quality and coherence of the book. We would also like to thank the Springer Nature Switzerland publication teams for working alongside us to ensure a high-quality publication.
We especially thank the following experts for the enormous time they spent going through the book and for their valuable feedback: Rama Chellappa (Bloomberg Distinguished Professor, Johns Hopkins University, USA), Jeffrey Cohn (University of Pittsburgh, USA), Mohammad Karim (University of Massachusetts Dartmouth, USA), Witold Pedrycz (University of Alberta, Canada), Mark Nixon (University of Southampton, UK), Anton Nijholt (University of Twente, The Netherlands), Kristof Van Laerhoven (University of Siegen, Germany), Denzil S. T. Ferreira (University of Oulu, Finland), Aythami Morales Moreno (Universidad Autónoma de Madrid, Spain), and Emily M. Hand (University of Nevada, Reno, USA).

We are confident that this book will help many researchers, students, and practitioners in academia and industry to explore more research areas in the domain of contactless human activity analysis.

Md Atiqur Rahman Ahad, Ph.D., SMIEEE
Professor, Department of Electrical and Electronic Engineering, University of Dhaka, Dhaka, Bangladesh; Specially Appointed Associate Professor, Department of Media Intelligent, Osaka University, Osaka, Japan
[email protected]
http://ahadvisionlab.com

Upal Mahbub, Ph.D., SMIEEE
Senior Engineer, Qualcomm Technologies Inc., San Diego, California, USA
[email protected]
https://www.linkedin.com/in/upal-mahbub-43835070/

Tauhidur Rahman, Ph.D.
Assistant Professor, University of Massachusetts Amherst, Amherst, USA
[email protected]
http://mosaic.cs.umass.edu
Contents
1 Vision-Based Human Activity Recognition (Tahmida Mahmud and Mahmudul Hasan)
  1.1 Introduction; 1.1.1 Overview of Vision-Based Human Activity Recognition; 1.2 Challenges in Vision-Based Human Activity Recognition; 1.3 Human Activity Recognition Approaches; 1.3.1 Handcrafted Feature-Based Human Activity Recognition; 1.3.2 Deep Feature-Based Human Activity Recognition; 1.4 Beyond Individual Activity Recognition from Videos; 1.4.1 Human Activity Recognition Versus Detection; 1.4.2 Activity Recognition from Images; 1.4.3 Group and Contextual Activity Recognition; 1.4.4 Activity Prediction; 1.5 Benchmark Datasets and Performance Evaluation Criteria; 1.5.1 Benchmark Datasets; 1.5.2 Performance Evaluation Criteria; 1.6 Applications of Human Activity Recognition; 1.6.1 Surveillance System; 1.6.2 Healthcare System and Assisted Living; 1.6.3 Entertainment; 1.6.4 Autonomous Driving; 1.6.5 Human-Robot Interaction (HRI); 1.7 Future Research Directions; 1.7.1 Zero-Shot/Few-Shot Learning; 1.7.2 Computational Efficiency; 1.7.3 Analysis of Temporal Context; 1.7.4 Egocentric Activity Recognition; 1.7.5 Fine-Grained Activity Recognition; 1.8 Conclusions; References

2 Skeleton-Based Activity Recognition: Preprocessing and Approaches (Sujan Sarker, Sejuti Rahman, Tonmoy Hossain, Syeda Faiza Ahmed, Lafifa Jamal, and Md Atiqur Rahman Ahad)
  2.1 Introduction; 2.2 Skeleton Tracking; 2.2.1 Depth Sensors and SDKs; 2.2.2 OpenPose Toolbox; 2.2.3 Body Marker; 2.3 Representation; 2.3.1 Skeleton Representation; 2.3.2 Action Representation; 2.4 Preprocessing Techniques; 2.4.1 Coordinate System Transformation; 2.4.2 Skeleton Normalization; 2.4.3 Skeleton Data Compression; 2.4.4 Data Reorganization; 2.5 Feature Extraction; 2.5.1 Hand-Crafted Features; 2.5.2 Deep Learning Features; 2.6 Recognition Method; 2.6.1 Machine Learning Based Recognition; 2.6.2 Neural Network Based Recognition; 2.6.3 Graph Based Recognition; 2.6.4 Unsupervised Action Recognition; 2.6.5 Spatial Reasoning and Temporal Stack Learning; 2.6.6 Reinforcement Learning; 2.6.7 Convolutional Sequence Generation; 2.6.8 Attention Network; 2.7 Performance Measures; 2.8 Performance Analysis; 2.9 Challenges in Skeleton-Based Activity Recognition; 2.10 Conclusion; References

3 Contactless Human Activity Analysis: An Overview of Different Modalities (Farhan Fuad Abir, Md. Ahasan Atick Faisal, Omar Shahid, and Mosabber Uddin Ahmed)
  3.1 Introduction; 3.2 Historical Overview; 3.3 Primary Techniques; 3.3.1 RF-Based Approaches; 3.3.2 Sound-Based Approaches; 3.3.3 Vision-Based Approach; 3.4 Frequent Challenges; 3.5 Application Domain; 3.6 Conclusion; References

4 Signal Processing for Contactless Monitoring (Mohammad Saad Billah, Md Atiqur Rahman Ahad, and Upal Mahbub)
  4.1 Introduction; 4.2 Activity Signals Sampling and Windowing Techniques; 4.2.1 Applications of Signal Sampling; 4.2.2 Impact of Signal Windowing on Activity Analysis; 4.3 Time and Frequency Domain Processing for Contactless Monitoring; 4.3.1 Applications of Frequency Domain Transforms; 4.3.2 Time and Frequency Domain Filtering; 4.4 Feature Extraction; 4.4.1 Local Descriptors; 4.4.2 Global Descriptors; 4.5 Dimensionality Reduction Methods; 4.6 Conclusion; References

5 Fresnel Zone Based Theories for Contactless Sensing (Daqing Zhang, Fusang Zhang, Dan Wu, Jie Xiong, and Kai Niu)
  5.1 Introduction; 5.2 Background of Fresnel Zones; 5.3 Fresnel Zone Diffraction Sensing Model and Applications; 5.3.1 Fresnel Zone Diffraction Sensing Model; 5.3.2 One-Side Diffraction Model; 5.3.3 Double-Side Diffraction Model; 5.3.4 Model Verification; 5.3.5 Characterizing Signal Variation Caused by Subtle Movements; 5.3.6 Characterizing Signal Variation Caused by Large Movements; 5.3.7 Application Design; 5.4 Fresnel Zone Reflection Sensing Model and Applications; 5.4.1 Signal Propagation in the Air; 5.4.2 Fresnel Zone Reflection Model; 5.4.3 Characterizing Signal Variation Caused by Movements; 5.4.4 Application Design; 5.5 Conclusion; References

6 Computational Imaging for Human Activity Analysis (Suren Jayasuriya)
  6.1 Introduction; 6.2 Low-Level Physiological Sensing for Vitals Monitoring and Skin Imaging; 6.2.1 Vitals Monitoring; 6.2.2 Skin Imaging; 6.3 Computational Cameras for Motion Sensing; 6.3.1 Event-Based Cameras; 6.3.2 Coded Exposure; 6.4 Privacy-Preserving Cameras; 6.4.1 Lensless Cameras; 6.4.2 Privacy-Preserving Optics; 6.5 Activity Analysis for Non-Line-of-Sight Imaging; 6.6 Future Research Directions and Challenges; 6.6.1 Challenges and Opportunities; References

7 Location Independent Vital Sign Monitoring and Gesture Recognition Using Wi-Fi (Daqing Zhang, Kai Niu, Jie Xiong, Fusang Zhang, and Shengjie Li)
  7.1 Introduction; 7.2 The Basics of Wi-Fi Sensing; 7.3 Location Dependence Issue in Small-Scale Respiration Sensing; 7.3.1 The Factors Affecting the Performance of Respiration Sensing; 7.3.2 The Effect of Target Location; 7.3.3 The Effect of Target Orientation; 7.4 Location Dependence Issue in Large-Scale Activity Sensing; 7.4.1 The Factors Affecting Large-Scale Activity Sensing; 7.4.2 The Effect of Target Location; 7.4.3 The Effect of Motion Orientation; 7.5 Improving the Performance of Respiration Sensing with Virtual Multipath; 7.6 Applying Multiple Views to Achieve Location Independent Gesture Recognition; 7.7 Conclusion; References

8 Contactless Fall Detection for the Elderly (M. Jaber Al Nahian, Mehedi Hasan Raju, Zarin Tasnim, Mufti Mahmud, Md Atiqur Rahman Ahad, and M Shamim Kaiser)
  8.1 Introduction; 8.2 Contactless Fall Detection Approaches; 8.2.1 Vision-Based Fall Detection Approaches; 8.2.2 Radar Sensor-Based Fall Detection Approaches; 8.2.3 Radio Frequency Sensing Technology-Based Fall Detection Approaches; 8.2.4 Acoustic Sensor-Based Fall Detection Approaches; 8.2.5 Floor Sensor-Based Fall Detection Approaches; 8.3 AI Algorithms for Contactless Fall Detection Approaches; 8.4 Publicly Available Datasets on Contactless Fall Detection Approaches; 8.5 Discussion and Future Research Directions; 8.6 Conclusion; References

9 Contactless Human Emotion Analysis Across Different Modalities (Nazmun Nahid, Arafat Rahman, and Md Atiqur Rahman Ahad)
  9.1 Introduction; 9.2 Modalities Used for Emotion Analysis; 9.2.1 Facial Emotion Recognition (FER); 9.2.2 Speech Emotion Recognition (SER); 9.2.3 Contactless Physiological Signal Based Emotion Recognition; 9.2.4 Multimodal Emotion Analysis; 9.3 Challenges; 9.3.1 Challenges of FER; 9.3.2 Challenges of SER; 9.3.3 Challenges of Contactless Physiological Signal Based Emotion Recognition; 9.3.4 Challenges of Multimodal Emotion Recognition; 9.4 Conclusion; References

10 Activity Recognition for Assisting People with Dementia (Muhammad Fikry, Defry Hamdhana, Paula Lago, and Sozo Inoue)
  10.1 Introduction; 10.2 Review Method; 10.3 Overview of Dementia; 10.3.1 Types of Dementia; 10.3.2 Causes of Dementia; 10.3.3 Symptoms of Dementia; 10.3.4 Monitoring of Dementia; 10.4 Human Activity Recognition for Dementia; 10.4.1 HAR Applications Related to Dementia; 10.4.2 HAR Systems and Sensors Used to Support People with Dementia; 10.4.3 Methods and Algorithms for Dementia Support Applications; 10.4.4 Datasets Used in Dementia Support Activity Recognition Systems; 10.5 Discussion; 10.6 Conclusions; References

11 Making Tangible the Intangible Gestures of Craft (Patricia J. Flanagan)
  11.1 Introduction; 11.2 Background; 11.3 Challenges and Opportunities; 11.4 Mobility, Movement, Real-Time Systems; 11.5 Hybrid Design; 11.6 Traditional Approaches to CHAA; 11.6.1 Contact-Based Sensing; 11.6.2 Contactless Sensing; 11.6.3 Hybrid Approaches; 11.7 Methodology; 11.7.1 Systematic Literature Review of CHAA and Craft-Based HCI; 11.7.2 Objectives; 11.7.3 Content Analysis; 11.7.4 Applications of Gesture Tracking; 11.8 Case Study; 11.8.1 Horsetail Embroidery; 11.8.2 Interviewing Master Craftspeople; 11.8.3 Methodology and Technology; 11.8.4 Intangible Gestures of Shui Minority Horsetail Embroidery; 11.9 Discussion; 11.9.1 Digital Materialism; 11.9.2 Hybrid Materiality; 11.9.3 New Perspectives on Human Geographies; 11.10 Future Applications: Crafting the Intangible; 11.10.1 Interactive Media Installation; 11.10.2 Exploring Embodied Knowledge and Embodied Learning; 11.10.3 Explore Affordances of Interactive Media Experiences to Engage Corporeal Memory in Elderly, Less-Able Bodied; 11.11 Conclusions; References

12 Contactless Human Monitoring: Challenges and Future Direction (Upal Mahbub, Tauhidur Rahman, and Md Atiqur Rahman Ahad)
  12.1 Introduction; 12.2 Sensor-Level Challenges; 12.2.1 Accurate Sensing of the Real-World; 12.2.2 Optimum Number and Types of Sensors; 12.2.3 Large Scale Data Collection; 12.2.4 Privacy Issues; 12.2.5 Sensor Placement Challenges; 12.2.6 Resolution, Bandwidth and Power; 12.3 Feature-Level Challenges; 12.3.1 Handcrafted Versus Learning-Based Features; 12.3.2 Fusing Features; 12.3.3 Feature Dimension and Interpretability; 12.4 Algorithm-Level/Architecture-Level Challenges; 12.4.1 Preprocessing Steps; 12.4.2 Models and System Architecture; 12.4.3 Real-Time Motion Analysis; 12.4.4 Learning On-the-Fly; 12.4.5 Learning with Limited Labels; 12.4.6 Learning in the Presence of Label Noise; 12.4.7 Reduction of Multiply-Adds and Parameters of Networks; 12.4.8 Unsupervised Learning; 12.5 Implementation-Level Challenges; 12.5.1 Robust Performance; 12.5.2 Real-Time and On-the-Edge Solution; 12.5.3 Low Power, High-Performance Solutions; 12.5.4 Benchmark and Performance Evaluation; 12.5.5 Domain Invariant CHAA; 12.5.6 Multi-view Human Activity Analysis; 12.6 Application-Level Challenges; 12.6.1 Surveillance; 12.6.2 Extended Reality Applications; 12.6.3 Action Prediction; 12.6.4 Action Localization; 12.6.5 Game-Play Analysis; 12.6.6 Imitation Learning or Action Mimicking; 12.6.7 Assistive Technology; 12.6.8 High-Level Human-Computer Interaction; 12.7 Conclusion; References
Chapter 1
Vision-Based Human Activity Recognition

Tahmida Mahmud and Mahmudul Hasan
Abstract Human activity analysis is an important problem studied widely in the vision community. Recognizing activities is crucial for a number of practical applications in different domains such as video surveillance, video indexing, autonomous navigation, human-computer interaction, active sensing, assisted living, etc. Most works on vision-based human activity recognition can be categorized as methods based on either handcrafted features or deep features. Handcrafted feature-based methods utilize various spatio-temporal feature descriptors such as STIP, dense trajectories, improved dense trajectories, etc. Deep feature-based methods utilize hierarchical feature representations produced by deep learning models. Several deep architectures have been explored in this direction, including 2D CNNs and 3D CNNs with both single-stream and two-stream models. In spite of achieving high recognition accuracy, two-stream models have inherent limitations because of the additional computation cost imposed by the motion network fed with optical flow. Single-stream methods, which take only raw video frames as input, are simple and efficient to process, but lag behind in terms of accuracy. 3D CNNs trained on optical flow input still give a remarkable accuracy boost when used in ensembles. There has also been a line of work combining CNNs with LSTMs. In this chapter, we present a comprehensive overview of the recent methodologies, benchmark datasets, evaluation metrics and overall challenges of vision-based human activity recognition. We discuss the recent advances and the state-of-the-art techniques in an attempt to identify possible directions for future research.
T. Mahmud (B) Nauto, Inc., 220 Portage Ave, Palo Alto, CA 94306, USA e-mail: [email protected] M. Hasan Comcast AI, 1110 Vermont Avenue NW, suite 600, Washington, DC 20005, USA e-mail: [email protected] © Springer Nature Switzerland AG 2021 M. A. R. Ahad et al. (eds.), Contactless Human Activity Analysis, Intelligent Systems Reference Library 200, https://doi.org/10.1007/978-3-030-68590-4_1
1.1 Introduction

Human activity recognition (HAR) is an active research area in the video analysis community. Vision-based human activity recognition refers to the task of labeling images or videos involving human motion with different activity classes. An enormous amount of visual data is generated every day, and activity recognition makes it semantically meaningful through automatic annotation. Understanding activities is vital in different aspects of daily life such as surveillance, healthcare, autonomous navigation, active sensing, assisted living, human-computer interaction, etc. The solution to the recognition problem comprises a number of different research topics such as human detection, learning feature representations, tracking, pose estimation, classification, etc. Feature representation in videos is more challenging compared to images due to the associated temporal aspect. Different motion patterns of the same person lead to different activities. So, action representation depends largely on learning 3D spatio-temporal features. Before the advancement of deep learning-based methods, there had been a number of handcrafted feature-based activity recognition approaches [1–16]. Deep learning approaches [17–31, 33–37, 39–42] have enabled automatic learning of robust features for better action representation. In this chapter, we summarize different works on human activity recognition, relevant datasets and evaluation criteria, applications of activity recognition, and future research directions based on the current limitations. There have been other surveys on human activity recognition [43–47] focused on different aspects. Most of the literature cited in our survey has been collected from important computer vision journals and conferences such as IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), IEEE Transactions on Image Processing (TIP), IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision (ICCV), European Conference on Computer Vision (ECCV), and so on, which have achieved state-of-the-art performance and/or high citations. A histogram showing the number of publications on activity recognition in the last two decades, based on an IEEE Xplore search using the keyword "Activity Recognition", is shown in Fig. 1.1, which clearly demonstrates the increasing trend in relevant research over the years. This chapter is organized as follows: Section 1.1.1 gives an overview of vision-based human activity recognition. The challenges of vision-based human activity recognition are presented in Sect. 1.2. Different activity recognition approaches based on handcrafted features and deep features are discussed in Sect. 1.3. In Sect. 1.4.1, the difference between activity recognition and detection is described. An overview of activity recognition from still images is given in Sect. 1.4.2. In Sect. 1.4.3, there is a discussion on group and contextual activity recognition. Activity prediction, which is closely related to activity recognition, is discussed in Sect. 1.4.4. Section 1.5 provides an overview of the benchmark datasets and evaluation criteria for human activity recognition. The applications of human activity recognition are discussed in Sect. 1.6. Some future research directions are mentioned in Sect. 1.7 and conclusions are drawn in Sect. 1.8.
Fig. 1.1 Histogram of number of publications on activity recognition in the last two decades based on IEEE Xplore search using the keyword “Activity Recognition”
1.1.1 Overview of Vision-Based Human Activity Recognition

Vision-based human activity recognition approaches can be classified into two broad categories: handcrafted feature-based methods and deep feature-based methods. The handcrafted feature-based methods consist of several stages like human segmentation, feature extraction and classification. For static cameras, background subtraction [48–50] and tracking [51, 52] are popular methods for human segmentation. For moving cameras, optical flow [53, 54] and temporal difference [55, 56] are used for segmenting humans. Once the segmentation is done, relevant features are extracted for action representation based on shape, color, pose, body movement and silhouette. The goal of the handcrafted feature-based methods is to learn activity representations by capturing human motion and spatio-temporal changes. These approaches include STIP-based methods, spatio-temporal volume-based methods, and skeleton-joint trajectory-based methods. Classical machine learning algorithms like SVM, probabilistic graphical models, binary trees, etc. are applied on the extracted features for activity classification. Spatio-temporal volume-based (STV) methods are one of the earliest approaches for human activity recognition. These methods perform well for fixed cameras with no occlusion since background subtraction is enough to obtain shape information in such cases. A three-dimensional spatio-temporal template is used in these methods following template matching algorithms. The motion energy image (MEI) and motion history image (MHI) were used in [1]. In [2], a SIFT-based motion context descriptor was used for action representation. Histograms of Oriented Gradients (HOG) were used for human detection in [3]. In [4], 3D HOG features were introduced. A dictionary-based sparse learning method was used in [5] for action representation. The goal of the STIP-based methods is to extract the most interesting
points or positions with significant changes. Once the key region is selected, feature vectors are extracted to represent the regions and classification is performed. In [6], 3D-Harris spatio-temporal feature points were used. In [9], spatial pyramid and vocabulary compression techniques were combined together as a novel vocabulary building strategy. A spatio-temporal attention mechanism-based key region extraction method was proposed in [10]. In [11], a hybrid supervector method was used for activity representation. 3D-Harris spatio-temporal features and 3D scale-invariant feature transform detection methods along with visual word histograms were used in [12]. Spatio-temporal feature-based methods became popular since the local features are scale and rotation invariant and stable under illumination change and occlusion. They also do not need explicit background segmentation. However, the performance of these methods is affected by viewpoint change since camera motion leads to many background feature points. In skeleton-joint trajectory-based methods, joints of the human skeleton are tracked for activity representation. In [57], improved dense trajectories (iDT) were used where the displacement was calculated using an optical flow-based approach. Split clustering was used to analyze local motion trajectories in [58]. In [59], human detection and iDT features were combined together to reduce the effect of background trajectories. Stacked Fisher vectors based on iDT features were used in [60]. These trajectory-based methods are not affected by viewpoint change but require a robust 2D or 3D skeleton model and an efficient tracking algorithm. In pose estimation-based methods, the 2D/3D coordinates of the human body are transformed into geometric relational features (GRF) [61, 62]. After the feature extraction stage, classification algorithms are applied on the extracted features for human activity recognition. These classification approaches include dynamic time warping (DTW), discriminative models, and generative models. DTW measures the similarity between two temporal sequences with varying speed. Although there have been some initial works [63, 64] on activity recognition using DTW, it is not suitable for a large number of activity classes. Both generative and discriminative probabilistic graphical models have been successfully used for human activity recognition. There have been works using Hidden Markov Models (HMM) [13, 14, 65, 66], Dynamic Bayesian Networks (DBN) [67], Support Vector Machines (SVM) [15, 68], Artificial Neural Networks (ANN) [69], Kalman filters [70], binary trees [71, 72], multidimensional indexing [73], K nearest neighbors (K-NN) [16], and active learning [74–76]. In recent years, deep learning-based methods have achieved tremendous success in a variety of fields in the domain of computer vision, and activity recognition is no exception. There have been a lot of works on activity recognition using deep networks [17–31, 33–37, 39–42]. Based on the architectures of the deep networks, the methods can be categorized into single-stream convolutional networks (3D CNNs), two-stream convolutional networks (both 2D and 3D), and long short-term memory (LSTM) networks. Single-stream 3D CNNs (with spatio-temporal filters) [17, 18] achieved inferior performance compared to the other state-of-the-art methods since they could not utilize the benefits of ImageNet pre-training. In [19], features were extracted independently from each frame and the predictions from all the frames were
combined in an early, late or slow fusion manner, inspired by bag-of-words image modeling approaches. However, this approach cannot capture the temporal structure of the activity. Two-stream convolutional networks have two separate 2D CNNs to capture the appearance and motion features. The appearance branch is trained on raw frames from the video sequence, whereas the motion branch is trained on pre-computed motion features. The decisions from these two branches are combined in a late fusion manner to get the final prediction. The two-stream network was first introduced in [20]. In [24], motion vectors instead of optical flow were used as the input of the motion branch for real-time action recognition. A temporal segment network (TSN) was introduced in [25]. Mid fusion instead of late fusion of the two branches was proposed in [26] for achieving better performance. To reduce the added computation cost associated with the optical flow estimation method in two-stream networks while maintaining reasonable performance, single-stream inflated 3D CNN networks (I3D) were proposed to capture the 3D space-time representation of videos along with the ability to learn from ImageNet pre-training. However, a two-stream inflated 3D CNN trained on both raw frames and optical flow still achieves better performance. The two-stream I3D was introduced in [27], where the network structure of Inception-V1 was extended from two to three dimensions. In [28], the temporal 3D ConvNet was proposed by extending DenseNet. Similarly, in [38], the two-stream convolutional network was transformed into a three-dimensional structure using temporal pyramid pooling. Another body of deep learning-based approaches to activity recognition uses a combination of LSTM and CNN [29, 34, 35]. These methods consider videos as a collection of sequential images, and activities are represented by capturing the change in features from each frame. In [34], the Pseudo-3D Residual Net (P3D ResNet) was introduced. A combination of CNN and bidirectional LSTM was used in [35]. References [30, 31] include some active learning-based approaches for activity recognition using sparse autoencoders. A two-stream 3D ConvNet supporting videos of variable length and size was introduced in [36]. A discriminative pooling approach was introduced in [37], where only the features corresponding to the distinctive frames were used. A spatial-temporal network called StNet was proposed in [42], where a super-image is created using stacked frames, and a channel-wise and temporal-wise convolution is applied on the features extracted from that super-image. A detailed discussion of the state-of-the-art activity recognition methods is provided in Sect. 1.3.
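To make the two-stream idea concrete, the following sketch shows a minimal late-fusion classifier in PyTorch, assuming pre-extracted RGB frames and stacks of pre-computed optical flow as inputs. It is an illustrative simplification in the spirit of the two-stream architecture of [20], not a reproduction of any specific published model; the ResNet-18 backbone, the 10-channel flow stack (5 flow frames x 2 components), and the torchvision 0.13+ API are assumptions made for this example only.

import torch
import torch.nn as nn
from torchvision import models

class TwoStreamLateFusion(nn.Module):
    """Minimal two-stream action classifier with late (score-level) fusion."""
    def __init__(self, num_classes, flow_channels=10):
        super().__init__()
        # Appearance stream: a standard image CNN applied to a single RGB frame.
        self.rgb_stream = models.resnet18(weights=None, num_classes=num_classes)
        # Motion stream: same backbone, but the first convolution accepts a
        # stack of optical-flow fields (e.g., 5 frames x 2 components = 10 channels).
        self.flow_stream = models.resnet18(weights=None, num_classes=num_classes)
        self.flow_stream.conv1 = nn.Conv2d(flow_channels, 64, kernel_size=7,
                                           stride=2, padding=3, bias=False)

    def forward(self, rgb, flow):
        # Late fusion: average the class scores produced by the two streams.
        return 0.5 * (self.rgb_stream(rgb) + self.flow_stream(flow))

# Example usage with random tensors standing in for a video clip.
model = TwoStreamLateFusion(num_classes=101)
rgb = torch.randn(2, 3, 224, 224)     # batch of RGB frames
flow = torch.randn(2, 10, 224, 224)   # batch of stacked optical-flow fields
scores = model(rgb, flow)             # shape (2, 101): per-class scores
print(scores.shape)

In practice, the two streams are usually trained separately (the motion stream on densely computed optical flow), and fusion can also be performed at intermediate layers rather than at the score level, as proposed in [26].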
1.2 Challenges in Vision-Based Human Activity Recognition Although there have been enormous amount of research focused on the domain of vision-based human activity recognition, the problem is still challenging due to the difficulties associated with different tasks, such as:
• Low image resolution: This challenge is prevalent in recognition tasks on the edge, such as in surveillance cameras, where a lower bit rate is crucial due to limited bandwidth and low-resolution videos are sent to the processing servers.
• Object occlusion: Object occlusion is a common problem for activity recognition in crowded places such as airports, bus stations, concert gatherings, etc., where a large number of people and objects are present, generating a lot of occlusions.
• Illumination change: The performance of a recognition system is expected to be consistent during different times of the day and under different weather conditions, especially in the automotive space. Working with infrared images at night is challenging as well.
• Viewpoint change: Multi-view activity recognition is important for camera networks, where the recognition system needs to handle viewpoint changes caused by different camera positions and viewing angles.
• High intra-class variance: For fine-grained actions such as cooking activities or robotic arm manipulation, intra-class variance is high, which makes the recognition task difficult.
• Low inter-class variance: Recognition methods need to generalize well over variations within one class and distinguish among different classes, but fine-grained activity datasets have higher overlap between classes (e.g., the MPII Cooking dataset [32] has closely related activities like 'cut apart' and 'cut slices').
• Spatio-temporal scale variation: Generally, any public surveillance system has to deal with huge spatio-temporal scale variations. Depending on the camera placement, an activity may occur either very close to the camera or very far away. It is a challenging task to develop algorithms that can deal with such variations.
• Target motion variation: Highly variable target motion poses a significant challenge for recognizing traffic activities.
• Video recording settings variation: Video recording settings can be very different across activities in movie datasets or for videos collected from the web. Motion features can be affected by the difference in frame rate among the captured videos.
• Inter-personal differences: Different persons may perform a task differently, which is challenging for recognizing non-cyclic activities like 'avoiding obstacles while walking', 'carrying an object', etc.
1.3 Human Activity Recognition Approaches

As discussed in Sect. 1.1.1, robust feature representation is a vital stage for accurate recognition of activities. Here, we discuss the details of the state-of-the-art handcrafted feature-based methods and deep feature-based methods along with some distinct classification approaches with their advantages and disadvantages.
1.3.1 Handcrafted Feature-Based Human Activity Recognition

Handcrafted feature-based human activity recognition methods, also known as the traditional approaches, aim to capture human motion and spatio-temporal changes for action representation. The features representing the activities are extracted from the raw pixels, and recognition is performed by applying a classification algorithm. Before the advancement of deep feature-based methods, handcrafted feature-based approaches achieved state-of-the-art performance on the publicly available video classification datasets. Here we discuss the most popular handcrafted feature-based recognition approaches.
1.3.1.1 Motion Energy Image and Motion History Image
A temporal template, consisting of a static vector-image, was used in [1] for activity representation. Each point of the vector-image is a function of the motion properties at that spatial location. The template has two components: the first represents the occurrence of motion with a binary motion-energy image (MEI), and the second indicates how the motion is changing with a motion-history image (MHI). In their approach, the temporal segmentation is invariant to linear changes in speed, and a recognition method matches the templates against known activity instances. Example frames containing different moves along with their MEI and MHI representations are shown in Fig. 1.2.
Fig. 1.2 One example frame containing an aerobic exercise move along with the MEI and MHI representations. The figure has been redrawn from [1]
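To make the temporal-template idea concrete, the following sketch (an illustrative reimplementation, not the original code of [1]) accumulates a binary motion-energy image and a decaying motion-history image from a grayscale frame sequence using simple frame differencing; the threshold and duration parameters are assumptions.

```python
import numpy as np

def motion_templates(frames, thresh=30, tau=20):
    """Compute MEI and MHI from a list of grayscale frames (H x W uint8 arrays).

    MEI: binary image marking where any motion occurred.
    MHI: recency-weighted image; brighter pixels moved more recently.
    """
    mei = np.zeros(frames[0].shape, dtype=np.uint8)
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Binary silhouette of motion between consecutive frames.
        motion = np.abs(curr.astype(np.int16) - prev.astype(np.int16)) > thresh
        mei |= motion.astype(np.uint8)                                   # union of all motion regions
        mhi = np.where(motion, float(tau), np.maximum(mhi - 1.0, 0.0))   # decay older motion
    return mei, mhi / tau                                                # MHI normalized to [0, 1]
```

Recognition then proceeds by matching moment-based descriptors of the MEI/MHI pair against stored activity templates.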
1.3.1.2 Scale-Invariant Feature Transform (SIFT)
There have been some works on activity recognition [2, 82] using the Scale-Invariant Feature Transform (SIFT) [83]. In [2], a difference-of-Gaussian (DoG) function was applied to identify scale- and orientation-invariant interest points, and the more stable points were selected as keypoints. Orientations were assigned based on local image gradient directions. Finally, the image gradients were measured and transformed into a representation that tolerates local shape distortion and illumination changes. A 3D SIFT descriptor was introduced in [82], which is better suited to the 3D nature of video data. The major disadvantage of SIFT features is that they do not use any color information, which can provide important clues for activity recognition.
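As a minimal illustration of SIFT-based interest points, one can extract keypoints and 128-D descriptors per frame and feed them into a bag-of-words or matching pipeline. The sketch below assumes OpenCV 4.4 or later, where SIFT is available in the main module.

```python
import cv2

def sift_descriptors(gray_frame):
    """Detect DoG keypoints and compute 128-D SIFT descriptors for one grayscale frame."""
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray_frame, None)
    return keypoints, descriptors  # descriptors: (num_keypoints, 128) array, or None if no keypoints
```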
1.3.1.3 Histogram of Oriented Gradient (HOG) Features
A number of works used Histogram of Oriented Gradient (HOG) features for activity recognition [3, 84–87]. In [3], a dense grid was used to evaluate normalized local histograms of image gradient orientations. Within overlapping descriptor blocks, fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization were used. A template-based algorithm using a PCA-HOG descriptor was proposed in [84] for joint tracking and recognition, where the actors were first mapped to a HOG descriptor grid and then projected onto a linear subspace using Principal Component Analysis (PCA). An example scenario is shown in Fig. 1.3. Since local descriptors are extracted at a fixed scale, HOG features are sensitive to the size of the human. In [85], activities were classified by comparing motion histograms. Activities were represented as words in [86], where each word corresponds to a pose described by HOG features. In [87], poses were represented by time series of HOG features and movements by time series of Histogram of Oriented Optical Flow (HOOF) features for action recognition. A 3D HOG descriptor was proposed in [4].
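A rough sketch of the PCA-HOG idea from [84] follows; the library choices (scikit-image, scikit-learn) and HOG parameters are ours, not the original implementation.

```python
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import PCA

def pca_hog_features(windows, n_components=12):
    """windows: list of equally sized grayscale person crops (2D arrays)."""
    descriptors = np.array([
        hog(w, orientations=8, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for w in windows
    ])
    pca = PCA(n_components=n_components)
    reduced = pca.fit_transform(descriptors)   # low-dimensional PCA-HOG descriptors
    return reduced, pca                        # pca.inverse_transform reconstructs HOG space
```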
Fig. 1.3 a The image gradient. b The HOG descriptor with a 2 × 2 grid and 8 orientation bins. c The PCA-HOG descriptor with 12 principal components. d The reconstructed HOG descriptor. The figure has been redrawn from [84]
1.3.1.4 3D-Harris Spatio-Temporal Features
An extended version of the Harris corner detector-based features [88], called the 3D-Harris features, was introduced in [6], where space-time interest points are locations with significant variation within their local neighborhood in the spatio-temporal domain. By maximizing a normalized spatio-temporal Laplacian over spatial and temporal scales, the spatio-temporal extents of the detected activities were estimated.
1.3.1.5 Improved Dense Trajectory Features (iDT)
A method using Improved Dense Trajectory (iDT) features for activity recognition was introduced in [57]. SURF descriptors and dense optical flow were used to match feature points between frames for camera motion estimation, followed by RANSAC for homography estimation. Since human motion produces inconsistent matches, a human detector was used to discard such matches and remove the associated trajectories. Camera motion was then separated from the optical flow using the estimated homography. Examples of optical flow estimation before and after rectification (background suppression) are shown in Fig. 1.4.
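The camera-motion estimation step can be sketched as below; ORB is used here in place of SURF for licensing simplicity, so this is an analogous illustration rather than the exact pipeline of [57]. Matched keypoints between consecutive frames are fed to a RANSAC homography fit, and the previous frame can then be warped so background motion is cancelled before computing optical flow.

```python
import cv2
import numpy as np

def estimate_camera_motion(prev_gray, curr_gray):
    """Return a 3x3 homography approximating camera motion between two frames."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # Robustly fit the dominant (background) motion; foreground matches become outliers.
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)
    return H  # warp with cv2.warpPerspective(prev_gray, H, prev_gray.shape[::-1]) before flow
```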
Fig. 1.4 The first column shows images of two consecutive frames superimposed; the second column shows the computed optical flow [8] between the two frames; the third column shows the optical flow when the camera motion is removed and the last column shows the removed trajectories in white due to camera motion. The figure has been redrawn from [57]
There are several other notable state-of-the-art approaches for human activity recognition in terms of feature selection and classification, which are discussed below.
1.3.1.6 Dynamic Time Warping (DTW)
Dynamic Time Warping (DTW) measures the distance/similarity between two variable-length temporal sequences that may differ in time or speed. This template matching algorithm considers the pair-wise distance between frames, computes the alignment cost for the sequences, and calculates the optimal alignment. In [89], the space of temporal warping functions for a specific activity was considered for the alignment. A method exploiting the shape deformations of the human silhouette was introduced in [63] to extract shape-based features for gait recognition, where the shape sequences were compared using a modified DTW algorithm. In [90], sequences were aligned on image position and scale along with the temporal dimension using a dynamic space-time warping algorithm. In [64], an exemplar-based sequential single-layered DTW approach was introduced for activity recognition. Sequences of depth maps overlaid with the segmented human region and skeleton tracking results for some activities are shown in Fig. 1.5. In spite of being fast and simple, DTW-based activity recognition approaches do not perform well on datasets with high inter-class variation and are prone to high computational cost.
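A minimal dynamic-programming implementation of DTW over per-frame feature vectors is sketched below; this is the generic algorithm, not the modified variants of [63, 90].

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """seq_a: (Ta, D), seq_b: (Tb, D) per-frame feature sequences; returns alignment cost."""
    ta, tb = len(seq_a), len(seq_b)
    cost = np.full((ta + 1, tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, ta + 1):
        for j in range(1, tb + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])   # frame-to-frame distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[ta, tb]
```

Recognition then amounts to assigning a test sequence the label of the template (or exemplar) with the smallest DTW distance.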
1.3.1.7 Kalman Filter
Kalman filters have a prediction stage (estimation) and a correction stage (measurement). These two stages are applied recursively to predict and refine the current estimate based on past measurements. The filter minimizes the mean squared error while estimating the state of a dynamic process. In [70], a Kalman filter was used to track pedestrians, analyze their position and velocity, and classify their activities.
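A constant-velocity Kalman filter for tracking a pedestrian's 2D image position is sketched below (state = [x, y, vx, vy]); the noise covariances are illustrative assumptions, not values from [70].

```python
import numpy as np

class ConstantVelocityKalman:
    def __init__(self, dt=1.0):
        self.x = np.zeros(4)                                    # state: [x, y, vx, vy]
        self.P = np.eye(4) * 100.0                              # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = dt    # constant-velocity motion model
        self.H = np.eye(2, 4)                                   # only position is observed
        self.Q = np.eye(4) * 0.01                               # process noise
        self.R = np.eye(2) * 1.0                                # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x

    def update(self, z):                                        # z: measured [x, y]
        y = z - self.H @ self.x                                 # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)                # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x
```

The estimated velocities over time can then be used as features for a simple activity classifier (e.g., walking vs. running vs. standing).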
Fig. 1.5 Sequences of depth maps including segmented human region and skeleton tracking results for waving. The figure has been redrawn from [64]
1.3.1.8 DFT Features and K-Nearest Neighbor (K-NN)
In [16], average Discrete Fourier Transform (DFT) features of small image blocks were used for feature extraction, followed by K-Nearest Neighbor (K-NN) classification. In K-NN, objects are classified based on the closest training examples. Each training sample is a labeled vector in the multidimensional feature space, and during classification each unlabeled vector is assigned the most frequent label among its K nearest neighboring samples. The computation time is independent of the number of classes, and the method works well with non-linearly separable data. However, performance is sensitive to the choice of K.
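The block-DFT plus K-NN pipeline can be sketched as follows; the block size, the value of K, and the use of scikit-learn are our assumptions, not the exact settings of [16].

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def block_dft_features(gray_frame, block=16):
    """Average DFT magnitude of non-overlapping blocks, flattened into a feature vector."""
    h, w = gray_frame.shape
    feats = []
    for i in range(0, h - block + 1, block):
        for j in range(0, w - block + 1, block):
            spectrum = np.abs(np.fft.fft2(gray_frame[i:i + block, j:j + block]))
            feats.append(spectrum.mean())
    return np.array(feats)

# With X_train (stacked feature vectors) and y_train (activity labels) prepared:
# knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
# predictions = knn.predict(X_test)
```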
1.3.2 Deep Feature-Based Human Activity Recognition

Given the diversity of activity classes, it is hard to find a universal handcrafted feature suitable for all datasets. This initiated the need to learn robust discriminative features directly from raw data. Similar to image classification, deep feature-based methods have achieved state-of-the-art performance in video classification by automating feature extraction and classification. Efficient training of these deep models requires a large amount of labeled data, which has motivated the development of many large-scale video benchmark datasets. Here we discuss the most popular deep feature-based recognition approaches, progressing from simpler to more complex methods.
1.3.2.1 Fusion-Based CNN
A fusion-based CNN architecture for activity recognition was introduced in [19]. Similar to ImageNet for image classification, they also introduced the Sports-1M dataset with 1M videos to take advantage of pretraining. Early fusion, late fusion, and slow fusion were used to combine the predictions from individual frames. For early fusion, the filters of the first convolutional layer were extended with an additional temporal dimension. For late fusion, two single-frame networks with shared parameters were applied to two separate streams (15 frames apart) and merged in the first fully connected layer. Slow fusion is similar to the approaches used in [17, 18], where the connectivity of all the convolutional layers was extended in time by applying temporal convolution. The slow fusion model performs best among these alternatives. The different fusion approaches are shown in Fig. 1.6. They also proposed a multiresolution model for speeding up computation using a fovea stream (the center region at the original resolution) and a context stream (frames downsampled to half the original spatial resolution).

Fig. 1.6 Different fusion approaches to combine temporal information. The red, green and blue boxes represent convolutional, normalization and pooling layers respectively. The depicted columns share parameters in the Slow Fusion model. The figure has been redrawn from [19]
1.3.2.2 Two-Stream 2D ConvNet
Video representation consists of two components: a spatial component, which provides information about the relationship between scenes and objects, and a temporal component, which incorporates the motion of objects across frames. The two-stream 2D ConvNet was first introduced in [20], where two separate streams (spatial and temporal) feed the two branches (2D CNNs) of the network and the predictions are combined in a late fusion manner. Raw video frames and dense optical flow were used as inputs to incorporate both appearance and motion information. The appearance branch can take advantage of ImageNet pretraining, which proved to be useful. The network does not need to implicitly encode motion information as in [18], since the optical flow displacement fields between adjacent frames are stacked together at the input of the motion branch. The proposed network architecture is shown in Fig. 1.7. There have been several improvements on [20], e.g., replacing optical flow with motion vectors [24] and fusing the two networks at a convolutional layer (mid-fusion) instead of using late fusion [26].
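A compact PyTorch sketch of the two-stream design follows; the ResNet-18 backbones, the stack of 10 flow frames, and score averaging are illustrative assumptions, not the exact configuration of [20].

```python
import torch.nn as nn
import torchvision.models as models

class TwoStreamNet(nn.Module):
    def __init__(self, num_classes, flow_stack=10):
        super().__init__()
        self.spatial = models.resnet18(num_classes=num_classes)    # single RGB frame
        self.temporal = models.resnet18(num_classes=num_classes)
        # Temporal stream input: 2 * flow_stack channels (x/y flow for each stacked frame pair).
        self.temporal.conv1 = nn.Conv2d(2 * flow_stack, 64, kernel_size=7,
                                        stride=2, padding=3, bias=False)

    def forward(self, rgb, flow):
        # Late fusion: average the class scores of the two streams.
        return (self.spatial(rgb) + self.temporal(flow)) / 2
```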
Fig. 1.7 The two-stream 2D ConvNet network architecture redrawn from [20]
1.3.2.3 Single-Stream 3D ConvNet
A single-stream 3D ConvNet was introduced in [18], where 3D convolutional operations helped extract discriminative features from both spatial and temporal dimensions. The single-stream network encoded multiple channels of information from adjacent frames by applying convolution and subsampling in each channel with a 3D kernel, and the information from all channels was finally combined. Motion information was captured by connecting the feature maps in a convolution layer to multiple adjacent frames in the previous layer. An issue with 3D ConvNets is the increased number of parameters compared to 2D ConvNets because of the added kernel dimension, which makes training harder. Training from scratch without ImageNet pretraining is one of the principal reasons behind their inferior performance.
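The core operation is a 3D convolution over a clip tensor of shape (batch, channels, time, height, width). The following is a minimal sketch of that idea, not the architecture of [18].

```python
import torch.nn as nn

class Tiny3DConvNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),        # pool only spatially at first
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=2),                # pool space and time
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):                            # clip: (B, 3, T, H, W)
        return self.classifier(self.features(clip).flatten(1))
```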
1.3.2.4 3D Convolutional Network (C3D)
A deep 3D convolutional network (3D ConvNet) trained on a large-scale supervised video dataset was introduced in [21] for spatio-temporal feature learning. Their novelty lies in the fact that, as opposed to [18, 19], this was the first work to train 3D ConvNets on large-scale supervised datasets. Unlike the segmented video volumes used in [18], this method takes raw video frames as input without any preprocessing. They introduced the C3D features to encode generic information about objects, scenes, and actions, selectively attending to both motion and appearance. C3D features focus on the appearance in the first few frames and then track the salient motion in the following frames.
Fig. 1.8 An overview of the two-step framework proposed in [17]
1.3.2.5 3D ConvNet+LSTM
In [17], a two-step method was used for activity recognition: a 2D ConvNet was extended to a 3D ConvNet to extract spatio-temporal video features, and an LSTM network was then applied to the features for sequence classification. An overview of the framework is shown in Fig. 1.8.
1.3.2.6 Trajectory-Pooled Deep-Convolutional Descriptor (TDD)
Videos were represented using trajectory-pooled deep-convolutional descriptors (TDD) in [22]. Trajectory constrained pooling was applied to the convolutional feature maps. The feature maps were transformed using spatio-temporal normalization and channel normalization.
1.3.2.7 Sparse AutoEncoder
A sparse autoencoder-based active learning method was used in [31] for activity recognition. Active learning helps to take advantage of incoming unlabeled instances in continuous streams and thus incrementally improves the model with fewer manually labeled instances. They used a multi-layer sparse autoencoder with one input, one output, and a number of hidden layers in the middle. The gradient diffusion problem in multi-layer neural networks was prevented by training each layer separately and then stacking them together.
1.3.2.8 Temporal Segment Network (TSN)
A Temporal Segment Network (TSN) was introduced in [25], where uniformly distributed video frames are sparsely sampled for activity recognition. Information from the sampled snippets is then combined using a segmental structure to model long-range temporal dependency at low cost. They also discussed and incorporated several good practices, such as cross-modality pre-training covering RGB, RGB difference, optical flow, and warped optical flow; regularization; and enhanced data augmentation to overcome the limitation of fewer training samples.

Fig. 1.9 The LRCN network architecture redrawn from [29]
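The sparse sampling and segmental consensus at the heart of TSN can be sketched as below; this is a simplified single-modality version with an assumed ResNet-18 backbone and average consensus, not the exact configuration of [25].

```python
import torch.nn as nn
import torchvision.models as models

class TemporalSegmentNet(nn.Module):
    def __init__(self, num_classes, num_segments=3):
        super().__init__()
        self.num_segments = num_segments
        self.backbone = models.resnet18(num_classes=num_classes)   # one shared 2D CNN

    def forward(self, snippets):
        # snippets: (B, num_segments, 3, H, W), one frame sampled per video segment.
        b, k, c, h, w = snippets.shape
        logits = self.backbone(snippets.reshape(b * k, c, h, w))
        # Segmental consensus: average the snippet-level class scores.
        return logits.reshape(b, k, -1).mean(dim=1)
```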
1.3.2.9 Long-Term Recurrent Convolutional Network (LRCN)
In [29], convolutional layers were combined with long-range temporal recursion for activity recognition. Each frame is passed through a CNN to produce a fixed-length feature vector, which is fed into an LSTM layer. They experimented with two variants: the LSTM placed after the first fully connected layer and after the second fully connected layer. Both RGB and optical flow were used, and the best performance was achieved by taking a weighted average of the two streams' predictions. The network architecture is shown in Fig. 1.9.
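A sketch of the per-frame CNN followed by an LSTM is given below; the feature extractor, hidden size, and last-step classification are illustrative assumptions rather than the original LRCN configuration.

```python
import torch.nn as nn
import torchvision.models as models

class LRCNSketch(nn.Module):
    def __init__(self, num_classes, hidden=256):
        super().__init__()
        cnn = models.resnet18()
        self.cnn = nn.Sequential(*list(cnn.children())[:-1])     # drop the final FC head
        self.lstm = nn.LSTM(input_size=512, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, clip):                                     # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).flatten(1)          # (B*T, 512) per-frame features
        out, _ = self.lstm(feats.reshape(b, t, -1))              # temporal recursion
        return self.fc(out[:, -1])                               # classify from the last step
```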
1.3.2.10 Inflated 3D Convolutional Network (I3D)
A two-stream Inflated 3D ConvNet (I3D) was introduced in [27], where the filters and pooling kernels of very deep 2D ConvNets, along with their parameters, were inflated to capture spatio-temporal information while taking advantage of ImageNet pretraining. The I3D model based on Inception-V1 achieved state-of-the-art performance after being pre-trained on Kinetics. They showed that, although 3D CNNs should implicitly learn motion features from RGB frames, adding an additional optical flow stream still improves performance. The network architecture with the Inception-V1 backbone is shown in Fig. 1.10.

Fig. 1.10 The I3D network architecture with Inception-V1 backbone. Here, Inc. represents the Inception submodule. The figure has been redrawn from [27]

Fig. 1.11 The 3D temporal transition layer redrawn from [28]
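The inflation trick used by I3D is simple: a pretrained 2D kernel of shape (out, in, k, k) is repeated T times along a new temporal axis and divided by T, so the inflated network initially reproduces the 2D network's activations on a "boring" video of repeated frames. A minimal PyTorch sketch under those assumptions:

```python
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim=3):
    """Build a Conv3d whose weights are the 2D weights copied across time and scaled by 1/time_dim."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(time_dim,) + conv2d.kernel_size,
                       stride=(1,) + conv2d.stride,
                       padding=(time_dim // 2,) + conv2d.padding,
                       bias=conv2d.bias is not None)
    w2d = conv2d.weight.data                                    # (out, in, k, k)
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
    conv3d.weight.data.copy_(w3d)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d
```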
1.3.2.11 Temporal 3D Convolutional Network (T3D)
A new temporal layer called Temporal Transition Layer (TTL) was introduced in [28] to model variable temporal convolution kernel depths. They extended the DenseNet architecture using 3D filters and pooling kernels to Temporal 3D ConvNet (T3D). They also proposed a supervision transfer from 2D to 3D CNN to avoid training from scratch. The 3D temporal transition layer is shown in Fig. 1.11.
1.3.2.12 Long-Term Temporal Convolutions (LTC)
Long-Term Temporal Convolutions (LTC) were introduced in [33]. The temporal extent of the video representations was increased while the spatial resolution was decreased to keep network complexity tractable. They studied the impact of raw video pixels and optical flow vectors (low-level representations) and showed the importance of high-quality optical flow estimation for better recognition performance.
1.3.2.13 Pseudo-3D Residual Network (P3D)
A novel bottleneck building block called Pseudo-3D (P3D) was introduced in [34]. A combination of one 1 × 3 × 3 convolutional layer and one 3 × 1 × 1 convolutional layer, arranged in parallel, replaces the standard 3 × 3 × 3 convolutional layer to reduce the size of the model. By initializing the 1 × 3 × 3 filters with the 3 × 3 convolutions of a 2D CNN, ImageNet pretraining could be leveraged. Different types of bottleneck building blocks in a residual framework were used in the Pseudo-3D Residual Network, motivated by the idea that enhancing structural diversity along with increasing depth should yield better performance.
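The factorization can be sketched as a residual block in which a spatial 1 × 3 × 3 convolution and a temporal 3 × 1 × 1 convolution stand in for the full 3 × 3 × 3 kernel. This is a simplified parallel variant for illustration; the original work [34] mixes several block designs.

```python
import torch.nn as nn

class PseudoP3DBlock(nn.Module):
    """Parallel spatial + temporal factorization of a 3x3x3 convolution, with a residual path."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                              # x: (B, C, T, H, W)
        out = self.spatial(x) + self.temporal(x)       # parallel spatial and temporal branches
        return self.relu(out + x)                      # residual connection
```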
1.3.2.14 CNN + Deep Bidirectional LSTM
A novel activity recognition method combining a CNN with a deep bidirectional LSTM (DB-LSTM) network was proposed in [35]. Features were extracted from every sixth frame using a pre-trained AlexNet, and a two-layer DB-LSTM with forward and backward passes was used to learn the sequential features.
1.3.2.15 Two-Stream 3D-ConvNet Fusion
In [36], the requirement of a fixed frame size and a fixed length of input videos for 3D ConvNets was removed. A spatial-temporal pyramid pooling (STPP) ConvNet was used to extract descriptors of equal dimension from frame sequences of different sizes, followed by an LSTM or CNN-E to learn a global description. The spatial and temporal streams were combined by late fusion.
1.3.2.16 Discriminative Pooling
Generating independent predictions from short video segments and applying a pooling operation is a common approach in many deep feature-based activity recognition methods. However, not all frames are equally important or distinctive for representing the activities, and an end-to-end trainable discriminative pooling approach was proposed in [37] to overcome this redundancy issue. A nonlinear hyperplane separating the distinctive features was learned first, and its parameters were used as the feature descriptor of the video by applying multiple instance learning. This descriptor is called the SVM Pooled (SVMP) descriptor. A visualization of discriminative pooling is shown in Fig. 1.12.
Fig. 1.12 A visualization of sequential discriminative pooling performed on RGB frames: (i) a sample frame, (ii) average pooling across all frames, (iii) rank pooling resulting in a dynamic image, and (iv) SVM pooling. The figure has been redrawn from [37]
Fig. 1.13 Network architecture with temporal pyramid pooling. The orange blocks and the blue blocks denote the spatial stream and the temporal stream respectively. The temporal pyramid pooling (TPP) layer in each stream combines the frame-level features for video representation. The scores from the two streams are aggregated using a weighted average fusion method. The two ConvNets applied to different segments share weights. The figure has been redrawn from [38]
1.3.2.17 Deep Networks with Temporal Pyramid Pooling (DTPP)
A global and sequence-aware end-to-end learning framework was proposed in [38], where the Deep Networks with Temporal Pyramid Pooling (DTPP) approach was introduced. RGB frames and optical flow were sparsely sampled from the video and the features were combined using a temporal pyramid pooling layer. The network architecture is shown in Fig. 1.13.
Fig. 1.14 The overview of the network redrawn from [39]
1.3.2.18 Appearance-and-Relation Networks (ARTNet)
In [39], multiple building blocks called SMART are stacked together to explicitly learn appearance and relation information from RGB input. The generic blocks factorize the spatio-temporal module into an appearance branch and a relation branch for spatial and temporal modeling respectively and these are combined with a concatenation and reduction operation. The overview of the model framework is shown in Fig. 1.14.
1.3.2.19 Spatial-Temporal Network (StNet)
In [42], a novel spatial-temporal network (StNet) was proposed, where N successive video frames are stacked into a super-image with 3N channels and 2D convolution is applied on the super-image to capture local spatio-temporal relationships. A temporal Xception block, incorporating separate channel-wise and temporal-wise convolutions over the feature sequence, was used for modeling global spatio-temporal relations. The work in [42] differs from [25] in that it samples temporal segments containing multiple adjacent frames rather than a single frame. The temporal Xception block (TXB) is shown in Fig. 1.15.

Fig. 1.15 The temporal Xception block (TXB) redrawn from [42]. The parameters in the bracket denote the (number of kernels, kernel size, padding, number of groups) configuration of the 1D convolution. Blocks in green and blue denote channel-wise and temporal-wise 1D convolutions respectively

1.4 Beyond Individual Activity Recognition from Videos

In this section, we discuss some topics relevant to activity recognition for the sake of completeness: the difference between recognition and detection (trimmed vs. untrimmed activities), methods for recognizing activities from still images, methods for recognizing activities performed by a group of people, and the difference between recognition and prediction (observed vs. unobserved).

1.4.1 Human Activity Recognition Versus Detection

In spite of being closely related tasks, there is a fundamental difference between human activity recognition and detection in videos. Activity recognition refers to the task of labeling segmented or trimmed videos, whereas detection is the task of both localizing and classifying the occurrences of different actions in a long untrimmed video sequence. Activity recognition does not require such temporal localization, since the video contains only one activity between its start and end points. Detection [77–79] is much more difficult than recognition but at the same time a more realistic problem for real-world applications. In [77], a bounding box is located around a person using a tracking algorithm; a multi-stream CNN is then used, followed by a bi-directional Long Short-Term Memory (LSTM) to capture long-term temporal dynamics. In [78], an end-to-end structured segment network (SSN) is proposed, which models the temporal structure of activities using a structured temporal pyramid. In [79], a Region Convolutional 3D Network (R-C3D) is used to generate candidate temporal regions containing activities. Inspired by the Faster R-CNN framework for object detection, [80] proposed a network called TAL-Net for temporal action localization. A language-driven approach was proposed in [81], where a semantic matching reinforcement learning model was used for action localization.
1.4.2 Activity Recognition from Images

Activity recognition from still images is an important research topic in the vision community. Unlike videos or sequential images, activities are recognized here from a single image. Since no temporal information is available, existing spatio-temporal features are not suitable for this purpose. The task becomes more challenging because of the limited context and the effect of cluttered backgrounds, and the lack of motion cues makes it harder to separate humans from the background. However, since a large number of images are available online, analyzing human activities in those images is valuable and would facilitate content search and image retrieval. There have been surveys [91] focused on still image-based activity recognition. Different high-level cues, including the human body, body parts, objects, human-object interaction, and the whole scene or context, have been used for recognizing activities from still images. Here, we discuss some of the state-of-the-art approaches in this domain.

The human body is a crucial context for still image-based action recognition. Reference [92] used a group of visual discriminative instances called Exemplarlets for each activity, and Latent Multiple Kernel Learning (L-MKL) was used to learn the activities. An unsupervised method was proposed in [93], where pairs of images were matched based on the coarse shape of the human figures. Human pose was used for action recognition in [94] using an extended Histogram of Oriented Gradient (HOG)-based descriptor. A random forest-based approach using a discriminative decision trees algorithm was proposed in [95] to combine the strengths of discriminative feature mining and randomization. In [96], attributes and parts were used for activity recognition, where the attributes are related to verbs in the language.

Similar to the human body, body parts also play an important role in still image-based activity recognition. In [97], a bag-of-features-based method was used along with the part-based latent SVM approach. There are poselet-based action classifiers [98, 99], where the Poselet Activation Vector has been used as the feature. A graphical model encoding the positions of different body parts and the activity class was proposed in [100]. In [101], a 2.5D graph was used, where nodes represent key-points in the human body and edges incorporate the spatial relationships between them.

Objects are a crucial cue for action recognition since they are closely related to the corresponding activity. A multiple instance learning-based approach was used in [102]. In [103], a random field model was used for recognizing activities involving human-object interaction. There have been other works [104–107] utilizing human-object interactions for activity recognition in images. In [108], an intermediate layer of superpixels was used, whose latent classes act as the attributes of the activity. The work in [109] used a number of discriminatively learned part templates.

There have been different deep learning approaches for activity recognition in still images as well. Contextual cues were incorporated in [110] using R-CNN. In [111], a human-mask loss was introduced to automatically guide the feature map activations toward the human performing the activity, and a multi-task deep learning model was proposed to jointly predict the action label and the location heatmap. An RNN was used for feature extraction and an SVM as the classifier in [112]. In [113], a novel Hybrid Video Memory machine was proposed to hallucinate temporal features of still images from video memory so that activities can be recognized from a few still images. In [114], a deep network with two parts was used for action recognition: one for part localization and the other for action prediction. In [115], an encoder-decoder CNN was used with a novel technique for encoding optical flow that transforms an image into a flow map.
1.4.3 Group and Contextual Activity Recognition

Understanding complex visual scenes requires information about the interactions between humans and objects in a social context. Group activity recognition [116, 117, 119–131, 133–139] facilitates exploring such local and global spatio-temporal relationships, since the dynamics of the scene are affected by the dynamics of the individuals. The task is often referred to as contextual activity recognition, since the surrounding information (objects and other activities) acts as useful context on which the activities depend.

Because of the hierarchical nature of the task, probabilistic graphical models [119, 121, 124, 126, 127] have been a popular choice for group activity recognition. In [127], an adaptive latent structure was used for learning the relationships. In [119, 120], the social behaviors of individuals were explored. The work in [128] used a joint framework for tracking and recognition, and the work in [123] used a 3D Markov random field for group activity localization. In [129–131], AND-OR graphs were used, and [122] used dynamic Bayesian networks. However, graphical models based on hand-crafted features cannot fully capture the complexity of the task.

Deep learning methods have achieved superior performance in group activity recognition [117, 132–139]. Reference [133] used multiple LSTMs for different persons and combined them to build a higher-level model. In [134], a unified framework was proposed by combining a probabilistic graphical model with a ConvNet, where a message passing neural network was used to model the probabilistic graphical model and belief propagation layers were used for label refinement. The work in [138] introduced a novel energy layer used for energy estimation of the predictions. In [117], a Graph Convolutional Network was used to learn Action Relation Graphs (ARG). In [118], a Convolutional Relational Machine (CRM) was used for group activity recognition, where activity maps are produced for individual activities followed by an aggregation step.
1.4.4 Activity Prediction

While activity recognition refers to the classification of observed activities, activity prediction and early prediction refer to predicting the class of a future unobserved activity and of a partially observed ongoing activity, respectively. In early prediction, only the first few frames of the activity have been observed; in prediction, none of the frames of the activity is available beforehand. Activity prediction is crucial for applications where an anticipatory response is required. There have been a number of works on early prediction using the Probabilistic Suffix Tree (PST) [140], the augmented Hidden Conditional Random Field (a-HCRF) [141], kernel-based reinforcement learning [142], max-margin learning [143], and so on. Semantic scene labeling [144], Markov Random Fields (MRF) [145], inverse reinforcement learning [146], context-based LSTM models [147], CNNs and RNNs [148], multi-task learning [149], etc. have been used for the future prediction task. In [150], future visual representations are predicted from a single frame.
1.5 Benchmark Datasets and Performance Evaluation Criteria

State-of-the-art vision-based activity recognition approaches use publicly available datasets designed for training, evaluation, and scientific comparison. Some recent datasets collected from the web and movies are more realistic in nature. Here, we discuss the widely used datasets for individual human activity recognition along with the common evaluation metrics.
1.5.1 Benchmark Datasets

Publicly available video datasets allow comparing the performance of different activity recognition methods and make it easier to understand the strengths and weaknesses of each of them. The most popular datasets used in the activity analysis domain are discussed here, and the type of sensor data provided with each dataset is mentioned. Details of some benchmark datasets are shown in Table 1.1.
1.5.1.1 KTH Human Action Dataset
KTH Human Action Dataset [68] consists of 6 activities ('boxing', 'handclapping', 'handwaving', 'jogging', 'running', and 'walking'). These activities are performed by 25 different subjects across 4 different scenarios: outdoors, outdoors with different clothes, outdoors with scale variations, and indoors with lighting variations. All of these scenarios have a static background. There are 2391 sequences, and the videos have a frame rate of 25 fps, a duration of 4 s, and a spatial resolution of 160 × 120. The dataset consists of gray-scale images.

Table 1.1 Summary of the popular datasets for human activity recognition

| Dataset | # Classes | # Samples | # Subjects | Avg. Duration (sec) | Resolution | Year |
|---|---|---|---|---|---|---|
| KTH | 6 | 600 | 25 | 4 | 160 × 120 | 2004 |
| Weizmann | 10 | 90 | 9 | – | 180 × 144 | 2005 |
| IXMAS | 11 | 1650 | 10 | – | 320 × 240 | 2007 |
| Hollywood | 8 | 233 | – | – | – | 2008 |
| HMDB-51 | 51 | 6766 | – | – | – | 2011 |
| UCF101 | 101 | 13320 | – | 7.21 | 320 × 240 | 2012 |
| MPII-Cooking | 65 | 44 | 12 | – | 1624 × 1224 | 2012 |
| Sports-1M | 487 | 1M | – | 336 | – | 2014 |
| ActivityNet | 100 | 4819 | – | 300–600 | 1280 × 720 | 2015 |
| YouTube-8M | 4800 | 8M | – | – | – | 2016 |
| Kinetics | 400 | 240000 | – | 10 | – | 2017 |
| AVA | 80 | 437 | – | 900 | – | 2018 |
1.5.1.2 Weizmann Human Action Dataset
Weizmann Human Action Dataset [151] from the Weizmann Institute consists of 10 activities ('run', 'walk', 'skip', 'jumping-jack', 'jump-forward-on-two-legs', 'jump-in-place-on-two-legs', 'gallop-sideways', 'wave-one-hand', 'wave-two-hands', 'bend') performed by 9 subjects against a static background. There are 90 sequences, and the videos have a frame rate of 50 fps and a spatial resolution of 180 × 144. The dataset consists of RGB images.
1.5.1.3 IXMAS Dataset
IXMAS Dataset [152] from Inria includes 14 activities (‘check watch’, ‘cross arms’, ‘scratch head’, ‘sit down’, ‘get up’, ‘turn around’, ‘walk’, ‘wave’, ‘punch’, ‘kick’, ‘point’, ‘pick up’, ‘throw over head’ and ‘throw from bottom up’) performed by 10 actors and recorded from 5 viewpoints with static background. The dataset consists of RGB images.
1.5.1.4 Hollywood Human Action Dataset
Hollywood human action dataset [15] contains 8 activities ('answer phone', 'get out of car', 'handshake', 'hug', 'kiss', 'sit down', 'sit up' and 'stand up') extracted from 32 movies with various actors and dynamic background. It has two training sets: one manually labeled and another automatically annotated using scripts. The dataset consists of RGB and gray-scale images.
1.5.1.5 HMDB-51 Dataset
HMDB-51 Dataset [153] or Human Motion Database consists of 51 activities and 7000 manually annotated clips extracted from movies and YouTube. The actions can be divided into five categories: general facial actions (‘smile’, ‘talk’ etc.), facial actions with object manipulation (‘smoke’, ‘drink’, etc.), general body movements (‘climb’, ‘drive’, ‘run’ etc.), body movements with object interaction (‘brush hair’, ‘kick ball’ etc.), and body movements for human interaction (‘hug’, ‘punch’ etc.). The dataset consists of RGB and gray-scale images.
1.5.1.6 UCF101 Dataset
UCF101 [154] consists of 101 action categories ('baby crawling', 'horse riding', 'diving', 'brushing teeth' etc.) and 13320 videos collected from YouTube. The actions can be divided into 5 types: human-object interaction, body-motion only, human-human interaction, playing musical instruments, and sports. It is a very challenging dataset since there are large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc. The frame rate and resolution are 25 fps and 320 × 240, respectively. The dataset consists of RGB images.
1.5.1.7 MPII-Cooking Dataset
MPII-Cooking Dataset [32] is a fine-grained complex activity dataset involving the different tools, ingredients, and containers required to complete various recipes. It consists of 65 different cooking activities ('cut slices', 'squeeze', 'take out from fridge' etc.) performed by 12 actors preparing 14 recipes in 44 videos. The dataset contains 5609 annotations, and each video includes multiple activities. The videos have a resolution of 1624 × 1224. The dataset consists of RGB images.
1.5.1.8 Sports-1M Dataset
Sports-1M Dataset [19] is a large-scale video classification dataset containing 487 activity classes and 1 million videos collected from YouTube. The classes are arranged in a custom taxonomy whose internal nodes include aquatic sports, team sports, winter sports, ball sports, combat sports, sports with animals, etc. The duration of the videos is 5 min and 36 s on average. The dataset consists of RGB images.
1.5.1.9 ActivityNet Dataset
ActivityNet [155] is a large-scale human activity understanding dataset. ActivityNet 1.2 consists of 100 activity classes ('ballet', 'hockey', 'shaving' etc.) and has an average of 1.5 temporal activity segments per video. The dataset has 4819 training videos, 2383 validation videos, and 2480 test videos collected from YouTube. The actions can be divided into five categories: eating and drinking activities; sports, exercise, and recreation; socializing, relaxing, and leisure; personal care; and household activities. The videos are between 5 and 10 min long on average. Around half of them have a resolution of 1280 × 720 and a frame rate of 30 fps. The dataset consists of RGB images.
1.5.1.10 YouTube-8M Dataset
YouTube-8M [156] is a large-scale video classification benchmark dataset including 8 million videos and 4800 activity classes collected from YouTube. The average number of classes per video is 1.8. The dataset consists of RGB and gray-scale images.
1.5.1.11 Kinetics Human Action Dataset
Kinetics Human Action Dataset [157] is a challenging human activity recognition dataset with 400 classes (‘hugging’, ‘washing dishes’, ‘drawing’ etc.) and around 240000 training videos collected from YouTube. The clips have an average duration of 10s. The activities can be categorized as: person actions, person-person actions and person-object actions. The dataset consists of RGB images.
1.5.1.12 AVA Dataset
AVA (Atomic Visual Actions) Dataset [158] is a challenging activity classification dataset providing person-centric annotations for atomic visual activities. It contains 80 classes and 437 15-minute videos (1.59 M instances). Each action is spatio-temporally localized, and each individual is linked across consecutive keyframes, resulting in short temporal sequences of activities. The dataset consists of RGB images.

Apart from the above-mentioned datasets, there are several other activity analysis datasets such as the UCF Sports Dataset [159], UCF50 Dataset [160], 50 Salads Dataset [161], Breakfast Dataset [162], VIRAT Ground Dataset [163], Charades Dataset [164], etc. A discussion of all of these datasets is beyond the scope of this chapter. To provide a clearer view, a summary of the above-mentioned datasets is presented in Table 1.1.
Table 1.2 Recognition accuracies of some state-of-the-art methods on benchmark datasets

| Method | Year | KTH | Weizmann | HMDB51 | UCF-101 | Sports1M | ActivityNet | Kinetics |
|---|---|---|---|---|---|---|---|---|
| [41] | 2018 | – | – | 75.9 | 96.8 | – | – | 77.2 |
| [40] | 2018 | – | – | 78.7 | 97.3 | – | – | 75.4 |
| [39] | 2018 | – | – | 70.9 | 94.3 | – | – | 72.4 |
| [38] | 2018 | – | – | 82.1 | 98.0 | – | – | – |
| [37] | 2018 | – | – | 81.3 | – | – | – | – |
| [24] | 2018 | – | – | 55.3 | 87.5 | – | – | – |
| [35] | 2017 | – | – | 87.6 | 91.2 | – | – | – |
| [34] | 2017 | – | – | – | 93.7 | 66.4 | 75.1 | – |
| [33] | 2017 | – | – | 67.2 | 92.7 | – | – | – |
| [28] | 2017 | – | – | 63.5 | 93.2 | – | – | – |
| [27] | 2017 | – | – | 80.2 | 97.9 | – | – | 74.2 |
| [26] | 2016 | – | – | 65.4 | 92.5 | – | – | – |
| [25] | 2016 | – | – | 69.4 | 94.2 | – | – | – |
| [59] | 2016 | – | – | 60.1 | 86.0 | – | – | – |
| [11] | 2016 | – | – | 61.1 | 87.9 | – | – | – |
| [31] | 2015 | 98.0 | – | – | – | – | – | – |
| [29] | 2015 | – | – | – | 82.66 | – | – | – |
| [22] | 2015 | – | – | 65.9 | 91.5 | – | – | – |
| [21] | 2015 | – | – | – | 85.2 | 61.1 | – | – |
| [165] | 2015 | – | – | 62.1 | 87.3 | – | – | – |
| [20] | 2014 | – | – | 59.4 | 88.0 | – | – | – |
| [19] | 2014 | – | – | – | 65.4 | 63.9 | – | – |
| [60] | 2014 | – | – | 66.79 | – | – | – | – |
| [30] | 2014 | 96.6 | – | – | – | – | – | – |
| [57] | 2013 | – | – | 57.2 | – | – | – | – |
| [18] | 2013 | 91.7 | – | – | – | – | – | – |
| [17] | 2011 | 94.4 | – | – | – | – | – | – |
| [87] | 2009 | – | 100.0 | – | – | – | – | – |
| [8] | 2009 | – | 94.4 | – | – | – | – | – |
| [2] | 2008 | 91.33 | 92.89 | – | – | – | – | – |
| [4] | 2008 | 91.4 | 84.3 | – | – | – | – | – |
| [15] | 2008 | 91.8 | – | – | – | – | – | – |
| [7] | 2007 | – | 72.8 | – | – | – | – | – |
1.5.2 Performance Evaluation Criteria

To evaluate the performance of different activity recognition algorithms on the publicly available datasets, some standard metrics are used in the research community. The most common evaluation metrics are accuracy, precision, recall, F1 score, average precision, and the confusion matrix. The recognition accuracy is the ratio of the number of correctly recognized samples to the total number of samples. If the dataset is imbalanced, the F1 score, which is the harmonic mean of precision and recall, provides a clearer picture. The confusion matrix is a more detailed metric since it indicates class-wise performance. Recently, the computational cost or inference time of the algorithms has also become an important metric, and researchers are focusing on real-time performance as well. The recognition accuracies of some of the state-of-the-art methods on the popular datasets are presented in Table 1.2.
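For reference, all of these metrics are readily available in scikit-learn; the minimal sketch below uses hypothetical label arrays purely for illustration.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [0, 0, 1, 2, 2, 2]          # ground-truth activity labels (hypothetical)
y_pred = [0, 1, 1, 2, 2, 0]          # predicted labels

print("Accuracy:", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall:", recall_score(y_true, y_pred, average="macro"))
print("F1:", f1_score(y_true, y_pred, average="macro"))   # harmonic mean per class, macro-averaged
print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
```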
1.6 Applications of Human Activity Recognition

Human activity recognition has versatile applications in different real-life domains because of the diversity among activity classes. Robust recognition of activities is crucial for sectors including, but not limited to, surveillance, autonomous navigation, robotics, healthcare, and entertainment. These applications and the methods specifically designed for them are discussed here.
1.6.1 Surveillance System

Surveillance systems generally include a camera network with a number of cameras, and content analysis traditionally requires continuous human monitoring. Activity recognition automates the process of tracking people and understanding their motives, which facilitates the detection of suspicious persons and criminal activities by triggering immediate alerts. Such intelligent surveillance systems relieve the authorities of continuous human monitoring and reduce the workload. There have been a number of works on tracking and detection in surveillance systems [166–171]. Anomalous activity recognition is another important application in surveillance videos, which has motivated a number of research works [172–178]; recognizing abnormal events can prevent crimes like kidnapping, drug-dealing, robbery, etc. Another major application of activity recognition in the surveillance domain is gait-based person identification, which reduces the need for physical touch-based biometric authentication systems.
1.6.2 Healthcare System and Assisted Living

Activity recognition has a vital role in assisted living for seniors and disabled persons, ensuring their safety and an improved quality of life. By continuously monitoring and analyzing the activities of an individual in a smart home, such systems can alert the authorities when there is an anomaly, for example if the patient is having a heart attack or stroke or has not responded for a long time. There are video monitoring methods that remind patients to take their medicines, and some systems detect abnormal respiratory behavior as well as dangerous situations like falls. There have been a number of works on activity recognition in the domain of assisted living [179–181]; some of them are specifically designed for fall detection [182, 183]. Activity recognition also helps to guide visually impaired persons in their daily movements, and monitoring systems are available in swimming pools for drowning detection.
1.6.3 Entertainment

A growing amount of research has been conducted on sports activity recognition [67, 184, 185], which helps to automatically detect meaningful incidents in sports, for example differentiating between tennis strokes such as the forehand and the backhand. Active gaming also benefits from activity recognition, and there have been works on recognizing dance forms as well.
1.6.4 Autonomous Driving

In autonomous navigation and Advanced Driver Assistance Systems (ADAS), activity recognition can help with crash detection, driver distraction detection, drowsiness detection, and so on. There have been a number of works on driver activity recognition [186, 187], some of them specifically focused on drowsiness detection [188–193]. Apart from the above-mentioned applications, there are several other applications of human activity recognition in different domains of daily life, which have inspired researchers to focus more on this important task.
1.6.5 Human-Robot Interaction (HRI)

In active sensing, robots need to understand the surrounding activities to act on their own in both domestic [194] and industrial [195] environments. Some egocentric activity recognition and prediction methods have been proposed in [196, 197] so that robots can understand human intentions and behavior.
1.7 Future Research Directions

1.7.1 Zero-Shot/Few-Shot Learning

The current state-of-the-art activity recognition methods mostly depend on data-hungry deep models trained on large hand-labeled video datasets. However, it is not practical or feasible to label such a large number of videos for various tasks across different domains. An increasing challenge in the domain of activity analysis is to train models with limited labeled training data. There have been some activity classification works with limited supervision, for example zero-shot or few-shot recognition methods [198–204]. Such approaches should be explored further to achieve better performance at lower annotation cost.
1.7.2 Computational Efficiency

Most of the existing state-of-the-art methods for activity recognition work offline and are not suitable for real-time inference. Many existing models use optical flow for motion estimation, which is computationally expensive. There has been an attempt [24] to replace optical flow with simpler motion vectors and to incorporate teacher-student learning to achieve real-time recognition while preserving reasonable performance. In [205], Semantic Texton Forests (STFs) were applied to utilize appearance and structural information for real-time action recognition. In [41], 3D convolutions at the bottom of the network were replaced by 2D convolutions, and temporally separable convolution and spatio-temporal feature gating were used for faster inference. Since time-efficiency is a crucial requirement for many real-world applications, finding real-time solutions for action recognition has generated increasing interest in the research community, and there is still a lot of scope for future research.
1.7.3 Analysis of Temporal Context

Some frames in a video are more important than others for activity recognition [206, 207], which calls for a clear understanding of the temporal context in videos. Reducing the number of required frames can sometimes improve speed with little or no reduction in performance. For some activities the initial frames are more informative than the rest, and for others it is the opposite. Interpretability of video models is a scarcely explored research problem that could answer questions such as why the initial frames are more vital for some activities, which salient signals are more important, and so on. Such exploration would lead to more efficient recognition models.
1.7.4 Egocentric Activity Recognition

Egocentric (first-person) activity recognition is vital for applications in robotics, human-computer interaction, etc. Egocentric videos provide a first-person perspective using a forward-facing wearable camera. There are additional challenges associated with egocentric action recognition; for example, videos can be noisy because of the sharp ego-motion introduced by the camera wearer. There have been some research works focused on egocentric activity recognition [208–217], but very few large-scale datasets (e.g., EPIC-Kitchens [218]) are designed for this task. Given the impact of the task, more relevant datasets and research efforts are essential to improve the state-of-the-art performance.
1.7.5 Fine-Grained Activity Recognition

Fine-grained activity recognition is useful for many applications like human-robot interaction, surveillance, industrial automation, assistive technologies, automated skill assessment, surgical training, etc. Such activities often have large intra-class variation and small inter-class variation based on different actors, styles, speed, and order. For example, 'cut apart' and 'cut slices' are both closely related fine-grained actions under the broad activity 'prepare recipe'. Recognizing such activities is more challenging compared to full-body activities like 'walking' or 'swimming'. There have been works focused on fine-grained activity recognition [219–221] and there is huge potential for further research.
1.8 Conclusions

We have presented a comprehensive survey on vision-based human activity recognition (HAR). Given the large amount of video data available online and its diverse applications in different real-life domains, vision-based human activity recognition has become an active research topic aimed at making this large amount of data semantically meaningful. The purpose of this survey is to provide an overview of the different aspects of the recognition problem. We discussed the challenges relevant to the task and some of its important applications. We explored different vision-based activity recognition approaches that use either hand-crafted features or deep features. We briefly discussed some relevant topics such as the difference between detection and recognition, group and contextual activity recognition, activity prediction, and activity recognition from still images. We summarized the benchmark activity recognition datasets and compared the performance of different state-of-the-art methods on these datasets. Finally, we have given some thoughts on possible future research directions based on current limitations and demand. We hope that this survey will motivate new research efforts for the advancement of the field and serve as a comprehensive review document.
References 1. Bobick, A.F., Davis, J.W.: The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell. 3, 257–267 (2001) 2. Zhang, Z., Hu, Y., Chan, S., Chia, L.T.: Motion context: a new representation for human action recognition. In: European Conference on Computer Vision, pp. 817–829 (2008) 3. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 886–893 (2005) 4. Klaser, A., Marszalek, M., Schmid, C.: A spatio-temporal descriptor based on 3D-gradients. In: British Machine Vision Conference, pp. 1–10 (2008) 5. Somasundaram, G., Cherian, A., Morellas, V., Papanikolopoulos, N.: Action recognition using global spatio-temporal features derived from sparse representations. Comput. Vision Image Underst. 123, 1–13 (2014) 6. Laptev, I.: On space-time interest points. Int. J. Comput. Vision 64(2–3), 107–123 (2005) 7. Niebles, J. C., Fei-Fei, L.: A hierarchical model of shape and appearance for human action classification. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007) 8. Chaudhry, R., Ravichandran, A., Hager, G., Vidal, R.: Histograms of oriented optical flow and binet-cauchy kernels on nonlinear dynamical systems for the recognition of human actions. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1932–1939 (2009) 9. Chakraborty, B., Holte, M.B., Moeslund, T.B., Gonzàlez, J.: Selective spatio-temporal interest points. Comput. Vision Image Underst. 116(3), 396–410 (2012) 10. Nguyen, T.V., Song, Z., Yan, S.: STAP: Spatial-temporal attention-aware pooling for action recognition. IEEE Trans. Circuit Syst. Video Technol. 25(1), 77–86 (2014) 11. Peng, X., Wang, L., Wang, X., Qiao, Y.: Bag of visual words and fusion methods for action recognition: comprehensive study and good practice. Comput. Vision Image Underst. 150, 109–125 (2016) 12. Nazir, S., Yousaf, M.H., Velastin, S.A.: Evaluating a bag-of-visual features approach using spatio-temporal features for action recognition. Comput. Electr. Eng. 72, 660–669 (2018) 13. Duong, T.V., Bui, H.H., Phung, D.Q., Venkatesh, S.: Activity recognition and abnormality detection with the switching hidden semi-markov model. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 838–845 (2005) 14. Yamato, J., Ohya, J., Ishii, K.: Recognizing human action in time-sequential images using hidden markov model. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 379–385 (1992) 15. Laptev, I., Marszałek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008) 16. Kumari, S., Mitra, S.K.: Human action recognition using DFT. In: IEEE National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics, pp. 239–242 (2011) 17. Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., Baskurt, A.: Sequential deep learning for human action recognition. In: International workshop on human behavior understanding, pp. 29–39. Springer (2011) 18. Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013) 19. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732 (2014) 20. 
Simonyan, K., Zisserman, A: Two-stream convolutional networks for action recognition in videos. In: Neural Information Processing Systems Conference, pp. 568–576 (2014) 21. Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: IEEE International Conference on Computer Vision, pp. 4489–4497 (2015)
22. Wang, L., Qiao, Y., Tang, X.: Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4305–4314 (2015) 23. Lan, Z., Zhu, Y., Hauptmann, A.G., Newsam, S.: Deep local video feature for action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–7 (2017) 24. Zhang, B., Wang, L., Wang, Z., Qiao, Y., Wang, H.: Real-time action recognition with deeply transferred motion vector CNNs. IEEE Trans. Image Process. 27, 2326–2339 (2018) 25. Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Gool, L.V.: Temporal segment networks: towards good practices for deep action recognition. In: European conference on computer vision, pp. 20–36 (2016) 26. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1933–1941 (2016) 27. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4724–4733 (2017) 28. Diba, A., Fayyaz, M., Sharma, V., Karami, A.H., Arzani, M.M., Yousefzadeh, R., Van Gool, R.: Temporal 3d convnets: new architecture and transfer learning for video classification. arXiv:1711.08200 (2017) 29. Donahue, J., Hendricks, L.A., Rohrbach, M., Venugopalan, S., Guadarrama, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. IEEE Trans. Pattern Anal. Mach. Intell. 39, 677–691 (2017) 30. Hasan, M., Roy-Chowdhury, A.K.: Continuous learning of human activity models using deep nets. In: European Conference on Computer Vision, pp. 705–720 (2014) 31. Hasan, M., Roy-Chowdhury, A.K.: A continuous learning framework for activity recognition using deep hybrid feature models. IEEE Trans. Multime. 17(11), 1909–1922 (2015) 32. Rohrbach, M., Amin, S., Andriluka, M., Schiele, B.: A database for fine grained activity detection of cooking activities. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1194–1201 (2012) 33. Varol, G., Laptev, I., Schmid, C.: Long-term temporal convolutions for action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 40(6), 1510–1517 (2017) 34. Qiu, Z., Yao, T., Mei, T.: Learning spatio-temporal representation with pseudo-3d residual networks. In: IEEE International Conference on Computer Vision, pp. 5534–5542 (2017) 35. Ullah, A., Ahmad, J., Muhammad, K., Sajjad, M., Baik, S.W.: Action recognition in video sequences using deep bi-directional LSTM with CNN features. IEEE Access 6, 1155–1166 (2017) 36. Wang, X., Gao, L., Wang, P., Sun, X., Liu, X.: Two-stream 3-d convnet fusion for action recognition in videos with arbitrary size and length. IEEE Trans. Multimed. 20(3), 634–644 (2017) 37. Wang, J., Cherian, A., Porikli, F., Gould, S.: Video representation learning using discriminative pooling. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1149–1158 (2018) 38. Zhu, J., Zou, W., Zhu, Z.: End-to-end Video-Level Representation Learning for Action Recognition. arXiv:1711.04161 (2017) 39. Wang, L., Li, W., Li, W., Van Gool, L.: Appearance-and-relation networks for video classification. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1430–1439 (2018) 40. 
Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. In: IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459 (2018) 41. Xie, S., Sun, C., Huang, J., Tu, Z., Murphy, K.: Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In: European Conference on Computer Vision, pp. 305–321 (2018)
34
T. Mahmud and M. Hasan
42. He, D., Zhou, Z., Gan, C., Li, F., Liu, X., Li, Y., Wang, L., Wen, S.: Stnet: local and global spatial-temporal modeling for action recognition. AAAI Conf. Artif. Intell. 33, 8401–8408 (2019) 43. Poppe, R.: A survey on vision-based human action recognition. Image Vision Comput. 28(6), 976–990 (2010) 44. Ke, S.R., Thuc, H., Lee, Y.J., Hwang, J.N., Yoo, J.H., Choi, K.H.: A review on video-based human activity recognition. Computers 2(2), 88–131 (2013) 45. Sargano, A., Angelov, P., Habib, Z.: A comprehensive review on handcrafted and learningbased action representation approaches for human activity recognition. Appl. Sci. 7(1), 110 (2017) 46. Rodríguez-Moreno, I., Martínez-Otzeta, J.M., Sierra, B., Rodriguez, I., Jauregi, E.: Video activity recognition: state-of-the-art. Sensors 19(14), 3160 (2019) 47. Zhang, H.B., Zhang, Y.X., Zhong, B., Lei, Q., Yang, L., Du, X.J., Chen, D.S.: A comprehensive survey of vision-based human action recognition methods. Sensors 19(5), 1005 (2019) 48. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: real-time tracking of the human body. IEEE Trans. Pattern Anal. Mach. Intell. 19(7), 780–785 (1997) 49. Cucchiara, R., Grana, C., Piccardi, M., Prati, A.: Detecting moving objects, ghosts, and shadows in video streams. IEEE Trans. Pattern Anal. Mach. Intell. 25(10), 1337–1342 (2003) 50. Seki, M., Fujiwara, H., Sumi, K.: A robust background subtraction method for changing background. In: IEEE Workshop on Applications of Computer Vision, pp. 207–213 (2000) 51. Brendel, W., Todorovic, S.: Video object segmentation by tracking regions. In: IEEE International Conference on Computer Vision, pp. 833–840 (2009) 52. Yu, T., Zhang, C., Cohen, M., Rui, Y., Wu, Y.: Monocular video foreground/background segmentation by tracking spatial-color gaussian mixture models. In 2007 IEEE Workshop on Motion and Video Computing, pp. 5–5 (2007) 53. Daniilidis, K., Krauss, C., Hansen, M., Sommer, G.: Real-time tracking of moving objects with an active camera. Real-Time Imaging 4(1), 3–20 (1998) 54. Huang, C.M., Chen, Y.R., Fu, L.C.: Real-time object detection and tracking on a moving camera platform. In 2009 ICCAS-SICE, pp. 717–722 (2009) 55. Murray, D., Basu, A.: Motion tracking with an active camera. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 449–459 (1994) 56. Kim, K.K., Cho, S.H., Kim, H.J., Lee, J.Y.: Detecting and tracking moving object using an active camera. In The 7th International Conference on Advanced Communication Technology, 2005, ICACT 2005, vol. 2, pp. 817–820 (2005) 57. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: IEEE International Conference on Computer Vision, pp. 3551–3558 (2013) 58. Gaidon, A., Harchaoui, Z., Schmid, C.: Activity representation with motion hierarchies. Int. J. Comput. Vsion 107(3), 219–238 (2014) 59. Wang, H., Oneata, D., Verbeek, J., Schmid, C.: A robust and efficient video representation for action recognition. Int. J. Comput. Vision 119(3), 219–238 (2016) 60. Peng, X., Zou, C., Qiao, Y., Peng, Q.: Action recognition with stacked fisher vectors. In: European Conference on Computer Vision, pp. 581–595 (2014) 61. Hoang, L.U.T., Ke, S., Hwang, J., Yoo, J., Choi, K.: Human action recognition based on 3D body modeling from monocular videos. In: Frontiers of Computer Vision Workshop, pp. 6–13 (2012) 62. Hoang, L.U.T., Tuan, P.V., Hwang, J.: An effective 3D geometric relational feature descriptor for human action recognition. 
In: IEEE RIVF International Conference on Computing and Communication Technologies, Research, Innovation, and Vision for the Future, pp. 1–6 (2012) 63. Veeraraghavan, A., Roy-Chowdhury, A.K., Chellappa, R.: Matching shape sequences in video with applications in human movement analysis. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1896–1909 (2005) 64. Sempena, S., Maulidevi, N.U., Aryan, P.R.: Human action recognition using dynamic time warping. IEEE Int. Conf. Electr. Eng. Inf. (ICEEI) 17–19, 1–5 (2011)
1 Vision-Based Human Activity Recognition
35
65. Brand, M., Oliver, N., Pentland, A.: Coupled hidden Markov Models for complex action recognition. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pp. 994–999 (1997) 66. Hoang, L.U.T., Ke, S., Hwang, J., Tuan, P.V., Chau, T.N.: Quasi-periodic action recognition from monocular videos via 3D human models and cyclic HMMs. In: IEEE International Conference on Advanced Technologies for Communications, pp. 110–113 (2012) 67. Luo, Y., Wu, T., Hwang, J.: Object-based analysis and interpretation of human motion in sports video sequences by dynamic Bayesian networks. Comput. Vision Image Underst. 92, 196–216 (2003) 68. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. International Conference on Pattern Recognition, 2004. International Conference on Pattern Recognition, vol. 3, pp. 32–36 (2004) 69. Fiaz, M.K., Ijaz, B.: Vision based human activity tracking using artificial neural networks. In: IEEE International Conference on Intelligent and Advanced Systems, pp. 1–5 (2010) 70. Bodor, R., Jackson, B., Papanikolopoulos, N.: Vision-based human tracking and activity recognition. Mediterr. Conf. Control Autom. 1, 18–20 (2003) 71. Ribeiro, P.C., Santos-Victor, J.: Human activity recognition from video: modeling, feature selection and classification architecture. Int. Workshop Human Activity Recognit. Model. 1, 61–70 (2005) 72. Stauffer, C., Grimson, W.E.L.: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Mach. Intell. 22, 747–757 (2000) 73. Ben-Arie, J., Wang, Z., Pandit, P., Rajaram, S.: Human activity recognition using multidimensional indexing. IEEE Trans. Pattern Anal. Mach. Intell. 24, 1091–1104 (2002) 74. Hasan, M., Roy-Chowdhury, A.K.: Incremental activity modeling and recognition in streaming videos. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 796–803 (2014) 75. Hasan, M., Roy-Chowdhury, A.: Context aware active learning of activity recognition models. In: IEEE International Conference on Computer Vision, pp. 4543–4551 (2015) 76. Hasan, M., Paul, S., Mourikis, A.I., Roy-Chowdhury, A.K.: Context aware query selection for active learning in event recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018) 77. Singh, B., Marks, T.K., Jones, M., Tuzel, O., Shao, M.: A multi-stream bi-directional recurrent neural network for fine-grained action detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1961–1970 (2016) 78. Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: IEEE International Conference on Computer Vision, pp. 2914–2923 (2017) 79. Xu, H., Das, A., Saenko, K.: R-c3d: Region convolutional 3d network for temporal activity detection. In: IEEE international Conference on Computer Vision, pp. 5783–5792 (2017) 80. Chao, Y.W., Vijayanarasimhan, S., Seybold, B., Ross, D.A., Deng, J., Sukthankar, R.: Rethinking the faster r-cnn architecture for temporal action localization. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1130–1139 (2018) 81. Wang, W., Huang, Y., Wang, L.: Language-driven temporal activity localization: a semantic matching reinforcement learning model. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 334–343 (2019) 82. Scovanner, P., Ali, S., Shah., M.: A 3-dimensional sift descriptor and its application to action recognition. In: ACM international conference on Multimedia, pp. 
357–360 (2007) 83. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004) 84. Lu, W.L., Little, J.J.: Simultaneous tracking and action recognition using the PCA-HOG descriptor. In: Canadian Conference on Computer and Robot Vision, p. 6 (2006) 85. Thurau, C.: Behavior histograms for action recognition and human detection. In Human Motion-Understanding, Modeling, Capture and Animation. Springer, pp. 299–312 (2007)
36
T. Mahmud and M. Hasan
86. Hatun, K., Duygulu, P.: Pose sentences: a new representation for action recognition using sequence of pose words. In: International Conference on Pattern Recognition, pp. 1–4 (2008) 87. Chen, C.C., Aggarwal, J.: Recognizing human action from a far field of view. Workshop on Motion and Video Computing, pp. 1–7 (2009) 88. Harris, C., Stephens, M.: A combined corner and edge detector, Alvey Vision Conference, pp. 147–151 (1988) 89. Veeraraghavan, A., Chellappa, R., Roy-Chowdhury, A.K.: The function space of an activity. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit. 1, 959–968 (2006) 90. Yao, B., Zhu, S.C.: Learning deformable action templates from cluttered videos. In: IEEE International Conference on Computer Vision, pp. 1507–1514 (2009) 91. Guo, G., Lai, A.: A survey on still image based human action recognition. Pattern Recognit. 47(10), 3343–3361 (2014) 92. Li, P., Ma, J.: What is happening in a still picture? In: IEEE Asian Conference on Pattern Recognition, pp. 32–36 (2011) 93. Wang, Y., Jiang, H., Drew, M., Li, Z.-N., Mori, G.: Unsupervised discovery of action classes. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1654–1661 (2006) 94. Thurau, C., Hlavac, V.: Pose primitive based human action recognition in videos or still images. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2008) 95. Yao, B., Khosla, A., Fei-Fei, L.: Combining randomization and discrimination for fine-grained image categorization. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1577–1584 (2011) 96. Yao, B., Jiang, x., Khosla, A., Lin, A., Guibas, L., Fei-Fei, L.: Human action recognition by learning bases of action attributes and parts. In: IEEE International Conference on Computer Vision, pp. 1331–1338 (2011) 97. Delaitre, V., Laptev, I., Sivic, J.: Recognizing human actions in still images: a study of bag-offeatures and part-based representations. In: British Machine Vision Conference, p. 7 (2010) 98. Maji, S., Bourdev, L., Malik, J.: Action recognition from a distributed representation of pose and appearance. iN: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3177–3184 (2011) 99. Zheng, Y., Zhang, Y.-J., Li, X., Liu, B.-D.: Action recognition in still images using a combination of human pose and context information. In: IEEE International Conference on Image Processing, pp. 785–788 (2012) 100. Raja, K., Laptev, I., Pérez, P., Oisel, L.: Joint pose estimation and action recognition in image graphs. In: International Conference on Image Processing, pp. 25–28 (2011) 101. Yao, B., Fei-Fei, L.: Action recognition with exemplar based 2.5 d graph matching. In: European Conference on Computer Vision Workshops and Demonstrations, pp. 173–186 (2012) 102. Sener, F., Bas, C., Ikizler-Cinbis, N.: On recognizing actions in still images via multiple features. In: European Conference on Computer Vision Workshops and Demonstrations, pp. 263–272 (2012) 103. Yao, L., Fei-Fei, L.: Modeling mutual context of object and human pose in human-object interaction activities. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 17–24 (2010) 104. Delaitre, V., Sivic, J., Laptev, I.: Learning person-object interactions for action recognition in still images. Advances in Neural Information Processing Systems. MIT Press (2011) 105. Yao, B., Fei-Fei, L.: Recognizing human-object interactions in still images by modeling the mutual context of objects and human poses. 
In: IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1691–1703 (2012) 106. Prest, A., Schmid, C., Ferrari, V.: Weakly supervised learning of interactions between humans and objects, EEE Transactions on Pattern Analysis and Machine Intelligence, pp. 601–614 (2012) 107. Desai, C., Ramanan, D., Fowlkes, C.: Discriminative models for static humanobject interactions. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 9–16 (2010)
1 Vision-Based Human Activity Recognition
37
108. Abidi, S., Piccardi, M., Williams, M.A.: Action recognition in still images by latent superpixel classification. arXiv:1507.08363 (2015) 109. Sharma, G., Jurie, F., Schmid, C.: Expanded parts model for human attribute and action recognition in still images. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 652–659 (2013) 110. Gkioxari, G., Girshick, R., Malik, J.: Contextual action recognition with r* cnn. In: IEEE International Conference on Computer Vision, pp. 1080–1088 (2015) 111. Liu, L., Tan, R.T., You, S.: Loss guided activation for action recognition in still images. In: Asian Conference on Computer Vision, pp. 152–167 (2018) 112. Sreela, S.R., Idicula, S.M.: Action recognition in still images using residual neural network features. Proced. Comput. Sci. 143, 563–569 (2018) 113. Wang, Y., Zhou, L., Qiao, Y.: Temporal hallucinating for action recognition with few still images. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5314–5322 (2018) 114. Zhao, Z., Ma, H., You, S.: Single image action recognition using semantic body part actions. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3391–3399 (2017) 115. Gao, R., Xiong, B., Grauman, K.: Im2flow: motion hallucination from static images for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5937– 5947 (2018) 116. Brendel, W., Todorovic, S.: Learning spatiotemporal graphs of human activities. In: International Conference on Computer Vision, pp. 778–785 (2011) 117. Wu, J., Wang, L., Wang, L., Guo, J., Wu, G.: Learning actor relation graphs for group activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 9964– 9974 (2019) 118. Azar, S.M., Atigh, M.G., Nickabadi, A., Alahi, A.: Convolutional relational machine for group activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 7892–7901 (2019) 119. Lan, T., Sigal, L., Mori, G.: Social roles in hierarchical models for human activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1354–1361 (2012) 120. Ramanathan, V., Yao, B., Fei-Fei, L.: Social role discovery in human events. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2475–2482 (2013) 121. Ryoo, M.S., Aggarwal, J.K.: Stochastic representation and recognition of high-level group activities. Int. J. Comput. Vision 93(2), 183–200 (2011) 122. Zhu, Y., Nayak, N., Roy-Chowdhury, A.: Context-aware modeling and recognition of activities in video. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2491–2498 (2013) 123. Choi, W., Shahid, K., Savarese, S.: What are they doing?: Collective activity classification using spatio-temporal relationship among people. In: IEEE International Conference on Computer Vision Workshops, pp. 1282–1289 (2009) 124. Amer, M.R., Lei, P., Todorovic, S.: Hirf: Hierarchical random field for collective activity recognition in videos. In: European Conference on Computer Vision, 2014. pp. 572–585. Springer (2014) 125. Ibrahim, M.S., Mori, G.: Hierarchical relational networks for group activity recognition and retrieval. In: European Conference on Computer Vision, pp. 742–758 (2018) 126. Lan, T., Yang, W., Weilong, Y., Mori, G.: Beyond actions: discriminative models for contextual group activities. Advances in Neural Information Processing Systems (2010) 127. 
Lan, T., Wang, Y., Yang, W., Robinovitch, S., Mori, G.: Discriminative latent models for recognizing contextual group activities. IEEE Trans. Pattern Anal. Mach. Intell. 34(8), 1549– 1562 (2012) 128. Choi, W., Savarese, S.: A unified framework for multitarget tracking and collective activity recognition. In: European Conference on Computer Vision, pp. 215–230. Springer (2012) 129. Amer, M.R., Xie, D., Zhao, M., Todorovic, S., Zhu, S.: Cost-sensitive top-down / bottom-up inference for multiscale activity recognition. In: European Conference on Computer Vision, pp. 187–200 (2012)
38
T. Mahmud and M. Hasan
130. Gupta, A., Srinivasan, P., Shi, J., Davis, L.S.: Understanding videos, constructing plots: Learning a visually grounded storyline model from annotated videos. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2012–2019 (2009) 131. Shu, T., Xie, D., Rothrock, B., Todorovic, S., Zhu, S.-C.: Joint inference of groups, events and human roles in aerial videos. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4576–4584 (2015) 132. Tompson, J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: Advances in Neural Information Processing Systems, pp. 1799–1807 (2014) 133. Ibrahim, M.S., Muralidharan, S., Deng, Z., Vahdat, A., Mori, G.: A hierarchical deep temporal model for group activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition 1971–1980 (2016) 134. Deng, Z., Zhai, M., Chen, L., Liu, Y., Muralidharan, S., Roshtkhari, M., Mori, G.: Deep structured models for group activity recognition. In: British Machine Vision Conference (2015) 135. Qi, M., Qin, J., Li, A., Wang, Y., Luo, J., Gool, L.V.: Stagnet: an attentive semantic RNN for group activity recognition. In: European Conference on Computer Vision, pp. 104–120 (2018) 136. Wang, M., Ni, B., Yang, X.: Recurrent modeling of interaction context for collective activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 7408– 7416 (2017) 137. Bagautdinov, T.M., Alahi, A., Fleuret, F., Fua, P., Savarese, S.: Social scene understanding: End-to-end multi-person action localization and collective activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3425–3434 (2017) 138. Shu, T., Todorovic, S., Zhu, S.: CERN: confidence-energy recurrent network for group activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 4255– 4263 (2017) 139. Li, X., Chuah, M.C.: SBGAR: semantics based group activity recognition. In: IEEE International Conference on Computer Vision, pp. 2895–2904 (2017) 140. Li, K., Fu, Y.: Prediction of human activity by discovering temporal sequence patterns. IEEE Trans. Pattern Anal. Mach. Intell. 36(8), 1644–1657 (2014) 141. Wei, X., Lucey, P., Vidas, S., Morgan, S., Sridharan, S: Forecasting events using an augmented hidden conditional random field. In: Asian Conference on Computer Vision, pp. 569–582 (2014) 142. Huang, D.A., Kitani, K.M.: Action-reaction: forecasting the dynamics of human interaction. In: European Conference on Computer Vision, pp. 489–504 (2014) 143. Lan, T., Chen, T.C., Savarese, S.: A hierarchical representation for future action prediction. In: European Conference on Computer Vision, pp. 689–704 (2014) 144. Kitani, K.M., Ziebart, B.D., Bagnell, J.A., Hebert, M.: Activity forecasting. In: European Conference on Computer Vision, pp. 201–214 (2012) 145. Chakraborty, A., Roy-Chowdhury, A.: Context-aware activity forecasting. In: Asian Conference on Computer Vision, pp. 21–36 (2014) 146. Rhinehart, N., Kitani, K.M.: First-person activity forecasting with online inverse reinforcement learning. In: IEEE International Conference on Computer Vision, pp. 3696–3705 (2017) 147. Mahmud, T., Hasan, M., Roy-Chowdhury, A.K.: Joint prediction of activity labels and starting times in untrimmed videos. In: IEEE International Conference on Computer Vision, pp. 5773– 5782 (2017) 148. Abu Farha, Y., Richard, A., Gall, J.: When will you do what?-anticipating temporal occurrences of activities. 
In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5343–5352 (2018) 149. Liang, J., Jiang, L., Niebles, J.C., Hauptmann, A.G., Fei-Fei, L.: Peeking into the future: predicting future person activities and locations in videos. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 5725–5734 (2019) 150. Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating the future by watching unlabeled video. arXiv:1504.08023 (2015)
1 Vision-Based Human Activity Recognition
39
151. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: IEEE International Conference on Computer Vision. Volume 1, vol. 2, pp. 1395–1402 (2005) 152. Weinland, D., Boyer, E., Ronfard, R: Action recognition from arbitrary views using 3d exemplars (2007) 153. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: International Conference on Computer Vision, pp. 2556–2563 (2011) 154. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402 (2012) 155. Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: Activitynet: a large-scale video benchmark for human activity understanding. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970 (2015) 156. Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: Youtube-8m: a large-scale video classification benchmark. arXiv:1609.08675 (2016) 157. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., Suleyman, M., Zisserman, A.: The kinetics human action video dataset. arXiv:1705.06950 (2017) 158. Gu, C., Sun, C., Ross, D.A., Vondrick, C., Pantofaru, C., Li, Y., Vijayanarasimhan, S., Toderici, G., Ricco, S., Sukthankar, R., Schmid, C., Malik, J.: AVA: a video dataset of spatio-temporally localized atomic visual actions. In: IEEE Conference on Computer Vision and Pattern Recognition (2018) 159. Rodriguez, M., Ahmed, J., Shah, M.: Action mach: a spatiotemporal maximum average correlation height lter for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2008) 160. Reddy, K., Shah, M.: Recognizing 50 human action categories of web videos. Mach. Vision Appl. J (2012) 161. Stein, S., McKenna, S.J.: Combining embedded accelerometers with computer vision for recognizing food preparation activities. In: ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 729–738 (2013) 162. Kuehne, H., Arslan, A., Serre, T.: The language of actions: recovering the syntax and semantics of goal-directed human activities. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 780–787 (2014) 163. Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C.J., Lee, J.T., Mukherjee, S., Aggarwal, J., Lee, H., Davis, L. et al.: A largescale benchmark dataset for event recognition in surveillance video. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 3153–3160 (2011) 164. Sigurdsson, G., Varol, G., Wang, X., Farhadi, A., Laptev, I., Gupta, A.: Hollywood in homes: crowdsourcing data collection for activity understanding. In: European Conference on Computer Vision, pp. 510–526 (2016) 165. Lan, Z., Lin, M., Li, X., Hauptmann, A.G., Raj, B.: Beyond gaussian pyramid: multi-skip feature stacking for action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 204–212 (2015) 166. Lizhong, L., Zhiguo, L., Yubin, Z.: Research on detection and tracking of moving target in intelligent video surveillance. Int. Conf. Comput. Sci. Electron. Eng. 3, 477–481 (2012) 167. Kratz, L., Nishino, K.: Tracking pedestrians using local spatio-temporal motion patterns in extremely crowded scenes. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 987–1002 (2011) 168. 
Samarathunga, S.H.J.N., Wijesinghe, M.U.M.G.C.P.K., Kumara, R.P.S.R., Thambawita., D.R.V.L.B.: Person re-identification and tracking for surveillance camera systems (2019) 169. Sharma, A., Pathak, M., Tripathi, A., Vijay, P., Jain, S.A.: Automated human detection and tracking for surveillance applications (2019) 170. Hussain, M., Kharat, G.: Person detection and tracking using sparse matrix measurement for visual surveillance. In: International Conference on Data Engineering and Communication Technology, pp. 281–293. Springer (2017)
40
T. Mahmud and M. Hasan
171. Gajjar, V., Gurnani, A., Khandhediya, Y.: Human detection and tracking for video surveillance: a cognitive science approach. In: IEEE International Conference on Computer Vision, pp. 2805–2809 (2017) 172. Sfar, H., Ramoly, N., Bouzeghoub, A., Finance, B.: CAREDAS: context and activity recognition enabling detection of anomalous situation. In: Conference on Artificial Intelligence in Medicine in Europe, pp. 24–36 (2017) 173. Sfar, H., Bouzeghoub, A.: Activity recognition for anomalous situations detection. IRBM (2018) 174. Khatrouch, M., Gnouma, M., Ejbali, R., Zaied, M: Deep learning architecture for recognition of abnormal activities. In Tenth International Conference on Machine Vision, vol. 10696, p. 106960F, International Society for Optics and Photonics (2018) 175. Singh, D., Mohan, C.K.: Graph formulation of video activities for abnormal activity recognition. Pattern Recognit. 65, 265–272 (2017) 176. Xu, D., Yan, Y., Ricci, E., Sebe, N.: Detecting anomalous events in videos by learning deep representations of appearance and motion. Comput. Vision Image Underst. 156, 117–127 (2017) 177. Colque, R.V.H.M., Caetano, C., de Andrade, M.T.L., Schwartz, W.R.: Histograms of optical flow orientation and magnitude and entropy to detect anomalous events in videos. IEEE Trans. Circuits Syst. Video Technol. 27(3), 673–682 (2016) 178. Sadeghi-Tehran, P., Angelov, P.: A real-time approach for novelty detection and trajectories analysis for anomaly recognition in video surveillance systems. In: IEEE Conference on Evolving and Adaptive Intelligent Systems, pp. 108–113 (2012) 179. Chernbumroong, S., Cang, S., Atkins, A., Yu, H.: Elderly activities recognition and classification for applications in assisted living. Expert Syst. Appl. 40(5), 1662–1674 (2013) 180. Rafferty, J., Nugent, C.D., Liu, J., Chen, L.: From activity recognition to intention recognition for assisted living within smart homes. IEEE Trans. Hum. Mach. Syst. 47(3), 368–379 (2017) 181. Fleck, S., Straßer, W.: Smart camera based monitoring and its application to assisted living. Proc. IEEE 96(10), 1698–1714 (2008) 182. Chua, J.L., Chang, Y.C., Lim, W.K.: A simple vision-based fall detection technique for indoor video surveillance. Signal Image Video Process. 9(3), 623–633 (2015) 183. Feng, W., Liu, R., Zhu, M.: Fall detection for elderly person care in a vision-based home surveillance environment using a monocular camera. Signal, image and video processing 8(6), 1129–1138 (2014) 184. Swears, E., Hoogs, A., Ji, Q., Boyer, K: Complex activity recognition using granger constrained dbn (gcdbn) in sports and surveillance video. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 788–795 (2014) 185. Vallim, R.M., Andrade Filho, J.A., De Mello, R.F., De Carvalho, A.C.: Online behavior change detection in computer games. Expert Syst. Appl. 40(16), 6258–6265 (2013) 186. Braunagel, C., Kasneci, E., Stolzmann, W., Rosenstiel, W.: Driver-activity recognition in the context of conditionally autonomous driving. In: International Conference on Intelligent Transportation Systems, pp. 1652–1657 (2015) 187. Ohn-Bar, E., Martin, S., Tawari, A., Trivedi, M.M.: Head, eye, and hand patterns for driver activity recognition. In: International Conference on Pattern Recognition, pp. 660–665 (2014) 188. García, I., Bronte, S., Bergasa, L.M., Hernández, N., Delgado, B., Sevillano, M.: Visionbased drowsiness detector for a realistic driving simulator. In: Conference on Intelligent Transportation Systems, pp. 887–894 (2010) 189. 
Reddy, B., Kim, Y.H., Yun, S., Seo, C., Jang, J.: Real-time driver drowsiness detection for embedded system using model compression of deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 121–128 (2017) 190. Vu, T.H., Dang, A., Wang, J.C.: A deep neural network for real-time driver drowsiness detection. IEICE Trans. Inf. Syst. 102(12), 2637–2641 (2019) 191. Shakeel, M.F., Bajwa, N.A., Anwaar, A.M., Sohail, A., Khan, A.: Detecting driver drowsiness in real time through deep learning based object detection. In: International Work-Conference on Artificial Neural Networks, pp. 283–296. Springer, Cham (2019)
1 Vision-Based Human Activity Recognition
41
192. Ghoddoosian, R., Galib, M., Athitsos, V.: A realistic dataset and baseline temporal model for early drowsiness detection. In: IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0 (2019) 193. Dasgupta, A., George, A., Happy, S.L., Routray, A.: A vision-based system for monitoring the loss of attention in automotive drivers. IEEE Trans. Intell. Transp. Syst. 14(4), 1825–1838 (2013) 194. Zhang, L., Jiang, M., Farid, D., Hossain, M.A.: Intelligent facial emotion recognition and semantic-based topic detection for a humanoid robot. Expert Syst. Appl. 40(13), 5160–5168 (2013) 195. Roitberg, A., Perzylo, A., Somani, N., Giuliani, M., Rickert, M., Knoll, A.: Human activity recognition in the context of industrial human-robot interaction. In: Signal and Information Processing Association Annual Summit and Conference (APSIPA), pp. 1–10 (2014) 196. Ryoo, M.S., Fuchs, T.J., Xia, L., Aggarwal, J.K., Matthies, L.: Robot-centric activity prediction from first-person videos: what will they do to me?. In: ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 295–302 (2015) 197. Xia, L., Gori, I., Aggarwal, J.K., Ryoo, M.S.: Robot-centric activity recognition from firstperson rgb-d videos. In: IEEE Winter Conference on Applications of Computer Vision, pp. 357–364 (2015) 198. Guadarrama, S., Krishnamoorthy, N., Malkarnenkar, G., Venugopalan, S., Mooney, R., Darrell, T., Saenko, K.: Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: IEEE international conference on computer vision, pp. 2712–2719 (2013) 199. Gan, C., Lin, M., Yang, Y., De Melo, G., Hauptmann, A.G.: Concepts not alone: exploring pairwise relationships for zero-shot video activity recognition. In: AAAI Conference on Artificial Intelligence (2016) 200. Zellers, R., Choi, Y.: Zero-shot activity recognition with verb attribute induction. arXiv:1707.09468 (2017) 201. Xu, X., Hospedales, T., Gong, S.: Transductive zero-shot action recognition by word-vector embedding. Int. J. Comput. Vision 123(3), 309–333 (2017) 202. Feng, S., Duarte, M.F.: Few-shot learning-based human activity recognition. Expert Syst. Appl. 138, 112782 (2019) 203. Mishra, A., Verma, V.K., Reddy, M.S.K., Arulkumar, S., Rai, P., Mittal, A.: A generative approach to zero-shot and few-shot action recognition. In: IEEE Winter Conference on Applications of Computer Vision, pp. 372–380 (2018) 204. Xu, B., Ye, H., Zheng, Y., Wang, H., Luwang, T., Jiang, Y.G.: Dense dilated network for few shot action recognition. In: ACM on International Conference on Multimedia Retrieval, pp. 379–387 (2018) 205. Yu, T.H., Kim, T.K., Cipolla, R.: Real-time action recognition by spatiotemporal semantic and structural forests. Br. Mach. Vision Conf. 2(5), 6 (2010) 206. Raptis, M., Sigal, L.: Poselet key-framing: a model for human activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2650–2657 (2013) 207. Satkin, S., Hebert, M.: Modeling the temporal extent of actions. In: European Conference on Computer Vision, pp. 536–548 (2010) 208. Sudhakaran, S., Escalera, S., Lanz, O.: Lsta: Long short-term attention for egocentric action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9954–9963 (2019) 209. Cao, C., Zhang, Y., Wu, Y., Lu, H., Cheng, J.: Egocentric gesture recognition using recurrent 3d convolutional neural networks with spatiotemporal transformer modules. 
In: IEEE International Conference on Computer Vision (2017) 210. Ma, M., Fan, H., Kitani, K.M.: Going deeper into firstperson activity recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (2016) 211. Ryoo, M.S., Rothrock, B., Matthies, L.: Pooled motion features for first-person videos. In: IEEE Conference on Computer Vision and Pattern Recognition (2015)
42
T. Mahmud and M. Hasan
212. Sigurdsson, G., Gupta, A., Schmid, C., Farhadi, A., Alahari, K.: Actor and observer: joint modeling of first and third-person videos. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 7396–7404 (2018) 213. Singh, S., Arora, C., Jawahar, C.V.: First person action recognition using deep learned descriptors. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 2620–2628 (2016) 214. Sudhakaran, S., Lanz, O.: Convolutional long short-term memory networks for recognizing first person interactions. In: IEEE International Conference on Computer Vision Workshop, pp. 2339–2346 (2017) 215. Sudhakaran, S., Lanz, O.: Attention is all we need: nailing down object-centric attention for egocentric activity recognition. In: British Machine Vision Conference (2018) 216. Zaki, H.F.M., Shafait, F., Mian, A.S.: Modeling sub-event dynamics in first-person action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 7253– 7262 (2017) 217. Zhou, Y., Ni, B., Hong, R., Yang, X., Tian, Q.: Cascaded interactional targeting network for egocentric video analysis. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1904–1913 (2016) 218. Dima, D., Doughty, H., Farinella, G.M., Fidler, S., Kazakos, A.F.E., Moltisanti, D. et al.: Scaling egocentric vision: the epic-kitchens dataset. In: European Conference on Computer Vision, pp. 720–736 (2018) 219. Lea, C., Hager, G.D., Vidal, R.: An improved model for segmentation and recognition of fine-grained activities with application to surgical training tasks. In: IEEE winter conference on applications of computer vision, pp. 1123–1129 (2015) 220. Kataoka, H., Aoki, Y., Satoh, Y., Oikawa, S., Matsui, Y.: Fine-grained walking activity recognition via driving recorder dataset. In: IEEE International Conference on Intelligent Transportation Systems, pp. 620–625 (2015) 221. Pishchulin, L., Andriluka, M., Schiele, B.: Fine-grained activity recognition with holistic and pose based features. In: German Conference on Pattern Recognition, pp. 678–689 (2014)
Chapter 2
Skeleton-Based Activity Recognition: Preprocessing and Approaches Sujan Sarker, Sejuti Rahman, Tonmoy Hossain, Syeda Faiza Ahmed, Lafifa Jamal, and Md Atiqur Rahman Ahad
Abstract Research in activity recognition is one of the thriving areas in the field of computer vision. This development comes into existence by introducing skeleton-based architectures for action recognition and related research areas. By advancing the research into real-time scenarios, practitioners find it fascinating and challenging to work on human action recognition because of the following core aspects: numerous types of distinct actions, variations in the multimodal datasets, feature extraction, and view adaptiveness. Moreover, hand-crafted features and depth sequence models cannot perform efficiently on multimodal representations. Consequently, recognizing many action classes by extracting a small set of discriminative features is a daunting task. As a result, deep learning models have been adapted to the field of skeleton-based action recognition. This chapter covers the fundamental aspects of skeleton-based action recognition: skeleton tracking, representation, preprocessing techniques, feature extraction, and recognition methods. It can serve as a starting point for a researcher who wishes to work in action analysis or recognition based on skeleton joint points.
S. Sarker (B) · S. Rahman · S. Faiza Ahmed · L. Jamal
Department of Robotics and Mechatronics Engineering, University of Dhaka, Dhaka, Bangladesh
e-mail: [email protected]
S. Rahman e-mail: [email protected]
S. Faiza Ahmed e-mail: [email protected]
L. Jamal e-mail: [email protected]
T. Hossain
Department of Computer Science and Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh
e-mail: [email protected]
M. A. R. Ahad
Department of Electrical and Electronic Engineering, University of Dhaka, Dhaka, Bangladesh
e-mail: [email protected]
Department of Media Intelligent, Osaka University, Osaka, Japan
2.1 Introduction

Skeleton-based action recognition has captured significant attention because of its ability to handle large datasets while improving execution speed. The skeleton provides the structural organization of the human body, giving it shape and support. Skeleton-based action recognition relies on analyzing the 3D orientation of body joints, using a depth-measuring camera to generate a depth map [1]. Alternatively, the skeleton can be tracked directly from RGB images using pose estimation [2] or by placing markers at known locations on the human body [3]. Movements are detected by tracking the coordinate changes of the joints. Activity recognition applications, including daily activity monitoring, fall detection, rehabilitation, autistic child care, and patient monitoring, can be greatly facilitated by the progressive sophistication of this technology [4].

Activity recognition has become a hotspot in computer vision research. An action can be defined as a single motion change, or a series of motion changes, performed by a human agent; activity recognition implies detecting sequences of such actions. Human Activity Recognition (HAR) thus refers to detecting the actions executed by a subject with the help of observations from different sensory devices [5]. Action recognition can be categorized into three types: single-user, multi-user, and group activity recognition [6]. The advent of skeleton-based systems has made empirical work in this domain easier to perform. Conventional sensors provide the joint coordinates of the human body [7], and this parametric specification of the body, yielded by the skeleton data, is further processed with learning algorithms to label different actions. Figure 2.1 depicts the general steps of a skeleton-based action recognition system.

A skeleton-based vision system outperforms other modes of activity recognition in preserving the privacy of users: it comprises only the coordinate values of specific joints instead of an actual visual representation of the scene. It is also computationally lighter, since the underlying data structure is a set of joint coordinates, whereas RGB images require storing multi-channel pixel values at every position [8]. Moreover, skeleton-based action recognition has been further encouraged by the availability of various types of depth cameras and sensors [9]. With these capturing systems, 3D action datasets in distinct fields are being created, which has stimulated research activity in this area. These skeleton datasets are openly available, which aids in developing a wide range of learning algorithms for activity recognition.
Fig. 2.1 General steps of a skeleton-based action recognition system
Several factors make skeleton-based action recognition a challenging task. Empirical work in this field, such as data collection, is often performed in the constrained environment of a laboratory, since this minimizes the additional features introduced into a model by different backgrounds and surroundings. However, to be applicable in real-life scenarios, the models demand extensive computational developments to account for interruptions related to dynamic background conditions. Additionally, such a system must deal with variations in the subject's body size, position, orientation, and viewpoint, which makes action recognition challenging. The requirement of preserving the joint correlations of the skeleton structure in the representation and preprocessing steps further intensifies this challenge. Moreover, a lack of precision in the tracking devices can also cause imperfect outcomes. Researchers have been addressing these challenges to further optimize such systems.

In the human skeleton, bones and joints play crucial roles in detecting a specific action. The authors of [10] divide the human skeleton into five parts: two legs, two arms, and one trunk. Many studies have emphasized the role of specific joints in action recognition. The work in [11] surveyed skeleton-based human action classification models, with insights on various preprocessing methods, action representations, and classification techniques. Several works are dedicated to deep learning-oriented action recognition models [12]. In [13], the authors reviewed CNN- and RNN-based network architectures, and [14] covered temporal, LSTM, 2D, and 3D models. In another work, Wang et al. [15] reviewed Kinect-based action recognition techniques, followed by a comprehensive study of the HON4D (Histogram of Oriented 4D Normals), HDG (Histograms of Depth Gradients), LARP-SO (Lie Algebra Relative Pairs via SO), SCK + DCK (Sequence Compatibility Kernel + Dynamics Compatibility Kernel), P-LSTM (Part-aware Long Short-term Memory), Clips + CNN + MTLN (Clips + Convolutional Neural Network + Multi-task Learning Network), IndRNN (Independently Recurrent Neural Network), and ST-GCN (Spatio-Temporal Graph Convolutional Networks) algorithms. Jegham et al. [16] highlighted challenges in action recognition systems such as anthropometric variation, multi-view variation, occlusion, inconsistent data, and camera motion.

This chapter presents a comprehensive study of skeleton-based action recognition. The stream of operations in skeleton-based action recognition flows through skeleton tracking, skeleton representation, action representation, preprocessing techniques, feature extraction techniques, and recognition models. We expound on these steps in the rest of the chapter as follows. In Sect. 2.2, we describe different skeleton tracking devices and methods. Section 2.3 reviews different skeleton and action representation techniques. Preprocessing techniques applied to skeletal data are discussed in Sect. 2.4. Both handcrafted and deep learning features are studied in Sect. 2.5. Section 2.6 presents a comprehensive study of recent traditional machine learning and deep learning-based methods for skeleton-based action recognition. Finally, we conclude this chapter with some future directions in Sect. 2.10.
2.2 Skeleton Tracking

Skeleton tracking is the first and a fundamental step of action recognition. Researchers extract skeletal trajectories or postures employing different tracking techniques; depth sensors, the OpenPose toolbox, and body markers are some of the most prominent ones.
2.2.1 Depth Sensors and SDKs

Depth sensors are used to acquire range information in 3D, measuring multi-point distances within a field of view (FoV).

• Kinect Depth Sensors: Among the wide variety of available depth sensors, the Kinect camera from Microsoft has gained popularity among researchers in the 3D action recognition field due to its ability to capture both RGB and depth video in real time [15]. Here, we list some generations of Kinect sensors with their skeleton tracking capabilities:
– Xbox 360 Kinect: Kinect for Xbox 360 is capable of tracking six subjects simultaneously, where each subject is represented by a 20-joint skeleton, as shown in Fig. 2.2a.
– Xbox One Kinect: Kinect for Xbox One can identify six subjects at a time while modeling each subject's skeleton with 25 individual joints (including thumbs), and can detect the position and orientation of these joints.
– Kinect V2: Similar to Kinect for Xbox One, Kinect V2 can track the skeleton data of six subjects at a time, where each skeleton consists of 25 body joints, as shown in Fig. 2.2b, numbered from 0 to 24. Each joint carries color (x, y), depth (x, y), camera (x, y, z), and orientation (x, y, z, w) attributes (a small container sketch for these attributes follows this list).
– Azure Kinect: The Azure Kinect DK is a PC peripheral and the most recently released (March 2020) development kit. It enables tracking of multiple human bodies at a time, assigning an ID to each body that temporally correlates the kinematic skeleton across frames. In the Azure Kinect skeleton architecture, each skeleton consists of 32 joints forming a joint hierarchy in which a child joint is linked to a parent joint by a bone (connection).
• Intel RealSense: Intel RealSense is a depth-sensing platform that works efficiently for understanding 3D objects. Its skeleton-tracking SDK can work on 18 joints simultaneously, and the depth camera supports up to five people in a single scene. An external dedicated Graphics Processing Unit (GPU) is not required to set up this device, and it provides real-time performance for video streaming on edge hardware.
• NUITRACK SDK: Nuitrack is a middleware that can recognize skeleton and gesture data by tracking 19 joints concurrently. It has a cross-platform SDK that enables a Natural User Interface (NUI) on Android, Windows, and Linux, and can be used with various 3D sensors (Kinect V1, Asus Xtion, Orbbec Astra, Orbbec Persee, Intel RealSense). It can generate a 3D point cloud from the tracked data and apply user masks to the subjects.
• Cubemos: Cubemos is a deep learning-based 2D/3D skeleton tracking SDK that can track an unlimited number of subjects in a scene, following 18 joints per subject simultaneously. Its user-friendly integration is accompanied by cross-platform usability (Windows and Linux), and it does not require a GPU for optimized performance.
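As a concrete illustration of the per-joint attributes listed above (color-space and depth-space positions, a camera-space 3D position, and an orientation quaternion), the following minimal Python sketch defines a simple container type. The field names and the 25-joint count are illustrative assumptions, not the official types of any SDK.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TrackedJoint:
    """One tracked joint, mirroring the attributes reported per joint."""
    color_xy: Tuple[float, float]                          # position in the color image
    depth_xy: Tuple[float, float]                          # position in the depth image
    camera_xyz: Tuple[float, float, float]                 # 3D position in camera space
    orientation_xyzw: Tuple[float, float, float, float]    # joint orientation quaternion

# A skeleton is then simply a list of 25 such joints (e.g., Kinect V2 / Xbox One layouts).
skeleton = [TrackedJoint((0, 0), (0, 0), (0, 0, 0), (0, 0, 0, 1)) for _ in range(25)]
```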
2.2.2 OpenPose Toolbox

Proposed by researchers at Carnegie Mellon University, OpenPose is the first real-time multi-person method to jointly detect human body, face, hand, and foot keypoints (135 keypoints in total) from single images [17–19]. The OpenPose toolbox can estimate 15 (OpenPose MPI), 18 (OpenPose-COCO), or 25 (OpenPose Body-25) keypoints for human body/foot joints, with a confidence score for each joint (Fig. 2.2c).
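As a rough sketch of how such keypoints can be consumed downstream, the snippet below parses one OpenPose output frame into an array of (x, y, confidence) rows. It assumes OpenPose was run with its JSON output option and the Body-25 model; the field names follow the commonly documented output layout and should be verified against the OpenPose version in use.

```python
import json
import numpy as np

def load_body25_keypoints(json_path):
    """Parse one OpenPose output frame into an array of shape (num_people, 25, 3).

    Each keypoint row is (x, y, confidence). Assumes the BODY_25 model and the
    JSON layout produced by OpenPose's JSON-writing option.
    """
    with open(json_path) as f:
        frame = json.load(f)
    people = []
    for person in frame.get("people", []):
        flat = person["pose_keypoints_2d"]                 # flat list: x1, y1, c1, x2, y2, c2, ...
        people.append(np.asarray(flat, dtype=float).reshape(-1, 3))
    return np.stack(people) if people else np.empty((0, 25, 3))
```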
Fig. 2.2 Human skeleton models from different tracking systems. a 20 joints (Kinect for Xbox 360), b 25 joints (Kinect V2), and c the 25-joint OpenPose Body-25 model

2.2.3 Body Marker

Body markers play an indispensable role in tracking motion for action recognition. To model the skeleton, each distinct marker is placed to indicate a specific anatomical location; errors in the placement of any of the markers result in a disorganized skeleton shape [20, 21]. There are two types of markers: joint markers (implanted along the joint axes [22]) and segment markers (asymmetrically placed on skeleton body segments [23]).
2.3 Representation

In the field of human action recognition, we have to deal with two types of representation: skeleton representation and action representation. A skeleton with $J$ joints at time step $t$ can be defined as a set $X_t = \{X_t^1, X_t^2, \ldots, X_t^j, \ldots, X_t^J\} \in \mathbb{R}^{D \times J}$, where $X_t^j$ denotes the coordinates of joint $j$ at time step $t$ and $D$ denotes the skeleton dimension (i.e., $D = 3$ for a 3D skeleton). An action, on the other hand, is the way a particular activity is performed through successive changes of the body parts (e.g., walking, running, or skipping). Skeleton representation is concerned with the arrangement of the skeleton structure, whereas action representation captures the movement of the human body over time.
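To make the notation above concrete, the following minimal sketch (in Python with NumPy, which the chapter does not prescribe; the frame and joint counts are illustrative assumptions) stores a skeleton sequence as a $T \times J \times D$ array and reads off one skeleton $X_t$.

```python
import numpy as np

T, J, D = 100, 25, 3                   # frames, joints (e.g., 25 as in Kinect V2), dimensions
sequence = np.random.rand(T, J, D)     # placeholder for tracked joint coordinates

X_t = sequence[10]                     # skeleton at time step t = 10, shape (J, D)
assert X_t.shape == (J, D)

j = 0                                  # index of some joint of interest
print(X_t[j])                          # its [x, y, z] coordinates at time t
```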
2.3.1 Skeleton Representation

The process of specifying a planar region is known as shape representation, and depicting the skeleton in this planar field can be defined as skeleton representation. Here, we describe some of the significant skeleton representation techniques.
2.3.1.1 Coordinate System
The coordinate system is used to express the geometric location of a particular point or object in a two-, three-, or multi-dimensional space. In HAR, two types of coordinate system are most commonly employed:

• Cartesian Coordinate System: The Cartesian coordinate system expresses the mathematical position of an object or point. We can express the Cartesian coordinates of a subject in the 2D coordinate system as (x, y) or in the 3D coordinate system as (x, y, z).
• Cylindrical Coordinate System: This is a 3D coordinate system that defines the position of a point by three quantities: the distance from a reference axis, the direction relative to a reference direction, and the distance from a designated reference plane perpendicular to the reference axis. If ρ (axial distance) is the Euclidean distance from the z-axis, ϕ (azimuth) is the angle between the reference direction and the line from the origin projected onto the reference plane, and z (axial height) is the signed distance from the reference plane, then the cylindrical coordinates of a point are written as P(ρ, ϕ, z) [24]. The transformation from Cartesian to cylindrical coordinates can be defined as (a small conversion sketch follows this list)

$$P(\rho, \varphi, z) = \left(\sqrt{x^2 + y^2},\; \arcsin\!\left(\frac{y}{\sqrt{x^2 + y^2}}\right),\; z\right) \tag{2.1}$$
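The following minimal Python sketch implements Eq. (2.1) directly. Note, as an aside not stated in the text, that `np.arctan2(y, x)` is often preferred in practice because it recovers the azimuth over the full angular range, whereas the arcsin form is ambiguous for points with x < 0.

```python
import numpy as np

def cartesian_to_cylindrical(x, y, z):
    """Convert a 3D Cartesian point to cylindrical coordinates (rho, phi, z), as in Eq. (2.1)."""
    rho = np.hypot(x, y)                               # radial distance from the z-axis
    phi = np.arcsin(y / rho) if rho > 0 else 0.0       # azimuth, arcsin form of Eq. (2.1)
    return rho, phi, z

# Example: a joint located 1 unit along x and 1 unit along y.
print(cartesian_to_cylindrical(1.0, 1.0, 0.5))         # (1.414..., 0.785..., 0.5)
```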
2.3.1.2 Graph-Based Representation
Graph-based representation techniques have been explored comprehensively in the literature for representing the skeleton data of a subject, as depicted in Fig. 2.3. Here, we present some graph-based skeleton representation techniques.

• Skeletal Graph: The human skeleton is considered an undirected graph whose nodes are body joints, and an adjacency matrix is constructed to represent the natural connections between joints in the human skeleton [25–28].
• Spatio-Temporal Graph: A spatio-temporal graph is a weighted directed graph whose nodes and edges act as joints and their associated bones, respectively [29].
• Directed Acyclic Graph (DAG): Skeleton data can be represented by a directed acyclic graph (DAG) based on kinematic dependencies [30, 31]. The kinematic dependency relies on the skeleton joints (vertices) and bones (edges) of the human body. If each joint/vertex is denoted as $j_i$, then the incoming and outgoing bones/edges at that joint are $b_i^-$ and $b_i^+$. The bone $b_j$ joining two joints is defined as a vector directed from the source joint $j_j^s$ to the target joint $j_j^t$ (Fig. 2.3d).
• Structural Graph: An undirected graph can be generated based on the joints and bones, or on the body parts, of the human skeletal system [32]. It is useful when spatial-temporal features or information are needed to train a model. A directed graph focuses on dependencies, while the structural graph deals with the information in the skeleton sequences.
Fig. 2.3 Different graph-based techniques for skeleton representation. a A skeletal graph, b skeleton representation using a directed acyclic graph (DAG), c a skeleton graph with intrinsic dependencies (solid lines) and extrinsic dependencies (dotted lines), d a skeleton graph with structural links, and e a spatio-temporal graph
• Graph with Weighted Adjacency Matrix: The adjacency matrix is a square matrix used to represent a graph and to identify adjacent vertices; it is invariant under graph isomorphism. It can encode descriptive spatial-temporal features [33] and handle scale- or view-invariant information. A weighted adjacency matrix can be represented as [34]

$$a_{ij} = \begin{cases} 0 & \text{if } i = j \\ \alpha & \text{if joints/vertices } i \text{ and } j \text{ are connected} \\ \beta & \text{if joints/vertices } i \text{ and } j \text{ are disconnected} \end{cases} \tag{2.2}$$

A small construction sketch for such skeletal adjacency matrices is given after this list.
• Tree Structure Reference Joints Image: A Tree Structure Reference Joint Image (TSRJI) has been developed for representing skeletal joints. In this architecture, the tree-structured traversal of the skeleton, computed with a depth-first search (DFS), preserves the spatial relations among neighboring joints, while the reference joints capture the spatial relationship of each joint with respect to selected reference joints [35, 36].
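The sketch below constructs the weighted adjacency matrix of Eq. (2.2) for a skeletal graph from a list of bones. The joint count and edge list are illustrative toy values, not a specific sensor's official layout.

```python
import numpy as np

def weighted_adjacency(num_joints, bones, alpha=1.0, beta=0.0):
    """Build the weighted adjacency matrix of Eq. (2.2).

    bones: list of (i, j) joint-index pairs for the skeleton's bones.
    a_ij = 0 on the diagonal, alpha for connected joints, beta otherwise.
    """
    A = np.full((num_joints, num_joints), beta, dtype=float)
    for i, j in bones:
        A[i, j] = A[j, i] = alpha          # undirected skeletal graph
    np.fill_diagonal(A, 0.0)
    return A

# Toy 5-joint skeleton: one trunk joint (0) connected to two arms (1, 2) and two legs (3, 4).
bones = [(0, 1), (0, 2), (0, 3), (0, 4)]
print(weighted_adjacency(5, bones, alpha=1.0, beta=0.0))
```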
2.3.2 Action Representation

Skeleton-based action representation refers to representing a particular action using skeletal data. Action representation can be broken down into three classes: dynamic-based, joint-based, and mined joint-based descriptors, as shown in Fig. 2.4 [37]. In the following, we describe each of these descriptors.
2.3.2.1 Dynamic-Based Descriptors
Dynamic-based descriptors model the dynamics of a subset, or of all, of the joints in a skeleton. There are various architectures with which a dynamic descriptor can be built, such as Hidden Markov Models (HMM) and Linear Dynamical Systems (LDS) [38, 39].
2.3.2.2 Joint-Based Descriptors
Joint-based representations capture the relative locations of the joints, producing a feature representation of a skeleton that reflects the correlations within the body. Based on the pairwise distances of the joints, joint-based descriptors have several subcategories, including spatial descriptors, geometric descriptors, and key-pose based descriptors [37, 40]. Figure 2.5 depicts different skeletal joint-based action representation techniques.

• Spatial Descriptors: These consider the pairwise distances between body joints. Their main deficiency is the lack of temporal information; because this descriptor cannot capture temporal dynamics, it may be inconclusive for action recognition on its own.
• Spatio-temporal Descriptors: Spatial and temporal discriminative features can be represented together by a spatio-temporal graph. Spatio-temporal information is important to recognize an action accurately.
Fig. 2.4 Different techniques for skeleton-based action representation [37]
• Geometric Descriptors: Geometric descriptors consider the geometric relationships among body parts. They determine the sequence of geometric transformations of the skeleton structure as it moves over time, and can also represent geometric subsets of skeleton joints.
• Key-Pose Based Descriptors: A set of key-poses is computed, and the skeleton sequence is characterized by its closest key-poses. This method learns a dictionary of key-poses and describes a skeleton sequence in terms of that dictionary (a minimal sketch of this idea, combined with a pairwise-distance spatial descriptor, follows this list).
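The sketch below illustrates two of the descriptor families above under simple assumptions: a spatial descriptor built from pairwise joint distances, and a key-pose dictionary learned with k-means using scikit-learn. It is one possible instantiation of these ideas, not the specific method of any cited work, and the joint counts and data are stand-ins.

```python
import numpy as np
from sklearn.cluster import KMeans

def pairwise_distance_descriptor(skeleton):
    """Spatial descriptor: upper-triangular pairwise joint distances of one (J, 3) frame."""
    diffs = skeleton[:, None, :] - skeleton[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(skeleton), k=1)
    return dists[iu]

def keypose_histogram(sequence, kmeans):
    """Key-pose descriptor: histogram of nearest key-poses over a (T, J, 3) sequence."""
    frames = np.stack([pairwise_distance_descriptor(s) for s in sequence])
    labels = kmeans.predict(frames)
    return np.bincount(labels, minlength=kmeans.n_clusters) / len(labels)

# Learn a small key-pose dictionary from stand-in data (200 frames, 25 joints).
train = np.random.rand(200, 25, 3)
train_feats = np.stack([pairwise_distance_descriptor(s) for s in train])
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(train_feats)

# Describe a new sequence by its key-pose histogram.
print(keypose_histogram(np.random.rand(50, 25, 3), kmeans))
```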
Fig. 2.5 Joint-based action representation techniques [37]
2.3.2.3 Mined Joints-Based Descriptors
Mined joint-based descriptors try to learn which body parts are related to the action performed by a human skeleton. They provide a way to extract discriminative features and divide them into sub-classes. To discriminate between actions of different classes, the subsets of skeleton joints that correlate with each action must be detected, which is the goal of this descriptor.
2.3.2.4 Lie Group
The elements of a Lie group vary in a regular, smooth, and continuous manner, whereas the members of a discrete group are scattered [41]; this makes a Lie group a differentiable manifold (locally similar to a linear structure). Motion features can be represented on a Lie group as a high-dimensional trajectory, and skeletal data can be represented using a Lie group [42]. Let $S = \{s_1, s_2, s_3, \ldots, s_N\}$ denote the set of skeletal joints and $E = \{e_1, e_2, e_3, \ldots, e_M\}$ the set of skeletal bones. A rotation mapping layer, acting as a rotation matrix, is required to map onto the Lie group. If $e_m$ and $e_n$ are the 3D vectors of two body parts, then the rotation matrix from $e_m$ to $e_n$ is calculated as $R_{m,n}$ [43], where $R_{m,n}^T R_{m,n} = R_{m,n} R_{m,n}^T = I_n$.
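A minimal sketch of one standard way to realize such a rotation matrix is the axis-angle (Rodrigues) construction, which rotates the direction of bone $e_m$ onto $e_n$. This is an illustrative construction under the assumption of unit-normalized bone directions, not necessarily the exact formulation used in [43].

```python
import numpy as np

def rotation_between(e_m, e_n):
    """Rotation matrix R that rotates the unit direction of bone e_m onto that of e_n.

    Uses the axis-angle (Rodrigues) formula; the result is orthogonal (R.T @ R = I).
    Parallel and anti-parallel bones are handled as special cases.
    """
    u = np.asarray(e_m, dtype=float) / np.linalg.norm(e_m)
    v = np.asarray(e_n, dtype=float) / np.linalg.norm(e_n)
    w = np.cross(u, v)                       # rotation axis scaled by sin(theta)
    c = float(np.dot(u, v))                  # cos(theta)
    s2 = float(np.dot(w, w))                 # sin^2(theta)
    if s2 < 1e-12:
        if c > 0:
            return np.eye(3)                 # bones already aligned
        # anti-parallel: rotate 180 degrees about any axis perpendicular to u
        helper = np.array([1.0, 0.0, 0.0]) if abs(u[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        n = np.cross(u, helper)
        n /= np.linalg.norm(n)
        return 2.0 * np.outer(n, n) - np.eye(3)
    K = np.array([[0.0, -w[2], w[1]],
                  [w[2], 0.0, -w[0]],
                  [-w[1], w[0], 0.0]])       # skew-symmetric cross-product matrix of w
    return np.eye(3) + K + (K @ K) * ((1.0 - c) / s2)

R = rotation_between([0.0, 1.0, 0.0], [1.0, 0.0, 0.0])
print(np.allclose(R @ np.array([0.0, 1.0, 0.0]), [1.0, 0.0, 0.0]))   # True: u is rotated onto v
print(np.allclose(R.T @ R, np.eye(3)))                               # True: R is orthogonal
```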
2.4 Preprocessing Techniques

Preprocessing plays a crucial role in a skeleton-based activity recognition system. Different preprocessing techniques are often used to deal with a subject's size and view variations, anthropometric differences, the varying duration of skeletal sequences, and noise in the captured sequences [37]. In this section, we describe different preprocessing techniques applied to skeletal data.
2.4.1 Coordinate System Transformation
Coordinate system transformation techniques play a vital role in feature extraction by controlling the size and orientation of the object [44–47]. We comprehensively discuss three of the most commonly used coordinate system transformation methods: translation, rotation, and scaling.
• Translation: A translation is a geometric transformation in which every point of the translated object moves by the same distance. Translating a point Z(x, y) by m units horizontally (+ve right, -ve left) and n units vertically (+ve up, -ve down) gives the translated point Z'(x ± m, y ± n).
• Rotation: Rotation is a transformation in which the object is rotated by a particular angle about a point, axis, or line. Most papers adopt the rotation method because it corresponds to the rotation of joints around their associated bones.
• Scaling: Scaling is applied when the size, shape, or orientation of the object must be changed, and it is essential for augmenting skeleton data. If (x, y, z) and (x', y', z') are the coordinates of an object before and after scaling, the scaled coordinates can be expressed as (x', y', z') = (S_x × x, S_y × y, S_z × z), where S_x, S_y, and S_z are the scaling factors along the x, y, and z axes. A combined sketch of these three operations is given below.
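The following minimal sketch applies the three transformations to an array of 3D joint coordinates; the helper names and the random toy skeleton are illustrative, not part of the cited works.

```python
import numpy as np

def translate(joints, offset):
    """Shift every joint by the same 3D offset (rigid translation)."""
    return joints + np.asarray(offset)

def rotate_about_z(joints, angle_rad):
    """Rotate all joints about the z-axis by angle_rad."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return joints @ R.T

def scale(joints, sx, sy, sz):
    """Scale the skeleton independently along x, y and z."""
    return joints * np.array([sx, sy, sz])

# A toy skeleton: N joints x 3 coordinates
joints = np.random.rand(25, 3)
augmented = scale(rotate_about_z(translate(joints, [0.1, -0.2, 0.0]), np.pi / 12), 1.1, 1.1, 1.0)
```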
2.4.2 Skeleton Normalization
The performance of a skeleton-based action recognition system depends on the variance of the subject's body size, position, orientation, and viewpoint. Normalization of the skeleton sequences is required to make the system adaptable to the scale and view variations of the subject.
• View Alignment: To diminish the effect of different viewpoints, a view alignment transformation is performed on each skeletal frame in a skeleton sequence. There are two types of transformation: frame-based transformation (each frame is transformed individually, which reduces the relative motion) and sequence-based transformation (the same transformation is applied to all frames, keeping the original relative motion unchanged). The original coordinates are transformed to view-aligned coordinates by a translation followed by a rotation [48]. The projection is made with respect to the global X-axis, with the hip vectors projected onto the ground [49]. Moreover, two view-invariant features, the joint Euler angles and the Euclidean distance matrix between joints, which represent the joint motion and the structural relations respectively, are associated with the view alignment transformation [50].
• Scale Normalization: Scale normalization is required to eliminate the effect of body-size variations. The average distance between the "hip-center" joint and the "spine" joint is used as the distance unit, and the coordinates of the other joints are rescaled accordingly to obtain the scale-normalized representation [48]. In [49], a person-centric coordinate system is adopted in which the hip center is taken as the origin of the new system, making the skeleton location-invariant. To obtain a scale-invariant representation, all skeletons are normalized so that their body-part lengths equal those of a reference skeleton while keeping their joint angles constant. A minimal normalization sketch follows.
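The sketch below performs hip-centring and scale normalization in the spirit of [48, 49]; the joint indices and array shapes are assumptions that depend on the particular tracker layout, and the function is only one possible reading of those papers.

```python
import numpy as np

def normalize_skeleton(joints, hip_idx=0, spine_idx=1):
    """Person-centric, scale-normalised skeleton sequence.

    joints: (T, N, 3) array of a skeletal sequence.
    The hip joint becomes the origin and all coordinates are divided by the
    average hip-spine distance, removing location and body-size effects.
    hip_idx and spine_idx are illustrative and depend on the tracker's layout.
    """
    centered = joints - joints[:, hip_idx:hip_idx + 1, :]          # hip-centred
    unit = np.linalg.norm(centered[:, spine_idx, :], axis=-1).mean()
    return centered / (unit + 1e-8)

seq = np.random.rand(100, 25, 3)      # 100 frames, 25 joints
norm_seq = normalize_skeleton(seq)
```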
2.4.3 Skeleton Data Compression
Li et al. [51] introduced the concept of skeleton data compression for skeleton-based action recognition. Their algorithm, Motion-based Joint Selection (MJS), compresses the skeleton data by measuring the movements of the different joints: it first calculates the motion flow of the skeleton data, assigns a motion score to each joint based on its motion intensity, and then selects the top k joints according to their scores, thereby reducing the skeleton data size.
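A rough sketch of this idea, under our reading of [51] (score each joint by its total displacement and keep the most active ones), is shown below; it is not the authors' exact implementation.

```python
import numpy as np

def motion_based_joint_selection(seq, k):
    """Keep the k most active joints of a sequence.

    seq: (T, N, 3) skeletal sequence; returns (T, k, 3) and the kept indices.
    """
    motion = np.linalg.norm(np.diff(seq, axis=0), axis=-1)   # (T-1, N) per-frame displacement
    scores = motion.sum(axis=0)                              # motion intensity per joint
    keep = np.argsort(scores)[::-1][:k]                      # indices of the k most active joints
    return seq[:, keep, :], keep

seq = np.random.rand(100, 25, 3)
compressed, kept_joints = motion_based_joint_selection(seq, k=15)
```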
2.4.4 Data Reorganization
Most previous works represent the position information of the skeleton joints as an unordered list. In such representations, all joints are treated equally and their correlation information is not preserved, which makes it difficult for the learning model to extract useful ordering information. Some works use a traditional graph structure to represent the connections between joints; however, because a traditional graph has only one type of edge, this representation also fails to capture the physical connections and symmetric relationships of the skeleton. As a result, these representation methods cannot preserve the correlation information of the joints, which makes it harder to recognize actions when they are fed to learning models. A data reorganization method that represents the local and global structural information of the human skeleton is proposed in [52]; it preserves these structures as well as the correlation information while training the model. The reorganization process consists of three steps:
• Data Mirror: Skeleton joints are split into left and right parts to capture the symmetric property of the human skeleton, preserving the global information associated with an action.
• Data Reorganization Based on Local Structure: Skeleton joints are rearranged in a way that preserves the symmetric and correlation information of the local skeleton structure.
• Dimension Expansion: Similar skeleton joints are superimposed, which expands the data dimension and preserves the correlation information during training.
Table 2.1 lists some of the most widely used preprocessing techniques in skeleton-based activity recognition systems.
Table 2.1 A list of different preprocessing approaches employed in skeleton-based activity recognition systems

Preprocessing approach                                               References
Coordinate system transformation (translation, rotation, scaling)   [44–47, 53–55]
View invariant transformation                                       [48–50, 56]
Sequence level translation                                          [47, 57]
Scale normalization                                                 [28, 48, 49, 53]
Skeleton data compression                                           [51]
Data reorganization                                                 [52]
Noise filtering                                                     [55]
Upsampling/downsampling of skeletal data                            [27]
2.5 Feature Extraction
Features play an important role in action recognition, since the efficiency of a model depends on the extracted features. Here, we describe some of the features used in recent action recognition works, broadly grouped into two categories: hand-crafted features and deep learning features.
2.5.1 Hand-Crafted Features
Hand-crafted features are extracted manually from the skeleton data stream and, in combination, facilitate the recognition of a particular action as well as an activity. Here, we discuss some of the most used hand-crafted features for action recognition.
• Coordinates of Key Points: The coordinates of the key points are used to extract scale-invariant, view-invariant, and center-of-gravity features. Based on the keypoint coordinates, the center of gravity is calculated for the computation of normalized distances [58]. Furthermore, a 3D clip/frame can be represented by these keypoint-based coordinate systems [59].
• Transformation Invariant Features: Scale-invariant and view-invariant features improve the robustness of the 3D skeleton data. The body parts need to be transformed into a view-invariant space to formulate the action [46]. Latent key points or their variations may sometimes affect the invariance of these features [58].
• Centre of Gravity: The center of gravity (G) is required in particular to compute the normalized distances of the key points. This common motion-invariant feature is obtained by normalizing the distances by the vertical distance d. Denoting the coordinates in the XY plane by (x_i, y_i), the keypoint visibility by k_i, and the number of keypoints by N, the CG features can be defined as [58]

G_x = (Σ_{i=1}^{N} x_i k_i) / (Σ_{i=1}^{N} k_i),   G_y = (Σ_{i=1}^{N} y_i k_i) / (Σ_{i=1}^{N} k_i)   (2.3)

(a small computational sketch of this and of the bone features appears after this list).
• Angle: The angle between skeleton joints is one of the most common features extracted in action recognition models. To make the skeletons scale-invariant, the reference skeleton's body-part lengths are kept equal to the actual lengths of the body parts [46]. Furthermore, Euler angles are employed together with Euclidean distances to cluster the features distinctly [60].
• Velocity/Motion: An action is continuously generated from precise elementary motions such as skipping, jumping, walking, and running [27]. Dynamic motion features, as well as the movement patterns used in a probabilistic model, play a vital role in modeling these basic dynamics [28].
• Co-occurrence Features: The actions of the human body always involve its associated parts; for example, the hand and the arm are both needed for writing. Co-occurrence feature learning is primarily applied to the skeleton joints or key points, and a CNN-based regularization scheme or network is required to learn these co-occurrence features [61].
• Skeleton Motion Decomposition: Human motion depends on two factors: rigid movement and non-rigid deformation. Decomposing the skeletal motion results in a better representation of the motion in a recurrent network. The decomposition consists of a local and a global component: the local component deals with the internal structural deformation of the skeleton, while the global component covers exterior information such as size, shape, and movement within a fixed area. If the coordinates of the set of joint locations are I_t = {(x_t^i, y_t^i)}_{i=1,2,...,k}, where k is the number of skeletal joints and t is the time step, the decomposition in 2D space can be written as I_t^i = I_t^g + I_t^{l,i} [62].
• Key-Segment Descriptor: This is a discriminative descriptor built from skeletal sequences that contains the spatial-temporal information. A key-segment descriptor can be built by the following pipeline [48]: break the skeleton sequence down into multiple skeleton segments, extract the skeleton segment features, cluster the features, and represent each skeleton sequence by cluster indices based on the distance between the clusters and the skeleton sequence.
• Skeleton Map: To interpret skeletal poses, a skeleton map is employed to depict the internal posture of the human body [63]. It is also used to transfer a skeletal pose from one skeleton to another with a different topology. Skeleton joint difference maps [64], depth maps [65, 66], and translation-scale invariant maps [67] are some notable skeleton map models.
• Bone Length and Direction: Bone length and direction are second-order information that provide prominent discriminative features in skeleton-based action recognition. Each bone is expressed as a vector pointing from its source joint to its target joint; the vectors are then fed into a CNN architecture for prediction of the action label [29].
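The following sketch computes two of the hand-crafted features above: the visibility-weighted center of gravity of Eq. (2.3) and the bone length/direction vectors. The bone pairs and the random joint coordinates are hypothetical; the real pairs depend on the skeleton layout in use.

```python
import numpy as np

def center_of_gravity(xy, k):
    """Eq. (2.3): visibility-weighted mean of 2D keypoints.
    xy: (N, 2) keypoint coordinates, k: (N,) visibility flags/weights."""
    w = k.sum() + 1e-8
    return (xy * k[:, None]).sum(axis=0) / w

def bone_features(joints, bones):
    """Bone length and direction (second-order information, cf. [29]).
    joints: (N, 3); bones: list of (source, target) joint-index pairs."""
    vecs = np.array([joints[t] - joints[s] for s, t in bones])
    lengths = np.linalg.norm(vecs, axis=-1)
    directions = vecs / (lengths[:, None] + 1e-8)
    return lengths, directions

joints = np.random.rand(25, 3)
bones = [(0, 1), (1, 20), (20, 4), (4, 5)]      # a few hypothetical connections
lengths, directions = bone_features(joints, bones)
cg = center_of_gravity(joints[:, :2], np.ones(25))
```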
2.5.2 Deep Learning Features
Deep learning features are learned within a neural network architecture, and such architectures nowadays outperform traditional action recognition models. Hence, five types of deep learning features (joint distance and orientation metrics, temporal step matrix, structural/actional links, dynamic skeleton features, and decoupled features) are studied below.
• Joint Distance and Orientation Metrics: A rotation matrix known as the orientation matrix is combined with the joint distances to represent the correlation in the spatial and temporal dimensions. The joint distance is calculated with the Euclidean distance formula in 3D space, and the orientation matrix is measured by the cosine similarity between the skeleton sequences [68].
• Temporal Step Matrix: A temporal step matrix is a matrix in which the temporal features of a skeleton sequence are coded/summarized. If c_i is the cluster center nearest to the i-th skeleton segment and l is the length of the segmented skeleton sequence, each skeleton sequence can be formulated as a word sequence S = {s_1^{c_1}, s_2^{c_2}, s_3^{c_3}, ..., s_l^{c_l}} [48]. After initializing the words, the step matrix m^e of dimension e is updated as [48]

m^e_{w_i, w_{i+e}} = m^e_{w_i, w_{i+e}} + 1,   i = 1, 2, ..., l − e   (2.4)

(a small sketch of this construction is given after this list).
• Structural/Actional Links: A link that captures action-specific latent dependencies is known as an actional link (A-link), while a link that captures higher-order structural dependencies is known as a structural link (S-link). Structural links are mostly used to exemplify the higher-order dependencies of a model [69], while actional links are applied to extract the dependencies that are action-specific. A graph convolutional neural network termed the Actional-Structural Graph Convolutional Network (AS-GCN) is developed by fusing these two kinds of links; for that purpose, an Actional-Structural Graph Convolution (ASGC) block is employed to model the structure. If λ is a hyper-parameter controlling the balance between the structural and actional links and d_out is the output feature dimension (Z_out ∈ R^{n×d_out}), the ASGC operation can be formulated as [69]

Z_out = ASGC(Z_in) = Z_structure + λ Z_action ∈ R^{n×d_out}   (2.5)
• Dynamic Skeleton Features: Evolving over time, dynamic skeleton features are defined by changes of motion or position. They contain sufficient discriminative power to identify the action in a 3D video. The dynamic skeleton movements can be decomposed into two parts: local body posture and global body movement [62].
• Decoupled Features: Decoupled features follow an attention-like mechanism in which the most useful features may be drawn from different situations during the learning stages. With these features, the model achieves much better generalization than a monolithic network. The decoupled features also interact with the spatial-temporal model to identify irregularities in a 3D video [62, 70].
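The sketch below builds the temporal step matrix of Eq. (2.4) for a toy "word" sequence of cluster indices; the example sequence and cluster count are made up for illustration.

```python
import numpy as np

def temporal_step_matrix(word_seq, num_words, e):
    """Eq. (2.4): step matrix of dimension e for a cluster-index ("word") sequence.

    word_seq: list of cluster indices for the l skeleton segments;
    m^e[w_i, w_{i+e}] counts co-occurrences of words that are e steps apart.
    """
    m = np.zeros((num_words, num_words), dtype=np.int32)
    l = len(word_seq)
    for i in range(l - e):
        m[word_seq[i], word_seq[i + e]] += 1
    return m

# A toy sequence of 8 segments assigned to 4 key-pose clusters
words = [0, 2, 2, 1, 3, 1, 0, 2]
m1 = temporal_step_matrix(words, num_words=4, e=1)
m2 = temporal_step_matrix(words, num_words=4, e=2)
```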
2.6 Recognition Method
A number of distinguished methods have been proposed in the field of skeleton-based action recognition. Some authors have focused on machine learning-based approaches, while others have embraced neural network-based methods. In the following, we categorize the recognition methods into distinct sections with appropriate description and characterization.
2.6.1 Machine Learning Based Recognition
Previously (before 2015), most authors focused on basic or ensemble machine learning methods for skeleton-based action recognition, adopting classification and regression techniques according to the dataset and the problem. While some works focused on extracting discriminative features, others focused on the learning stages. A comprehensive study of these techniques, with a brief representation of each model, is outlined below; a minimal classification sketch is given after this list.
• Support Vector Machine (SVM): Developed by Vladimir Vapnik and Alexey Chervonenkis [71, 72], this classification model separates a dataset into two fractions with a linearly separating hyperplane. This decision boundary can take different forms depending on the dimension, such as a line (2D), a plane (3D), or a hyperplane (4D or above). A linear SVM model is adopted to recognize actions represented by 3D skeletons, integrating dynamic time warping and a Fourier temporal pyramid representation with the linear SVM to obtain the optimal result [73]. Furthermore, an action recognition model based on handwriting-inspired features has been proposed [74]: based on a time partitioning scheme, the handwriting-inspired features are extracted from the preprocessed joint locations, after which an SVM is applied to the features for classification [75].
• Particle Swarm Optimization-SVM: The PSO method iteratively improves a candidate solution with respect to a given quality measure. Xu et al. [76] work on a 3D skeleton model, optimizing dynamic time warping and Particle Swarm Optimization on a Support Vector Machine (PSO-SVM). From the training set, they extract the skeleton representation as members of the Special Euclidean group (SE) and apply the proposed optimized algorithm.
• Linear Discriminant Analysis (LDA): LDA tries to find the optimal linear combination of features that separates two or more classes of objects. A view-invariant action recognition model is introduced by combining motion and shape matrices [77]. The model is built in three steps: a distance matrix is formed for each skeleton by calculating the pairwise distances among skeleton joints; the motion and shape cues between the matrices are characterized and the Motion and Shape Matrices are devised; and finally the Motion and Shape Matrices are encoded.
• Naive Bayes Nearest Neighbor: Naive Bayes Nearest Neighbor (NBNN) is a nonparametric classification algorithm that avoids the vector quantization step.
A model based on the Naive-Bayes Nearest Neighbor (NBNN) algorithm that considers spatio-temporal features is developed in [78]. This model relaxes the assumptions of the NBNN algorithm by taking spatial features into account, yielding a Spatio-Temporal NBNN (ST-NBNN) model. Moreover, the model identifies the key temporal stages and spatial joints; for classification, the authors adopt a bilinear classifier.
• Logistic Regression: Using the logistic function, logistic regression estimates the parameters of a model. A logistic regression model is built leveraging the depth and skeleton information of an RGB-D database [79]. With three settings in the learning stage (0°, 45°, and 0° ∪ 45°), a Spatio-temporal Local Binary Pattern (STLBP) and a Fourier Temporal Pyramid (FTP) are implemented for the depth and skeleton features, respectively.
• Citation kNN: The Citation-kNN algorithm is used in a skeleton-based recognition model [80]. The authors cast the recognition task as a Multiple Instance Learning (MIL) problem, emphasizing the substantial inter-class similarity and intra-class variability of the data. Moreover, the model provides a transparent way to handle temporal misalignment and to regulate tolerance to noise. Validated on three datasets, the performance of the model varies considerably across them.
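The following sketch shows a generic SVM classification pipeline of the kind used by the works above. The data are random stand-ins for fixed-length feature vectors extracted from skeleton sequences; none of the names or numbers come from the cited papers.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical setup: each skeleton sequence has already been reduced to a
# fixed-length feature vector (e.g. concatenated joint statistics).
num_sequences, feat_dim, num_classes = 200, 150, 5
X = np.random.rand(num_sequences, feat_dim)
y = np.random.randint(0, num_classes, size=num_sequences)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1.0))
clf.fit(X[:160], y[:160])                    # simple split: first 160 samples for training
print("held-out accuracy:", clf.score(X[160:], y[160:]))
```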
2.6.2 Neural Network Based Recognition
Neural network models for action recognition nowadays surpass the earlier traditional machine learning models in terms of performance and precision [57, 68, 81]. In the following, we study some of the prominent neural network architectures in the field of skeleton-based action recognition [82].
• Recurrent Neural Network (RNN): A class of Artificial Neural Network (ANN) that generalizes the feed-forward network by integrating a memory, allowing it to process a sequence of inputs with the aid of an internal state. A view-adaptive RNN architecture is developed to predict virtual viewpoints [83, 84].
• Hierarchical RNN (HRNN): Hierarchical Recurrent Neural Networks are stacks of RNNs designed to model sequential data hierarchically. To recognize actions of the limbs and the trunk, the model in [55] divides the skeleton into five body parts (left arm, right arm, trunk, left leg, and right leg), which are fed into five distinct subnets of the network. There are three fusion layers between the four bidirectionally and recurrently connected subnets (BRNNs). After the fourth BRNN layer (embedded with an LSTM model), a fully connected layer derives the final classification result. Finally, a softmax layer is used together with rotation transformation and random scale augmentation [85].
• Encoder-Decoder RNN: An encoder-decoder RNN consists of two RNNs, where one recurrent neural network encodes a sequence of symbols into a fixed-length vector representation and the other decodes that representation.
Overcoming the limitations of appearance features for anomaly detection in video, a dynamic skeleton feature-based method that models the regular patterns of human movement is proposed in [62]. The model breaks the skeletal movements down into local body posture and global body movement. First, skeleton motion decomposition is performed to extract the dynamic patterns and the relationships between the features; second, the distinct dynamic patterns and their relationships are modeled in the proposed Message-Passing Encoder-Decoder Recurrent Network (MPED-RNN) structure. For the two skeleton feature components, the MPED-RNN consists of two branches, local and global, built from Gated Recurrent Unit (GRU) blocks; the GRUs interact with one another to comprehensively generate the outputs [56].
• Adaptive RNN: A view-adaptive RNN lets the network itself adapt to the observation viewpoint in an end-to-end manner. In [47], the authors introduce a view-adaptive recurrent neural network with an LSTM structure that overcomes significant view-variation problems. The adaptive RNN with the LSTM architecture can be partitioned into two networks: a View Adaptation Subnetwork and a Main LSTM Network. At each time slot, two LSTM layers work in parallel with a Fully-Connected (FC) layer to produce an appropriate observation viewpoint; then three LSTM layers, along with an FC layer and a softmax classifier, produce the output class [84, 86]. Figure 2.6 shows the underlying architecture of an adaptive RNN network.
• Deep Ensemble Network: A deep ensemble network is a hybrid architecture that combines a deep neural network with a related structure. An ensemble method is developed by incorporating a Lie group structure into the deep network architecture [43]. First, rotation mapping layers are developed to transform the input Lie group features, enhancing coordination in the temporal domain; the model then reduces the spatial dimension with a rotation pooling layer; finally, a logarithm mapping layer maps the resulting manifold data into a tangent space. In other articles, similar deep networks are used for geometrical or skeletal feature extraction [52, 70]. Moreover, a Semantics-Guided deep Network (SGN) is developed [57], which introduces high-level semantic joints to improve the efficiency of the feature representation.
• Multi-task Learning Network (MTLN): In a multi-task learning network, all tasks are executed in parallel while the commonalities and differences across the tasks are exploited concurrently. An MTLN-based method that emphasizes the 3D trajectories of human skeleton joints is proposed [36].
Fig. 2.6 An adaptive RNN network architecture [47]
In this model, each skeleton sequence is transformed into three sub-clips, each consisting of seven frames, which helps the learning of spatial-temporal features with a deep network architecture. The three clips generated from the skeleton sequence are fed into a deep CNN model with a temporal mean pooling layer that builds a compact representation of the features. After that, an MTLN consisting of a fully connected (FC) layer, a Rectified Linear Unit (ReLU), another FC layer, and finally a softmax layer processes the four feature vectors (Fig. 2.7). Moreover, similar view-invariant or clip-based deep CNN methods are employed for activity recognition [50, 87–90].
• Symmetric Positive Definite (SPD) Learning: A neural network-based Symmetric Positive Definite (SPD) manifold learning method for skeleton-based hand gesture recognition is developed [91] by combining two distinct aspects, the spatial and the temporal domain. The method consists of three stages: increasing the discriminative power, aggregating the features, and learning. First, the model increases the discriminative power of the learned features with a convolutional layer; second, different structures (ST-GA-NET, TS-GA-NET, and SPDC-NET) perform Gaussian aggregation of the spatial and temporal joint features; finally, the final SPD matrix is learned from the skeletal data using a newly proposed layer based on Stiefel manifolds [92, 93].
• Global Co-occurrence Feature Learning: Global features characterize a frame of a 3D video as a whole in order to generalize the complete object [94]. Li et al. [54] propose a model in which the spatial and temporal co-occurrence features are learned globally, termed Spatio-Temporal-Unit Feature Enhancement (STUFE). To align the skeleton samples, Active Coordinate Skeleton Conversion (ACSC) is introduced for preprocessing. Briefly, the STUFE algorithm generates a feature map capturing the spatio-temporal features, calculates the distances between the feature map units, and produces the final result. The spatio-temporal feature units are built on the shared parameters of a two-branch embedding model (FC + ReLU + FC + ReLU + FC + ReLU) that computes the cosine distance.
• Regularized Deep LSTM Network: A regularized deep LSTM network is a modified version of the basic LSTM network with weight regularization [95, 96]. This technique imposes a constraint on the weights of the LSTM nodes, reducing the chance of overfitting and improving the performance of the model. Combined with co-occurrence feature learning (as input at each time slot), a skeleton-based action recognition model using regularized deep LSTM networks is developed [61]. Furthermore, a dropout algorithm operating on the cells and output responses is incorporated to train the deep LSTM network efficiently [97].
Fig. 2.7 A multi-task learning network
• Temporal LSTM Network: The temporal LSTM network is an adaptation of the LSTM network that extracts and incorporates spatio-temporal features. Lee et al. [45] work on the representation of skeleton-joint features and propose an ensemble method based on LSTM networks. The model transforms the raw skeleton into a definitive coordinate system for robustness and long-term dependency. Different augmentation operations are performed on the transformed input data to extract salient features from the coordinate system. Then, the proposed Temporal Sliding LSTM (TS-LSTM) networks, consisting of short-term, medium-term, and long-term TS-LSTMs, are employed. Spatial-temporal attention models aligned with the temporal LSTM network have also been established with the aid of LSTM networks [98, 99].
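As a minimal sketch of the recurrent classifiers discussed above (not any specific published architecture), the following PyTorch model feeds flattened joint coordinates of each frame to an LSTM and classifies the last hidden state; all sizes are illustrative.

```python
import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    """A minimal recurrent baseline: each frame's flattened joint coordinates
    are fed to an LSTM and the last hidden state is classified."""
    def __init__(self, num_joints=25, hidden=128, num_classes=60):
        super().__init__()
        self.lstm = nn.LSTM(input_size=num_joints * 3, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                       # x: (batch, frames, joints, 3)
        b, t, j, c = x.shape
        out, _ = self.lstm(x.reshape(b, t, j * c))
        return self.fc(out[:, -1])              # class logits from the last time step

model = SkeletonLSTM()
dummy = torch.randn(8, 100, 25, 3)              # 8 random sequences of 100 frames
logits = model(dummy)                           # (8, 60)
```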
2.6.3 Graph Based Recognition
Graph-based action recognition models perform excellently when integrated with LSTMs, CNNs, or Bayesian networks [100–102]. Because human posture correlates naturally with the structure of a directed or acyclic graph, joint trajectories can be represented smoothly with the aid of a graph structure. A minimal sketch of the underlying graph-convolution operation is given after this list.
• Graph-Based Convolutional Neural Network (GCNN): Operating on a graph-based model, a GCNN takes a feature matrix and an adjacency matrix as input and concentrates on the spatial-temporal dependencies [69]. For this purpose, an encoder-decoder structure called the A-link inference module is introduced to extract the action-specific latent dependencies. Modifying the existing GCNN model [34] and adding the A-link inference module, an Actional-Structural Graph Convolutional Network is proposed to capture both temporal and spatial features. In this model, a future pose prediction head is added to capture more detailed patterns with the help of self-supervision [103–107].
• Directed Graph Neural Network (DGNN): Shi et al. [53] develop a directed graph-based neural network method for action recognition that combines first- and second-order spatial-temporal information. The model represents the data as a Directed Acyclic Graph (DAG) focusing on the kinematic dependencies of joints and bones. To make predictions, the authors employ a novel neural network architecture based on the directed graph. Furthermore, motion features are exploited and embedded with the spatial information to improve the performance of the model.
• Adaptive Graph Convolutional Network: In another graph-based network architecture, a two-stream adaptive model focusing on the topology of the network is developed [29]. The topology of the proposed two-stream adaptive graph convolutional network (2s-AGCN) can be learned uniformly or individually by the back-propagation algorithm, taking into account first- and second-order features such as the length and direction of the bones.
Fig. 2.8 A general architecture of Attention enhanced Graph Convolutional LSTM
• Attention Enhanced Graph Convolutional LSTM (AGC-LSTM): An AGC-LSTM network extracts discriminative features in the spatial-temporal configuration and dynamics [108], analyzing the co-occurrence relationship between the spatial and temporal domains. An AGC-LSTM model incorporating this co-occurrence relationship is introduced in [26]. First, the input passes through an FC layer followed by a feature augmentation (FA) layer that computes feature differences. Second, an LSTM network is employed to handle the scale variance of the feature differences (calculated by the FA layer) across position frames. After that, three AGC-LSTM layers, interleaved with temporal average pooling layers, model the features. Lastly, two FC layers, fed by the local features of the focused joints and the global features of all joints, predict the action (Fig. 2.8).
• Long-Short Graph Memory Network: Long-short Graph Memory (LSGM) is a modified form of the LSTM network focused on graph-based structures for action recognition. Working on the latent structural dependencies among internal nodes, an LSGM learns high-level spatial-temporal discriminative features along with the often-overlooked but intrinsic spatial information [25]. Moreover, a calibration module named Graph-Temporal Spatial Calibration (GTSC) is employed to enhance the discriminative ability. The proposed method can be broken into three segments: the first represents the skeleton sequence in a 3D coordinate system and passes it to three Bi-LSGM + temporal attention branches; the second multiplies the outcomes of the Bi-LSGM and temporal attention models of the three coordinates discriminatively and concatenates the results; and the final segment performs the spatial calibration and applies a softmax to generate the final output.
• Bayesian Graph Convolution LSTM: The Bayesian Graph Convolution LSTM extends the Graph-Convolution LSTM (GC-LSTM) with a Bayesian framework. Generalization across subjects and the spatial and temporal dependencies of actions are the factors taken into consideration in a Bayesian graph convolution embedded in an LSTM network [28]. First, structure-aware features are extracted from the body-pose data and an LSTM network captures their temporal dynamics. Then, a Bayesian probabilistic model is used to capture the variation and stochasticity in the data. To prevent overfitting, the dropout technique is applied to both the input and output of the LSTM.
Fig. 2.9 A basic architecture of Bayesian Graph Convolution LSTM
A representation of the Bayesian Graph Convolutional LSTM architecture is depicted in Fig. 2.9.
• Disentangled and Unified Graph-Based Representation: A disentangling and unifying graph-based representation performs well in two respects: modeling joint relationships with multiscale operators and extracting convoluted spatial-temporal dependencies [102]. The model works first on a disentangled formulation of multi-scale graph convolution and, second, on a graph convolution operator (G3D) based on a unified spatial-temporal representation. For effective long-range modeling, the approach disentangles the emphasis placed on nodes in different neighborhoods.
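The sketch below, referred to at the start of this list, shows the basic spatial graph-convolution step shared by the models above: feature mixing along a normalized skeleton adjacency followed by a linear transform. The 5-joint chain graph is hypothetical and the layer is a generic illustration, not a reproduction of any cited network.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One spatial graph-convolution step over the skeleton graph:
    features are mixed along the symmetrically normalised adjacency
    and then linearly transformed."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        A = adjacency + torch.eye(adjacency.size(0))         # add self-loops
        d = A.sum(dim=1)
        self.register_buffer("A_hat", A / torch.sqrt(d[:, None] * d[None, :]))
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):                    # x: (batch, joints, in_dim)
        return torch.relu(self.linear(self.A_hat @ x))

# Hypothetical 5-joint chain graph
A = torch.zeros(5, 5)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0
layer = GraphConv(3, 16, A)
out = layer(torch.randn(2, 5, 3))            # (2, 5, 16)
```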
2.6.4 Unsupervised Action Recognition
Su et al. [56] propose an unsupervised skeleton-based action recognition model that focuses on prediction and clustering. It is an encoder-decoder RNN in which the encoder and decoder learn hidden states in feature spaces where identical and different movements fall into the same and different clusters, respectively. The model requires neither action labels nor camera and depth inputs at any stage of the learning process; specifically, it only needs the body keypoints, and can therefore operate on data of various dimensions.
2.6.5 Spatial Reasoning and Temporal Stack Learning
To address spatial structure information and temporal dynamic features, a Spatial Reasoning and Temporal Stack Learning (SR-TSL) model consisting of a Spatial Reasoning Network (SRN) and a Temporal Stack Learning Network (TSLN) is developed [109]. The SRN extracts high-level spatial information, while the TSLN builds detailed dynamic skeleton features; the TSLN is assembled from multiple skip-clip LSTM networks. First, the skeleton structure is fed into the SRN, which consists of five FC layers.
The outputs from the FC layers serve as input to the Regional Gating Neural Network (RGNN). The RGNN then passes the data to an FC layer, which forwards its output to the TSLN structure. The TSLN consists of a velocity network and a position network, which determine the motion and the position, respectively.
2.6.6 Reinforcement Learning
To condense the most informative frames and dispense with ambiguous or redundant ones, a deep progressive reinforcement learning method for action recognition is proposed [34]. The model progressively adjusts the selected frames by assessing their quality and their relationship to the whole video. Moreover, treating the posture of the human body as the topology of a graph, the model integrates a graph-based CNN to capture the dependencies between the joints. An FDNet (Frame Distillation Network) and a Graph-based Convolutional Neural Network (GCNN) are the two sub-networks of the proposed model.
2.6.7 Convolutional Sequence Generation
Generating extended action sequences can benefit the learning process in action recognition [110]. The Convolutional Sequence Generation Network (CSGN) models the architecture in the spatial and temporal dimensions. First, a Gaussian Process (GP) is applied to the input data, sampling it into a latent variable sequence that is deconvolved into a spatial-temporal graph (SPG). Second, graph upsampling and convolution are performed on the variable sequence in the SPG to produce the final action sequence. Finally, the CSGN is equipped with a bidirectional transformation between the observed and latent spaces, which facilitates semantic manipulation of human action sequences in different forms [27].
2.6.8 Attention Network
To capture long-term spatial-temporal relationships by leveraging long-range correlations, three versions of the self-attention network (SAN) are developed [111]: SAN-V1 (a baseline model encoding the input features), SAN-V2 (which learns the movement), and SAN-V3 (which learns the different modalities). First, the joints' motion must be identified, and a position embedding layer is employed for this purpose. The three SAN variants are then applied to the output of the position embedding and their results are concatenated. After global averaging, the concatenated result is passed to an FC layer for the final outcome.
A Sub-Sequent Attention Network (SSAN) operating on spatio-temporal features has also been proposed [112, 113].
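The core operation behind these attention networks is scaled dot-product self-attention over per-frame features. The sketch below is a generic illustration under that assumption, not the exact SAN-V1/V2/V3 design; the projection matrices and dimensions are made up.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of frame features."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)   # (frames, frames) affinities
    return F.softmax(scores, dim=-1) @ v                     # long-range mixing of frames

frames, dim = 100, 64
x = torch.randn(frames, dim)                                 # per-frame skeleton embedding
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                       # (100, 64)
```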
2.7 Performance Measures
Performance evaluation is an important task that assesses the quality and achievement of a model. The performance evaluation metrics commonly used for action recognition are accuracy, precision, and recall. In this section we give a brief description of these evaluation metrics.
• Prerequisite Definitions
  – True Positive (TP): the actual and predicted action are both positive.
  – True Negative (TN): the actual and predicted action are both negative.
  – False Positive (FP): the actual action is negative but the predicted action is positive.
  – False Negative (FN): the actual action is positive but the predicted action is negative.
• Accuracy: By far the most fundamental performance metric for action classification. Accuracy is defined as the ratio of correctly predicted actions to all actions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (2.6)
In skeleton-based action recognition, four types of accuracy are commonly evaluated:
  – Top-1 (%): Top-1 accuracy is obtained when the actual class must match the single most probable class predicted by the model; it is computed in the same way as the standard accuracy. This measure is adopted by most action recognition works evaluating on the Kinetics dataset [114].
  – Top-5 (%): When the actual action class only has to appear among the five most probable classes predicted by the architecture, the result is referred to as Top-5 accuracy. Top-1 and Top-5 accuracies are both reported on the Kinetics dataset [114].
  – Cross-Subject (CS): When the evaluation protocol splits the subjects of a dataset between training and testing, the resulting accuracy is called Cross-Subject (CS) accuracy. For example, after splitting the subjects into training and testing groups, 40,320 of the 56,880 samples in the NTU-60 dataset are used for training and the remaining 16,560 samples for testing [115].
  – Cross-View (CV): Cross-View (CV) evaluation splits the samples according to the camera view. In the NTU dataset, the samples are captured by three Kinect V2 cameras simultaneously [115].
This evaluation protocol uses the samples of cameras two and three for training and those of camera one for testing; for the NTU-60 dataset, this yields 37,920 training and 18,960 testing samples [111].
• Precision: The ratio of the number of accurately predicted positive actions to all positively predicted actions; it reflects the accuracy on the minority class.

Precision = TP / (TP + FP)   (2.7)

• Recall: The ratio of the number of accurately predicted positive actions to all actual positive actions; it captures the missed positive predictions.

Recall = TP / (TP + FN)   (2.8)
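The short sketch below computes these measures directly from classifier outputs; the random scores and labels are placeholders, and the per-class precision/recall follows Eqs. (2.7) and (2.8).

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true class is among the k highest scores
    (k=1 and k=5 give the Top-1 and Top-5 measures of Sect. 2.7)."""
    topk = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

def precision_recall(pred, labels, positive):
    """Eqs. (2.7) and (2.8) for one action class treated as 'positive'."""
    tp = np.sum((pred == positive) & (labels == positive))
    fp = np.sum((pred == positive) & (labels != positive))
    fn = np.sum((pred != positive) & (labels == positive))
    return tp / (tp + fp + 1e-8), tp / (tp + fn + 1e-8)

scores = np.random.rand(1000, 60)              # stand-in classifier scores for 60 classes
labels = np.random.randint(0, 60, size=1000)
print(top_k_accuracy(scores, labels, k=1), top_k_accuracy(scores, labels, k=5))
print(precision_recall(scores.argmax(axis=1), labels, positive=0))
```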
2.8 Performance Analysis
A brief analysis of the performance of existing skeleton-based action recognition models is covered in this section. For this purpose, we consider several large-scale multimodal action recognition datasets, namely NTU-60, NTU-120, and Kinetics-400.
NTU-60 and NTU-120 Datasets: NTU RGB-D is a large-scale multimodal dataset. Based on the number of action classes, this dataset has two versions, namely NTU-60 (60 classes) [115] and NTU-120 (120 classes) [116]. The first version, consisting of 60 classes, is built from 56,880 video samples covering 40 subjects aged between 10 and 35, captured from 80 different camera viewpoints. Covering 120 action classes, the second version has 114,540 videos from 106 subjects (aged between 10 and 57) captured from 155 distinct camera viewpoints. These datasets comprise daily-life RGB videos, 25-joint 3D skeleton data, depth map sequences, and infrared videos for each sample; three Kinect V2 sensors are employed concurrently to capture the videos.
Kinetics-400 Dataset: Kinetics-400 is one of the prominent large-scale multimodal RGB action recognition datasets. Because its samples are in RGB format, most researchers convert the RGB data into a skeletal format using the OpenPose method [117]. Collected from YouTube videos of daily-life activities [118], approximately 306,425 short videos are included, covering 400 action types, with roughly 400 video clips per action class.
The performance of various types of skeleton-based action recognition architectures has been comprehensively analyzed. We focus on the large-scale multimodal datasets NTU-60, NTU-120, and Kinetics-400 to give a brief idea of the existing real-time action recognition implementations. First of all, we analyze the deep learning architectures evaluated on the NTU-60 and NTU-120 datasets and characterize the results in Tables 2.2 and 2.3.
Table 2.2 Performance analysis (in %) of deep learning-based HAR approaches on the NTU-60 dataset. Hand-crafted methods are not listed, as their accuracies are very low in comparison

References            Year   Method                                           CS      CV
Lee et al. [45]       2017   Temporal Sliding LSTM                            74.6    81.25
Rahmani et al. [46]   2017   CNN                                              75.2    83.1
Zhang et al. [47]     2017   View Adaptive LSTM                               79.4    87.6
Ke et al. [36]        2017   CNN + MTLN                                       79.57   84.83
Yan et al. [117]      2018   ST-GCN                                           81.5    88.3
Tang et al. [34]      2018   DPRL + GCNN                                      83.5    89.8
Si et al. [109]       2018   Spatial Reasoning and Temporal Stack Learning    84.8    92.4
Zhang et al. [84]     2019   View Adaptive-RNN                                79.4    87.6
Zhao et al. [28]      2019   Bayesian LSTM                                    81.8    89.0
Li et al. [69]        2019   Actional Structural-GCN                          86.8    94.2
Si et al. [26]        2019   Attention Enhanced Graph Convolutional-LSTM      89.2    95.0
Shi et al. [29]       2019   2 stream-AGCN                                    88.5    95.1
Su et al. [56]        2020   Feature-level Auto-encoder                       50.7    76.1
Cho et al. [111]      2020   SAN                                              87.2    92.7
Huang et al. [25]     2020   LSGM                                             84.71   91.74
Li et al. [54]        2020   STUFE + ACSC                                     86.9    92.5
Tian et al. [88]      2020   STGCN + co-occurrence feature learning           84.7    90.2
Cheng et al. [100]    2020   4-stream Shift Graph Convolutional Network       90.7    96.5
Liu et al. [102]      2020   Multi-scale G3D                                  91.5    96.2
De et al. [44]        2020   Infrared + Pose estimation                       91.6    94.5
Dong et al. [119]     2020   GCNN                                             91.7    96.8
In Table 2.2, a GCNN model based on a graph convolutional architecture achieves the best performance in terms of both cross-subject and cross-view accuracy [119]. It can also be seen that graph-based action recognition architectures outperform the LSTM and multi-task learning architectures. On the NTU-120 dataset, a multi-scale G3D architecture outperforms the other current skeleton-based action recognition models, achieving 86.9% and 88.4% accuracy [102]. Because NTU-120 covers 120 distinct classes, including the 60 classes of NTU-60, model performance on NTU-120 is relatively lower than on NTU-60. Nevertheless, training a model on such a multimodal, large-scale set of action classes is more useful and efficient for real-time implementation. Analyzing the architectures evaluated on the Kinetics-400 dataset, the MS-G3D model again outperforms the other existing deep learning architectures [102].
Table 2.3 Performance analysis (in %) of deep learning-based HAR approaches on the NTU-120 dataset

References                  Year   Method                                           CS     CV
Liu et al. [95]             2017   Temporal Sliding LSTM + feature fusion           58.2   60.9
Liu et al. [120]            2017   Global Context-Aware Attention-LSTM              58.3   59.2
Ke et al. [36]              2017   MTLN                                             58.4   57.9
Liu et al. [121]            2017   Skeleton Visualization                           60.3   63.2
Hu et al. [122]             2018   Soft RNN                                         36.3   44.9
Liu et al. [123]            2019   Scale Selection Network                          59.9   62.4
Liu et al. [95]             2017   2 stream Global Context-Aware Attention LSTM     61.2   63.3
Ke et al. [87]              2018   MT-CNN with RotClips                             62.2   61.8
Papadopoulos et al. [124]   2019   Graph Vertex Feature Encoder + AGCN              78.3   79.8
Huynh et al. [125]          2020   Geometric Pose Transition Dual-Stream CNN        76.3   78.9
Cheng et al. [100]          2020   4-stream Shift Graph Convolutional Network       85.9   87.6
Dong et al. [119]           2020   GCNN                                             86.4   89.4
Liu et al. [102]            2020   MS-G3D Net                                       86.9   88.4
Table 2.4 Performance analysis (in %) of deep learning-based HAR approaches on the Kinetics-400 dataset

References           Year   Method                                   Top-1   Top-5
Yan et al. [117]     2018   ST-GCN                                   30.7    52.8
Zhang et al. [126]   2019   Body-part level hybrid model             33.4    56.2
Li et al. [127]      2019   Spatio-Temporal Graph Routing-GCN        33.6    56.1
Li et al. [69]       2019   AS-GCN                                   34.8    56.5
Shi et al. [29]      2019   2s-AGCN                                  36.1    58.7
Shi et al. [53]      2019   DGNN                                     36.9    59.6
Zhang et al. [107]   2020   Advanced CA-GCN                          34.1    56.6
Cho et al. [111]     2020   SAN                                      35.1    55.7
Liu et al. [102]     2020   MS-G3D                                   38.0    60.9
However, even though MS-G3D performs comparatively well on Kinetics-400, its Top-1 and Top-5 accuracies are significantly lower than the performance reached on the NTU datasets (Table 2.4). One probable reason is the much larger number of action classes covered by this dataset. For real-time implementation, Kinetics-400 therefore remains one of the best recent benchmarks on which to develop a model that works efficiently.
2.9 Challenges in Skeleton-Based Activity Recognition
Several real-world challenges of human activity recognition, together with related works, are described in Ke et al. [36]. Skeleton-based activity recognition systems contend with similar challenges. This section discusses the challenges faced in the different steps of a skeleton-based human activity recognition system.
• Diversity in Anthropometric Measures: Anthropometric variations alter the information extracted from the images through factors such as physical deformation, postural irregularity, structural diversity, and clothing variety of the subjects. The effect of these factors on the outcome is inherently context-dependent, and they often reduce the accuracy and precision the system could otherwise achieve. Researchers have sought to remove these effects by extracting the actor's appearance and shape features. Silhouette extraction techniques, combined with dynamic components such as the Motion Energy Image (MEI) and the Motion History Image (MHI), are also effective at suppressing anthropometric effects.
• Multi-View Variation: An ideal HAR system should cope with different viewpoints so that it can recognize actions from unseen and unknown viewpoints. This is not easy to achieve, as most HAR datasets are collected in a constrained lab setup with a fixed viewpoint, so failures due to viewpoint changes are bound to arise. The most viable solution is to introduce multiple cameras into the experimental setup; however, that produces huge amounts of data and increases the system's computational complexity.
• Dynamic Background: Most HAR datasets are recorded in uncluttered lab setups, which helps the models focus on the action being analyzed. Consequently, the resulting systems are often unable to ignore the noise caused by cluttered and dynamic backgrounds, and such interruptions in the scene degrade recognition performance.
• Low-Quality Videos: Disruptive characteristics of low-quality videos, such as inconsistent frame flow, blurred frames during motion changes, coarse resolution, and data compression, are serious issues in action recognition; they repeatedly force the system to diverge from the central action points.
• Occlusion: Occlusion is an obstruction in the view of a subject that hinders the action recognition process by covering a necessary part. The three primary types of occlusion are:
  – Self-occlusion: one body part obstructs the view of another body part.
  – Occlusion in crowds: subjects disappear behind others in a crowded place.
  – Occlusion due to objects: objects occlude a body part simply by appearing in front of the subject.
• Illumination and Scale Factors: Constant illumination, a prerequisite for optimal HAR performance, is often not assured in real-world situations. Interruptions of the light source, shadows, and obstructions disrupt the illumination of the environment. This creates problems in action recognition: shadows produce fake silhouettes of analyzable shapes, which lead to inaccurate conclusions about the presence of a subject. Scale variations occur when the distance between the subject and the camera changes, and they also affect the results adversely.
• Camera Motion: Camera motion is an unusually challenging factor in HAR that researchers have not addressed very frequently. This undesired motion creates additional misleading patterns that affect the actual action recognition process, so its effects need to be compensated for before feature extraction.
• Inadequate Data: A major drawback in skeleton-vision-based HAR research is that, the domain being comparatively new, the existing datasets fail to address many of the constraints of real-world applications. Very few datasets take camera motion into account, and adequate intra- and inter-class variance and a diverse scaling range are also required of a good HAR dataset.
• Recognition Latency: Recognition latency is an evaluation criterion of skeleton-based HAR in which the number of frames required to recognize an action is treated as a measure of system performance. An important aspect of a good HAR system is recognizing actions accurately with minimal latency, so that real-world applications can deliver more responsive results. Recognition latency and accuracy also depend on the optimization of the recognition network: deep networks need to be optimized in terms of layers, nodes, and internal structure, which involves tuning hyperparameters such as the number of nodes and layers, the dropout probability, and the decay rate.
• Preserving Spatio-Temporal Correlation: Features such as ordering, correlation information, and symmetric relations tend to be tangled by many of the preprocessing steps, and saving spatio-temporal data as an unordered list often jumbles the correlation information of the points. This challenge is addressed by introducing data reorganization into the preprocessing steps, although such steps undeniably increase the complexity of the whole action recognition pipeline.
• Data Annotation: Data annotation is crucial for a skeleton-based HAR system. The existing methods employ supervised approaches that require a massive amount of training data annotated with proper labels, yet collecting and annotating such a volume of data is impractical in many applications. The annotation task itself also has its own contextual complications: many datasets contain ambiguous actions that are bound to be labeled differently by different annotators. Supervised learning techniques therefore face an annotation challenge for context-sensitive information sources such as video (RGB) and depth (+D) data. Unsupervised approaches for learning features and classifying actions [56] are a promising way to deal with this challenge.
• Temporal Segmentation: From continuous skeletal sequences of varying length, it is not easy to segment and cluster the required data frames at different timestamps. Recognizing an action correctly involves extracting an appropriate number of segments from the skeletal sequences, and this number varies across action classes, which makes activity recognition challenging.
• Handling Missing Keypoints: Missing key points in any of the consecutive frames makes recognizing a particular action difficult. One simple solution is to insert zero values for the missing data, but this hurts the system's accuracy. To deal with this challenge, the following two approaches [58] can be adopted (a small sketch is given after this list):
  – Feature Propagation: If there is no significant change in pose between two consecutive timestamps, the features of the missing keypoints at the current timestamp can be imported from the previous one.
  – Keypoint Dropout: A certain percentage of the keypoints are dropped intentionally during the model's training phase to make it robust to missing keypoints.
• Continuity of Class-wise Data: Most of the available skeleton datasets contain class-wise discontinuous data. Recognizing a particular action class requires detecting a sequence of skeletal data frames, which is a challenging task for a skeleton-based action recognition system.
• Bottlenecks of Real-Time Implementation: Action recognition data captured with Kinect or depth sensors are very noisy, which is why skeleton data is often extracted from RGB images instead. However, this requires expensive devices such as sensors, cameras, and computer systems with high-end computational and storage capabilities, which greatly affects the real-time performance of the action recognition system. Devices with low processing speed and little storage increase the latency of the different stages of the recognition pipeline and degrade real-time performance.
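The sketch below, referenced in the Handling Missing Keypoints item, illustrates the two strategies of [58] under our reading of them; the array shapes, dropout rate, and random data are assumptions.

```python
import numpy as np

def propagate_missing(seq, visible):
    """Feature propagation: copy each missing joint from the previous frame,
    assuming the pose changes little between consecutive timestamps.
    seq: (T, N, 3); visible: (T, N) boolean mask."""
    out = seq.copy()
    for t in range(1, len(seq)):
        missing = ~visible[t]
        out[t, missing] = out[t - 1, missing]
    return out

def keypoint_dropout(seq, rate=0.1, seed=0):
    """Keypoint dropout: randomly zero a fraction of joints during training
    so the model becomes robust to genuinely missing detections."""
    rng = np.random.default_rng(seed)
    mask = rng.random(seq.shape[:2]) > rate
    return seq * mask[..., None]

seq = np.random.rand(100, 25, 3)
visible = np.random.rand(100, 25) > 0.05
filled = propagate_missing(seq, visible)
augmented = keypoint_dropout(seq, rate=0.1)
```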
2.10 Conclusion
In this chapter, we provided an overview of the existing scientific grounds and promising technologies in the relevant fields of skeleton-based human activity recognition. Like any other emerging field, scientific growth in this domain is curbed by methodological impediments, and researchers continue to explore the field while tackling these challenges. We began our review by highlighting three of the most promising tracking systems of today: depth sensors, the OpenPose toolbox, and body markers. We then focused on the most widely employed skeleton joint configurations—the 15-, 18-, 20-, 25-, and 32-joint distributions used by skeleton tracking systems. Following this, we delineated different skeleton and action representation methods, and then presented a comprehensive study of the basic preprocessing techniques.
From that point, we concentrated on the fundamental feature extraction techniques—hand-crafted and deep learning features and their subdivisions. The last part of our study marked out the different action recognition methods: machine learning-based recognition, neural network-based recognition, Long Short-Term Memory (LSTM) based recognition, graph-based recognition, and a few distinctive specialized models such as spatial reasoning and temporal stack learning, Deep Progressive Reinforcement Learning (DPRL), convolutional sequence generation, and attention networks.
An in-depth study of the relevant previous works showed that advancing through the steps of a skeleton-based action recognition pipeline is quite challenging, as the heterogeneity of the methods demands dynamic approaches. This chapter provides an overview of the state-of-the-art works in this field for future researchers.
Having explored these works, we can see that there are some multimodal and large datasets on which the existing models do not yet perform efficiently. Moreover, graph-based recognition methods do not yet align well with the multi-view invariant features required by burgeoning real-time applications of these datasets. There is therefore ample scope to work on such multimodal, view-invariant datasets and to explore the effectiveness of deep learning models on them.
Chapter 3
Contactless Human Activity Analysis: An Overview of Different Modalities Farhan Fuad Abir, Md. Ahasan Atick Faisal, Omar Shahid, and Mosabber Uddin Ahmed
Abstract Human Activity Analysis (HAA) is a prominent research field in the modern era, offering the opportunity to monitor regular activities and the surrounding environment on demand. In recent times, Contactless Human Activity Analysis (CHAA) has added a new dimension to this domain, as such systems operate without any wearable device or any kind of physical contact with the user. We have analyzed different modalities of CHAA and arranged them into three major categories: RF-based, sound-based, and vision-based modalities. In this chapter, we present the state-of-the-art modalities, frequently faced challenges along with some probable solutions, and current applications of CHAA together with future directions.
F. F. Abir (B) · Md. A. A. Faisal · M. U. Ahmed
Department of Electrical and Electronic Engineering, University of Dhaka, Dhaka, Bangladesh
e-mail: [email protected]
Md. A. A. Faisal
e-mail: [email protected]
M. U. Ahmed
e-mail: [email protected]
O. Shahid
Department of Robotics and Mechatronics Engineering, University of Dhaka, Dhaka, Bangladesh
e-mail: [email protected]

3.1 Introduction

Human Activity Analysis (HAA) is the field of research concerned with interpreting different human activities, e.g., human motion, hand gestures, and audio signals, from a sequence of sensed data and their neighboring environment. Over the last few years, HAA has been one of the most prominent topics in multiple research fields, including signal processing, machine learning, deep learning, computer vision, and mobile computing.
With the development of this field, new methods of data collection have been rigorously explored, ranging from body-worn sensors to contactless sensing. However, activity analysis based on body-attached devices sometimes becomes inconvenient, or even infeasible, to implement. On the contrary, with the comprehensive development of computing devices, miniaturized sensor systems, and increasingly accurate wired and wireless communication networks, Contactless Human Activity Analysis (CHAA) has been prioritized and improved in different aspects. CHAA refers to the analysis of body posture or any type of physical human activity by means of non-wearable or contactless devices. This approach can be employed in everyday environments such as marketplaces and cafés, because each individual does not need a separate device. Besides, CHAA can be performed at a distance from the individual. Furthermore, CHAA is a more financially feasible way of performing analysis on a mass scale, because a single module can analyze thousands of people in one session, whereas wearable methods would require each individual to possess wearable hardware [1]. In recent years, there has been significant development in high-performance sensors, physiological signal detection, behavior analysis, and gesture recognition systems, which has established CHAA as an impactful research direction in this domain. As wireless transmitters and receivers have become more precise and accurate, WiFi-, RF-, and sound-based techniques have become powerful tools for contactless activity analysis systems. In addition, RGB cameras and other types of cameras, such as infrared and depth cameras, have become affordable. Therefore, vision-based activity analysis is also rendering a great contribution to CHAA.

This chapter presents an introduction to the CHAA modalities. Previously, Ma et al. [2] provided a comprehensive survey on WiFi-based activity analysis. In another work, Wang et al. [3] reviewed a wide range of works on smartphone ultrasonic-based activity analysis; that work described different signal processing techniques, feature extraction methods, and various applications related to this field. Our study, however, is not confined to a specific modality. We have explored the state-of-the-art approaches, including the early modalities, as well as the evaluation of these approaches. Therefore, this chapter will be beneficial to new researchers in this field. The rest of this chapter is organized as follows: Sect. 3.2 explores the evolution of contactless activity analysis systems in detail. Different modalities and techniques related to CHAA are described in Sect. 3.3. Section 3.4 states various challenges and some probable solutions relevant to this field. Section 3.5 summarizes the applications of this domain. Finally, we draw a conclusion in Sect. 3.6.
3.2 Historical Overview

Due to the development of microelectronics and computer systems over the past two decades, there has been a massive increase in research in the field of Human Activity Recognition (HAR). Earlier works on HAR were mostly contact-based or, in other words, utilized some kind of wearable sensor. In the late 1990s, Foerster proposed the very first work on HAR using accelerometer sensors [4]. Although wearable sensor-based approaches are still popular, contactless approaches have gained significant attention in recent years.

RF-based sensing dates back to the 1930s, when the first radar system was developed [5]. Although these systems were designed primarily to detect and track large moving objects, such as aircraft, at long distances, they laid the foundation for modern RF-based activity recognition methods. The use of electromagnetic waves in HAR gained attention because of some major advantages over vision-based systems: no line-of-sight requirement, the ability to work in any kind of lighting environment, and the ability to pass through obstacles. Early on, Doppler radars were developed to detect and track humans through obstacles [6]. WiFi RSSI-based indoor localization of human subjects was introduced in 2000 [7]. The capabilities of RSSI-based systems were limited due to noise and low sensitivity. To tackle this problem, many researchers utilized the Universal Software Radio Peripheral (USRP) in the following years to perform signal processing for tasks such as tracking the 3D movement of human subjects [8] or monitoring heart rate [9]. Activity recognition using commercial WiFi devices became of interest after Channel State Information (CSI) was introduced by Halperin et al. [10]. Although these low-frequency WiFi signal-based methods were very successful for localization and tracking, they lacked the sensitivity to detect the small movements required for some activity and gesture recognition tasks. In 2002, the discovery of micro-Doppler signatures in Doppler radars opened a new dimension in human gait analysis [11]. Using these micro-Doppler features, it was possible to detect human body movements more precisely. In 2016, Google developed a miniature millimeter-wave radar called Soli that was capable of detecting hand gestures with submillimeter-level accuracy [12].

The first ultrasound-based technology was invented by the American naval architect Lewis Nixon in 1906 for detecting icebergs [13]. After proving its usability in obstacle detection, the French physicist Paul Langevin invented an ultrasound-based technology in 1917 for detecting underwater submarines [14]. After years of research and fine-tuning, the usability of ultrasound in short-distance ranging started to come to light. In 1987, Elfes et al. [15] published their work on a mapping and navigation technique based on ultrasound technology, which was used in the navigation system of an autonomous mobile robot named Dolphin. Ultrasound was mostly used in ranging and obstacle detection back then, but after 2000, researchers tended to focus more on using the Doppler shift for activity recognition. Until this point, the research was based on standalone ultrasound devices. The main breakthrough in sound-based activity recognition was the use of smartphones as the detecting device, which enabled the widespread use of ultrasound-based activity recognition modalities. In 2007, Peng et al. [16] presented a high-accuracy
ultrasound-based ranging system using commercial off-the-shelf (COTS) mobile phones. Later, in 2010, Filonenko et al. [17] explored the potential of COTS mobile phones for generating reliable ultrasound signals. As a result, in the last few years, both active and passive smartphone-based activity recognition methods have made breakthroughs.

Apart from this, the vision-based approach has become very popular in this field. In the early years of the twenty-first century, the advancement of computational resources revolutionized research related to image processing. By this time, object recognition and scene analysis provided results with higher accuracy due to the development of sophisticated learning algorithms. Early vision-based activity recognition was based on still images [18–20]. In this case, some activities were recorded using a camera and split into frames to build datasets, and handcrafted feature extraction techniques were later used to feed them into supervised learning algorithms. By the end of the first decade of the twenty-first century, video-based activity recognition had come to light. From the viewpoint of data types, research on video-based human activity analysis can be classified into methods using color (RGB) data [21, 22], depth data, and a combination of color and depth data (RGB-D) [23, 24]. In the early period of machine learning approaches, these data were used to derive handcrafted features that were fed into learning algorithms such as decision trees, support vector machines, and hidden Markov models. Many features have been proposed for RGB data, for example, joint trajectory features [25], spatiotemporal interest point features [26], and spatiotemporal volume-based features [27]. Depth sensors, on the other hand, provided data that were more stable with respect to the background and the environment, thereby enabling faster real-time detection with pose estimation. During the last decade, the introduction of the Kinect sensor provided new insight into depth data and skeleton tracking. Deep learning methods, unlike classical machine learning approaches, work well on video-based activity analysis because they learn features automatically from images, which is more robust and convenient for human activity analysis. Deep learning networks can learn features from single-mode data as well as multimodal data. Moreover, pose estimation methods that learn skeleton features from a scene by applying deep learning have drawn increasing attention for vision-based activity analysis.
3.3 Primary Techniques

Research on Contactless Human Activity Analysis (CHAA) has been going on for quite a while now, and different methods have been developed over the years. Some methods are focused on specific goals, while others are more generalized. These methods can be classified into video-based, RF-based, and ultrasonic-based approaches [2].
3.3.1 RF-Based Approaches

Wireless signals have been used quite extensively for localization, tracking, and surveillance. RF-based approaches have evolved a lot over the years and have recently made their impact in the field of Contactless Human Activity Analysis (CHAA). At a high level, these techniques rely on a radio-frequency transmitter and a receiver. Wireless signals emitted from the transmitter are reflected by the surrounding environment and the subject of interest before reaching the receiver. Useful features are then extracted from the received signal using different tactics and fed into classification algorithms to classify different kinds of human activity. Some RF-based approaches do not need any additional device. These approaches rely on physical (PHY) layer properties such as the Received Signal Strength Indicator (RSSI) and the Channel State Information (CSI) of WiFi. These measurements can be easily extracted from commercially available WiFi Network Interface Cards (NICs); the Intel 5300 [10] and the Atheros 9580 [28] are two examples of such NICs. Some RF-based frameworks require a custom hardware setup such as the Universal Software Radio Peripheral (USRP) to work with Frequency Modulated Carrier Wave (FMCW) signals [8]. Other methods use a similar setup to measure the Doppler shift of the RF signal [29]. Apart from these, many other signal properties can be extracted from both commodity and customized hardware-based devices. However, the most common signal properties used in human activity analysis are signal power (RSSI), channel information (CSI), Doppler shift, and time of flight (ToF). These properties are discussed below in detail.
3.3.1.1 Received Signal Strength Indicator (RSSI)
Wireless signals are electromagnetic waves that propagate through the air at the speed of light. As the signal propagates further from the transmitter, its power decays with distance. This power-distance relationship can be represented as [30]:

$$P_r = \frac{P_t G_t G_r \lambda^2}{(4\pi d)^{\gamma}},$$

where $P_r$ is the received power at distance $d$, $P_t$ is the transmitted power, $G_t$ and $G_r$ are the transmitting and receiving antenna gains, respectively, $\lambda$ is the wavelength, and $\gamma$ is the environmental attenuation factor. This formula is the basis of power-based ranging using wireless signals. RSSI is simply the received power and, in terms of the received voltage $V_r$, it is denoted as [31]:

$$\mathrm{RSSI}\,(\mathrm{dB}) = 20\log(|V_r|^2).$$

Because of its availability in mainstream wireless signal measurements, RSSI has been adopted by numerous localization and activity recognition methods.
Fig. 3.1 Effects of shadowing and multipath on RSSI
Characterizing the signal strength by the above equation, however, does not work in a real-life scenario. In practical cases, the received signal is a superposition of signals taking multiple reflection paths (the multipath effect) and also deviates from the expected value due to obstacles (the shadowing effect), as shown in Fig. 3.1. Seidel et al. took the shadowing effect into consideration and proposed the Log-normal Distance Path Loss (LDPL) model, which can be written as [32]:

$$PL(d)_{dB} = PL(d_0)_{dB} + 10\gamma \log\!\left(\frac{d}{d_0}\right) + X_\sigma,$$
where $PL(d)$ denotes the path loss at distance $d$, $PL(d_0)$ denotes the path loss at the reference distance $d_0$, $\gamma$ is the path loss exponent, and $X_\sigma$ reflects the effect of shadowing. It has been observed that the RSSI value changes dramatically in the presence of a human subject, and the movement of the subject results in fluctuations of the RSSI value [33]. Useful features can be extracted from these RSSI fluctuations and used for activity recognition. Although RSSI measurement has been adopted in many RF-based activity recognition frameworks, it is unstable even in indoor conditions [34].
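To make the relationship concrete, the minimal sketch below inverts the LDPL model to estimate distance from a measured RSSI value. It is an illustration only; the reference path loss, path loss exponent, transmit power, and shadowing level are hypothetical values, not parameters reported in the works cited above.

```python
import numpy as np

def ldpl_path_loss(d, pl_d0=40.0, d0=1.0, gamma=3.0, sigma=0.0, rng=None):
    """LDPL model: PL(d) = PL(d0) + 10*gamma*log10(d/d0) + X_sigma (all in dB)."""
    x_sigma = rng.normal(0.0, sigma) if rng is not None and sigma > 0 else 0.0
    return pl_d0 + 10.0 * gamma * np.log10(d / d0) + x_sigma

def estimate_distance(rssi_dbm, tx_power_dbm=0.0, pl_d0=40.0, d0=1.0, gamma=3.0):
    """Invert the LDPL model: path loss = Tx power - RSSI, then solve for d."""
    path_loss = tx_power_dbm - rssi_dbm
    return d0 * 10.0 ** ((path_loss - pl_d0) / (10.0 * gamma))

# A person moving near the receiver perturbs the RSSI; the induced fluctuation
# of the distance estimate is one simple feature for activity detection.
rng = np.random.default_rng(0)
rssi_trace = [-ldpl_path_loss(3.0, sigma=4.0, rng=rng) for _ in range(100)]
distances = [estimate_distance(r) for r in rssi_trace]
print(f"distance estimate: {np.mean(distances):.2f} m, spread {np.std(distances):.2f} m")
```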
3.3.1.2 Channel State Information (CSI)
Wireless signals can take multiple reflection paths while traveling through the channel. Since the human body is a good reflector of wireless signals, the presence of a human subject adds more reflection paths to the environment. In the case of a moving subject, at a particular instant, some of these paths might interfere constructively or destructively, which creates fluctuations in the RSSI value. The biggest drawback of RSSI-based measurement is its inability to capture this multipath effect: it gives a summed-up view of the situation and fails to capture the bigger picture. Adopted in IEEE 802.11 a/g/n, Channel State Information (CSI) captures the effect of multipath propagation and presents a broader view of the wireless channel. To take the multipath effect into consideration, the wireless channel can be modeled as a temporal linear filter. This filter is called the Channel Impulse Response (CIR), and it relates the transmitted and received signals as follows [35]:
$$y(t) = x(t) \ast h(t),$$

where $x(t)$ is the transmitted signal, $y(t)$ is the received signal, and $h(t)$ is the Channel Impulse Response (CIR). The Channel Frequency Response (CFR) is simply the Fourier Transform of the CIR and is given by the ratio of the received and transmitted signals in the frequency domain. The CSI monitored by WiFi NICs is characterized by the CFR value of the wireless channel. The complex-valued CFR can be expressed in terms of amplitude and phase as follows [36]:

$$H(f, t) = \sum_{i=1}^{N} a_i(f, t)\, e^{-j 2\pi f \tau_i(t)},$$
where $a_i(f, t)$ represents the attenuation and initial phase shift of the $i$th path, and $e^{-j 2\pi f \tau_i(t)}$ is the phase shift associated with the $i$th path with a propagation delay of $\tau_i(t)$. Using the above equation, it is possible to obtain the phase shift of a particular path taken by the wireless signal and perform activity analysis based on that. Unfortunately, due to the hardware mismatch between the transmitter and the receiver, there will always be some non-negligible Carrier Frequency Offset (CFO). This CFO can result in a phase shift as large as $50\pi$ according to the IEEE 802.11n standard, which overshadows the phase shift resulting from body movement [37]. Wang et al. [37] proposed a CSI-Speed model that leverages CSI power to resolve the effect of CFO. This model considers the CFR as a combination of static and dynamic components:

$$H(f, t) = e^{-j 2\pi \Delta f t} \left( H_s(f) + \sum_{i \in P_d} a_i(f, t)\, e^{-j \frac{2\pi d_i(t)}{\lambda}} \right),$$
where $H_s(f)$ is the static CFR component, $d_i(t)$ is the path length of the $i$th path, $\lambda$ is the wavelength, and $e^{-j 2\pi \Delta f t}$ represents the phase shift for the frequency offset $\Delta f$. According to the CSI-Speed model, the CFR power changes when the subject of interest is in motion. The instantaneous CFR power can be expressed as follows:

$$
|H(f, t)|^2 = \sum_{i \in P_d} 2\,|H_s(f)\, a_i(f, t)| \cos\!\left( \frac{2\pi v_i t}{\lambda} + \frac{2\pi d_i(0)}{\lambda} + \phi_{si} \right)
+ \sum_{i, k \in P_d} 2\,|a_i(f, t)\, a_k(f, t)| \cos\!\left( \frac{2\pi (v_i - v_k) t}{\lambda} + \frac{2\pi \big(d_i(0) - d_k(0)\big)}{\lambda} + \phi_{ik} \right)
+ \sum_{i \in P_d} |a_i(f, t)|^2 + |H_s(f)|^2,
$$
where $\frac{2\pi d_i(0)}{\lambda} + \phi_{si}$ and $\frac{2\pi (d_i(0) - d_k(0))}{\lambda} + \phi_{ik}$ are constant values that represent the initial phase offsets. This equation shows that the total CFR power is a combination of a static offset and a set of sinusoidal oscillations. The frequencies of these oscillations are directly related to the speed of the movements that create changes in the path length. Thus, the CSI-Speed model provides a quantitative way of relating CSI power with human activity.
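As a rough illustration of this idea (not the cited authors' implementation), the sketch below takes a stream of complex CFR samples, computes the CFR power time series, reads off the dominant oscillation frequency with an FFT, and maps it to a path-length change rate via $v = f_{\mathrm{osc}}\,\lambda$. The synthetic CSI trace, sampling rate, and carrier wavelength are assumptions made for the example.

```python
import numpy as np

def dominant_speed_from_csi(cfr_samples, sample_rate_hz, wavelength_m):
    """Estimate the dominant path-length change rate (m/s) from complex CFR samples,
    following the CSI-Speed observation that |H(f,t)|^2 oscillates at v / lambda."""
    power = np.abs(cfr_samples) ** 2                 # instantaneous CFR power
    power = power - power.mean()                     # remove the static offset
    spectrum = np.abs(np.fft.rfft(power * np.hanning(len(power))))
    freqs = np.fft.rfftfreq(len(power), d=1.0 / sample_rate_hz)
    f_osc = freqs[np.argmax(spectrum[1:]) + 1]       # skip the DC bin
    return f_osc * wavelength_m                      # v = f_osc * lambda

# Synthetic example: one reflected path whose length changes at 0.5 m/s,
# observed at 5 GHz (lambda ~ 0.06 m) with 1 kHz CSI sampling.
fs, lam, v_true = 1000.0, 0.06, 0.5
t = np.arange(0, 2.0, 1.0 / fs)
cfr = 1.0 + 0.3 * np.exp(-1j * 2 * np.pi * (v_true / lam) * t)  # static + dynamic path
print(f"estimated speed: {dominant_speed_from_csi(cfr, fs, lam):.2f} m/s")
```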
3.3.1.3 Doppler Shift
Doppler shift is characterized by the frequency shift of wireless signals when the transmitter and the receiver are moving relative to each other. Doppler radars work on this principle to track a moving object. This technique has been adopted by many security surveillance, object tracking, and navigation systems. Recently, Doppler radars have caught the attention of many researchers for the purpose of human detection and activity recognition. Since the human body reflects wireless signals, it can be thought of as a wireless signal transmitter [38]. Thus, human activities that involve motion create Doppler shifts in the wireless signals. By measuring these frequency shifts, it is possible to detect the activity or gesture. When the subject is moving towards the receiver, the resulting Doppler shift is positive, and when the subject is moving away from the receiver, the resulting Doppler shift is negative. More generally, when a point object is moving at an angle of $\phi$ from the receiver with velocity $v$, the resulting Doppler shift is expressed as [39]:

$$\Delta f = \frac{2 v \cos(\phi)}{c}\, f,$$
where $f$ is the carrier frequency of the wireless signal and $c$ is the speed of light. From this equation, we can see that a human activity involving a motion of 0.5 m/s will create a maximum frequency shift of 17 Hz for a 5 GHz wireless signal, which is very small compared to the typical wireless signal (e.g., WiFi) bandwidth of 20 MHz. WiSee [38], a whole-home gesture recognition system, solves this issue by splitting the received signal into narrowband pulses of only a few Hertz. Human activity recognition requires capturing not only the whole-body motion but also the limb motion. It has been observed that limb motion adds micro-Doppler signatures to the received signal [11]. These signatures are considered to be the key feature for activity and gesture recognition in Doppler radar-based frameworks. Figure 3.2 shows a hypothetical example of a Doppler spectrogram resulting from walking activity. In the figure, we can differentiate the micro-Doppler signatures due to limb motion from the Doppler shift due to body motion. Useful features can be handcrafted from these spectrograms and used in classical machine learning algorithms [29], or the spectrograms can be used directly as the input to Deep Convolutional Neural Networks (DCNN) [40] for activity classification. Higher frequency signals produce a higher Doppler shift, which is essential for capturing small movements.
Fig. 3.2 Sample Doppler spectrogram of “walking” activity. This shows the Micro-Doppler features due to limb movement
Google's project Soli [12] leverages a 60 GHz radar to achieve high temporal resolution and classify hand gestures with submillimeter accuracy. This method is only useful when the subject of interest is in motion, because a stationary subject produces no Doppler shift.
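The Doppler relation above is simple enough to evaluate directly. The short sketch below computes the shift for a few assumed motion speeds and carrier frequencies (the numbers are illustrative only) and reproduces the roughly 17 Hz figure quoted for a 0.5 m/s motion at 5 GHz.

```python
# Doppler shift of a signal reflected by a point object moving at angle phi
# relative to the receiver: delta_f = (2 * v * cos(phi) / c) * f
import math

C = 3.0e8  # speed of light (m/s)

def doppler_shift(v_mps, carrier_hz, phi_rad=0.0):
    return (2.0 * v_mps * math.cos(phi_rad) / C) * carrier_hz

for v, f in [(0.5, 5e9), (0.5, 60e9), (1.5, 5e9)]:
    print(f"v = {v} m/s, f = {f / 1e9:.0f} GHz -> shift ~ {doppler_shift(v, f):.1f} Hz")
```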
3.3.1.4 Time of Flight (ToF)
Time of Flight (ToF) refers to the time it takes for the transmitted signal to reach the receiver. ToF provides a way to measure the distance of the subject from the sensing device and can therefore be a useful measurement for human activity analysis. Since wireless signals propagate at the speed of light, it is very difficult to measure the ToF directly. A Frequency Modulated Carrier Wave (FMCW) radar sweeps the carrier frequency in a sawtooth fashion. The reflected wave captures the ToF in the form of a frequency shift of the carrier signal [41]. Let $f_x(t)$ be the transmitted signal, $f_y(t)$ the received signal, and $\Delta f$ the frequency shift introduced after reflecting back from a human subject. The ToF $\Delta t$ can be expressed as:

$$\Delta t = \frac{\Delta f}{m},$$
where $m$ is the slope of the frequency sweep. Unlike direct measurement, we can easily measure $\Delta f$ from an FMCW radar and thus calculate the ToF [42]. Figure 3.3 demonstrates this property of FMCW operation. A number of activity recognition methods have been developed by utilizing the FMCW concept. WiTrack [8] is a 3D human tracking and activity recognition framework that leverages the FMCW technique. Implemented on USRP hardware, WiTrack can track a human subject with centimeter-level accuracy. Vital-Radio [9] utilizes FMCW-based distance measurement to monitor breathing and heart rate wirelessly.
Fig. 3.3 Frequency shift between the transmitted and the received signal in FMCW operation
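The sketch below turns the FMCW relation above into a distance estimate: the measured beat frequency gives $\Delta t = \Delta f / m$, and the range follows from $d = c\,\Delta t / 2$ for the round trip. The sweep bandwidth, sweep duration, and beat frequency are assumed example values, not parameters of WiTrack or Vital-Radio.

```python
C = 3.0e8  # speed of light (m/s)

def fmcw_range(beat_freq_hz, sweep_bandwidth_hz, sweep_duration_s):
    """Range from an FMCW beat frequency: ToF = delta_f / m, distance = c * ToF / 2."""
    slope = sweep_bandwidth_hz / sweep_duration_s   # m, the slope of the frequency sweep
    tof = beat_freq_hz / slope                      # delta_t = delta_f / m
    return C * tof / 2.0                            # halve for the round trip

# Example: a 1.69 GHz sweep over 2.5 ms and a 13.52 kHz beat frequency -> ~3 m range.
print(f"{fmcw_range(13_520, 1.69e9, 2.5e-3):.2f} m")
```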
All the signal properties for human activity analysis discussed above come with their own set of advantages and drawbacks. Although RSSI-based frameworks are the simplest, they suffer from instability and low accuracy. CSI provides more detailed information about human activity, but processing the CSI stream can be difficult. Doppler- and FMCW-based frameworks require custom hardware and cannot be implemented using Commercial-Off-The-Shelf (COTS) devices. The choice of frequency also plays a vital role in RF-based systems. Higher frequency signals provide better sensitivity, but their range is small compared to low-frequency signals. A robust RF-based human activity recognition framework should take these matters into consideration.
3.3.2 Sound-Based Approaches

Human Activity Recognition (HAR) using sound signals has recently been explored quite extensively along with the other approaches. Based on the range of usable sound frequencies for HAR, we divide the study into two subgroups: ultrasound signal-based and audible signal-based approaches. Though the ultrasound range is widely used for activity analysis and recognition, the study of audible sound in this field has only recently become a topic of interest. Based on the necessary components, we can divide the sound-based approaches of contactless human activity analysis into two major categories: active detection methods and passive detection methods. An active method consists of a sound generator and a receiver, while a passive method consists of the receiver only. In most applications, ultrasound-based active methods are used because they do not hamper the day-to-day activities of humans. On the other hand, the audible range is mostly used with passive methods. Sound-based human activity analysis can be accomplished with different modalities and implementation techniques. The choice of modality depends mostly on the application and the hardware. Different modalities, along with their applications, are discussed in the following part.
3.3.2.1 Time of Flight (ToF)
ToF is a straightforward and widely used technique for ultrasound-based ranging. In the last decade, a good number of researchers have explored the potential of this method for detecting small changes in the distance of an object, down to the millimeter level. The increased ranging accuracy has opened new dimensions for activity recognition research. The basic principle of ToF is similar in RF-based and sound-based ToF, but there are some differences in the measurement and analysis techniques. Compared to RF-based ToF, sound-based ToF measurement is more direct. The transmitter transmits ultrasound pulses, which are reflected back by a nearby human in the line of sight of the signal propagation. The propagation velocity of a sound wave is very low compared to that of an RF signal. Hence, in contrast to RF-based ToF, the processing unit associated with the receiver in sound-based ToF computes the ToF directly by comparing the transmitted and received signals. This process is based on the following equation [43]:

$$D = \frac{1}{2} v_s \Delta t,$$
where $D$, $\Delta t$, and $v_s$ denote the total distance, the time of flight (ToF), and the propagation velocity of sound, respectively. Small changes in distance occur during any physical activity, and based on these small changes, a signal pattern associated with a certain activity can be found. Different features can be extracted from the ToF signal, such as the Coefficient of Variation (COV) [44, 45], Range [45], and the Linear Regression Zero-crossing Quotient (LRZQ) [44]. Afterward, the activities can be classified using classification models. Griffith et al. [45] used a Support Vector Machine (SVM), and Biswas et al. [44] compared Naive Bayes, Logistic Regression, and Sequential Minimal Optimization (SMO) to classify human locomotor activities. Al-Naji et al. [43] explored a different domain where they used ToF to measure the displacement of the thorax during inhale and exhale motion. Based on that, the breathing rate (BR) was measured. Afterwards, by comparing the BR of healthy subjects and respiratory patients, they identified different respiratory problems. Researchers have used this technique in a wide range of contactless human activity analysis applications.
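A minimal sketch of the measurement itself is given below: the time of flight is estimated by cross-correlating the received echo with the transmitted ultrasound pulse, and the distance then follows from $D = \frac{1}{2} v_s \Delta t$. The pulse shape, sampling rate, and simulated echo distance are assumptions made for this example.

```python
import numpy as np

V_SOUND = 343.0  # propagation velocity of sound in air (m/s), assumed at ~20 C

def tof_distance(tx, rx, sample_rate_hz, v_sound=V_SOUND):
    """Estimate the echo delay by cross-correlation, then D = 0.5 * v_s * delta_t."""
    corr = np.correlate(rx, tx, mode="full")
    lag = np.argmax(corr) - (len(tx) - 1)            # samples between transmit and echo
    delta_t = lag / sample_rate_hz
    return 0.5 * v_sound * delta_t

# Simulated example: a 40 kHz pulse sampled at 192 kHz, echo from ~1 m away.
fs = 192_000
t = np.arange(0, 1e-3, 1 / fs)
pulse = np.sin(2 * np.pi * 40_000 * t) * np.hanning(len(t))
true_delay = int(round((2 * 1.0 / V_SOUND) * fs))    # round-trip delay for 1 m
rx = np.zeros(4096)
rx[true_delay:true_delay + len(pulse)] += 0.4 * pulse
print(f"estimated distance: {tof_distance(pulse, rx, fs):.3f} m")
```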
3.3.2.2 Doppler Effect
The Doppler shift is another well-known activity analysis technique, which is used in both RF-based and sound-based modalities. In both cases, the principle is the same: a frequency shift due to the movement direction and velocity of an object relative to the transmitter. The differences lie in the type of required hardware, the analysis techniques, and the applications. With the development of sophisticated Doppler sensors and efficient processing components, nowadays very small changes in frequency can be detected.
Fig. 3.4 Sample spectrogram of walking toward the device and away from the acoustic transmitter
This has enabled the modality to analyze human activities ranging from walking patterns to the smallest finger movements. Like ToF, the sensing part of the system consists of a transmitter and a receiver. In some cases, the time-domain signal is frequency demodulated before the frequency-domain conversion [46–49]. Most researchers have used the Fast Fourier Transform (FFT) for this purpose. Moreover, the Discrete Cosine Transform (DCT) [48, 49] is also used, which, unlike the Fourier Transform, consists only of real components. Kalgaonkar et al. [47] used the Goertzel algorithm to get the energy information from a narrow bandwidth (50–120 Hz), which is more efficient than the FFT for narrow bandwidths. The spectrogram is a very useful tool to visualize the Doppler effect in the frequency domain and analyze the correlation between the activities and the frequency information. Figure 3.4 shows a sample spectrogram with the frequency shift due to walking. The raw frequency-domain signal is very difficult to work with and needs to be processed further. Firstly, a calibration method is often used in order to adjust the frequency response of the device. In the case of two or more transmitters, the frequency information can be transformed into Euclidean space to get a velocity vector using the angles of the transmitters [50]. A few challenges in this step are attenuating the noise, eliminating multi-path echoes, and adjusting for device diversity. For these purposes, FFT normalization [51–53], squared continuous frame subtraction [52], Gaussian smoothing [52], and low-pass filtering [46–49] are commonly used. The preprocessed frequency-domain signal is then segmented using different windowing techniques. 512-point [49], 1024-point [46–48], and 2048-point [52, 54] Hamming windows are used in most studies. Using more points per window provides more accuracy but takes more time to process. Features are subsequently extracted from each frequency
bin for classification. Some common features are direction, duration, average speed, spatial range [52, 54], and Power Spectral Density (PSD) [49, 55]. Afterward, the feature data are passed through classification algorithms such as the Gaussian Mixture Model (GMM) [46, 48, 49], Naive Bayes (NB) [51], Random Forest (RnF) [50, 51], Support Vector Machine (SVM) [51], and so on. Fu et al. [51] also implemented a Convolutional Neural Network (CNN), which achieved higher accuracy than RnF, NB, and SVM. Moreover, some researchers used their own mathematical models [52, 54, 55] for classification, which also showed good results.
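The sketch below illustrates the front end of such a pipeline: a received microphone signal is segmented with 1024-point Hamming windows and converted to a spectrogram, and the frequency bins around an assumed 20 kHz carrier are kept as the Doppler band of interest. The carrier frequency, sampling rate, and band width are example assumptions, not values tied to any one of the cited systems.

```python
import numpy as np
from scipy.signal import spectrogram

def doppler_band(received, fs, carrier_hz=20_000, half_band_hz=200, n_fft=1024):
    """Spectrogram with 1024-point Hamming windows, cropped to the band around the
    carrier where Doppler shifts from body motion are expected to appear."""
    freqs, times, sxx = spectrogram(received, fs=fs, window="hamming",
                                    nperseg=n_fft, noverlap=n_fft // 2)
    keep = (freqs >= carrier_hz - half_band_hz) & (freqs <= carrier_hz + half_band_hz)
    return freqs[keep], times, sxx[keep, :]  # per-bin features (e.g., PSD) come from sxx

# Synthetic check: a 20 kHz carrier plus a weak echo shifted by +60 Hz (approaching motion).
fs = 48_000
t = np.arange(0, 1.0, 1 / fs)
rx = np.sin(2 * np.pi * 20_000 * t) + 0.1 * np.sin(2 * np.pi * 20_060 * t)
freqs, times, sxx = doppler_band(rx, fs)
print(f"kept {len(freqs)} frequency bins over {sxx.shape[1]} time frames")
```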
3.3.2.3 Phase Shift
The phase of a wave is another piece of information that can be used in activity recognition. The phase changes if the wave is reflected by a moving object in its propagation path. Path length calculation based on the phase shift shows better ranging accuracy than ToF-based methods, and compared to Doppler shift calculation, this method has lower latency. Wang et al. [56] used a trajectory-tracking method named Low Latency Acoustic Phase (LLAP), which analyzes the change in phase of the ultrasound signal reflected by moving fingers. The In-phase (I) and Quadrature (Q) components of the received signal change with the hand movements. The complex signal consists of two vectors: the static vector represents the signal when there is no hand movement, and the dynamic vector represents the signal corresponding to the hand movement. The received signal is transformed into two versions: one is multiplied by $\cos(2\pi f t)$ and the other by the phase-shifted version $-\sin(2\pi f t)$. Each signal is then passed through a Cascaded Integrator-Comb (CIC) filter, which removes the high-frequency components and provides the In-phase (I) and Quadrature (Q) components of the received signal. They used a novel algorithm named Local Extreme Value Detection (LEVD), which takes the I/Q components separately and estimates the real and imaginary parts of the static vector. Afterward, to derive the dynamic phase $\phi_d(t)$, the static vector is subtracted from the baseband signal. The path length can be calculated from the equation below [56]:

$$d(t) - d(0) = -\frac{\phi_d(t) - \phi_d(0)}{2\pi}\,\lambda,$$
where the left side of the equation denotes the total path length in t time, φ(0) and φ(t) denotes the initial signal phase and the phase at time t. This gesture tracking method provides average accuracy up to 3.5 mm and 4.6 mm for 1-D and 2-D hand movements respectively. Moreover, in smartphones, processing latency of LLAP is about 15 ms [56]. Nandakumar et al. [57] proposed another phase shift based novel system, FingerIO, which can detect small changes in finger movement. This method implemented the properties of Orthogonal Frequency Division Multiplexing (OFDM) to detect the small shift in the reflected signal more precisely. It is shown in Fig. 3.5a. The system
Fig. 3.5 Implementation of OFDM properties with phase shift measurement for sub-centimeter movement detection: a OFDM signal structure; b change in echo profile during gesture performance
The system generates OFDM signals in the 18–20 kHz range. The generated signal consists of an 84-sample OFDM symbol followed by 200 samples of silence as a cyclic suffix; this silent period is long enough for any reflection within a 1 m range to return to the receiver. To generate the echo profile, the received signal is correlated with the transmitted signal. A sample echo profile is shown in Fig. 3.5b, where each peak denotes the start of an echo, so the distance of the reflecting object can be read from the profile. This process gives a distance measurement error of 1.4–2.1 cm. Finger movement is identified by comparing two consecutive echo profiles: when the finger moves, the corresponding peak in the echo profile shifts, and the movement is detected by comparing this shift against a threshold. To refine the measurement, an FFT is applied to the approximate region of the peak change in the echo profile, and the linear phase shift of the FFT output is used to estimate the beginning of the echo with an accuracy of up to 0.8 cm.
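The following sketch illustrates the core idea behind phase-based tracking: coherent I/Q demodulation of a continuous-wave tone followed by the path-length relation given above. It is a simplified stand-in for LLAP rather than the authors' implementation; in particular, it replaces the CIC filter with an ordinary low-pass filter, omits the LEVD static-vector estimation, and assumes a 48 kHz sampling rate and a 20 kHz carrier.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 48_000   # audio sampling rate in Hz (assumed)
F0 = 20_000   # near-inaudible carrier frequency in Hz (assumed)
C = 343.0     # speed of sound, m/s

def path_length_change(rx, fs=FS, f0=F0):
    """Track the change in reflection path length from the phase of a
    continuous-wave tone via I/Q demodulation. The static (leakage) vector
    removal done by LEVD in LLAP is omitted for brevity."""
    n = np.arange(len(rx))
    i = rx * np.cos(2 * np.pi * f0 * n / fs)     # in-phase product
    q = rx * -np.sin(2 * np.pi * f0 * n / fs)    # quadrature product
    sos = butter(4, 200, fs=fs, output='sos')    # keep only slow motion terms
    i, q = sosfiltfilt(sos, i), sosfiltfilt(sos, q)
    phase = np.unwrap(np.angle(i + 1j * q))      # phi_d(t), up to a constant
    wavelength = C / f0                          # about 17 mm at 20 kHz
    # d(t) - d(0) = -(phi_d(t) - phi_d(0)) / (2*pi) * lambda
    return -(phase - phase[0]) / (2 * np.pi) * wavelength
```

The returned array approximates d(t) − d(0) sample by sample; a hand moving a few millimeters changes the echo path by roughly twice that amount, which is what makes millimeter-level gesture tracking feasible.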
3.3.2.4
Passive Signal Processing
The principle of this detection method is to receive and analyze the acoustic signals produced by the human activity itself, e.g., a keystroke. The framework consists only of an acoustic receiver. By analyzing the received signal in both the time and frequency domains, useful features can be derived and used to classify certain activities. This approach to human activity analysis has been explored in the last few years mainly because of improved classification tools. UbiK, a text-entry method proposed by Wang et al. [58], detects keystrokes on a surface. Taking advantage of the dual microphones in mobile phones, the system extracts multipath features from the recorded audio and locates each stroke on a mapped keyboard outline. A sample setup is shown in Fig. 3.6a.
Fig. 3.6 Sample setup and ASD profiles of an acoustic-based keystroke detection method using a printed keyboard on a wooden surface: a passive method for keystroke detection; b Amplitude Spectral Density (ASD) profiles
The user taps all the keys on the printed keyboard, and the taps are recorded as training strokes. Afterward, keystrokes can be detected with the authors' novel keystroke localization algorithm based on the distinct acoustic features of each key. However, the framework has two basic conditions for working properly: the surface must produce an audible sound when a keystroke occurs, and the positions of the printed keyboard and the mobile phone cannot be changed. The Amplitude Spectral Density (ASD) profiles of the acoustic signals provide distinct information about the tapped key location; the difference between the ASD profiles of two different keystrokes is presented in Fig. 3.6b. Chen et al. [59] proposed another system, Ipanel, which records, analyzes, and classifies the acoustic signals produced by a finger sliding on a surface near the recorder. The system can classify surface writing of 26 letters, 10 digits, and 7 gestures with an overall recognition accuracy above 85%. Both time- and frequency-domain features are extracted from the acoustic signals. First, ASD was used as a frequency-domain feature, but it did not yield good overall accuracy with most machine learning classifiers. Next, Mel-Frequency Cepstral Coefficient (MFCC) features with a K-Nearest Neighbors (KNN) classifier achieved 92% classification accuracy for the gestures alone. Finally, spectrograms of the acoustic signals were fed into a Convolutional Neural Network (CNN) as images, which gave the best overall accuracy among the three. In another study, Du et al. [60] proposed a deep-learning-based handwriting recognition system named WordRecorder, which detects words written with pen and paper from the acoustic signals made during writing. The acoustic signal is recorded by a smartwatch (Huawei Smartwatch 1) and sent to a smartphone, which preprocesses it using DC component removal, zero padding, and low-pass filtering. The processed signals are then passed through a CNN that recognizes the writing letter by letter, and an additional word-suggestion module converts the predicted letters into words.
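As a rough illustration of this passive, learning-based pipeline (and not a reproduction of UbiK, Ipanel, or WordRecorder), the sketch below averages MFCCs over each recording and classifies them with KNN, mirroring the MFCC-plus-KNN combination reported for Ipanel. The librosa and scikit-learn libraries, the 16 kHz sampling rate, and the file/label lists are all assumptions.

```python
import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def mfcc_features(wav_path, sr=16_000, n_mfcc=13):
    """Average the MFCCs over time to get one fixed-length vector per recording."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def train_gesture_knn(train_files, train_labels, k=5):
    """train_files/train_labels are assumed lists of recorded taps or
    finger-slide sounds and their gesture labels, collected beforehand."""
    X = np.stack([mfcc_features(f) for f in train_files])
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X, train_labels)
    return clf

def predict_gesture(clf, wav_path):
    """Classify a new recording with the trained KNN model."""
    return clf.predict(mfcc_features(wav_path)[np.newaxis, :])[0]
```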
Among the sound-based modalities of CHAA discussed above, the first three (ToF, Doppler shift, and phase shift) are active modalities. To minimize interference from environmental audible sound and to avoid a continuous audible chirp, the ultrasound range is preferred over the audible range. The passive signal processing modality, by contrast, relies on the audible signal produced by the human activity itself, so the receiver must be close to where the activity is performed. Moreover, ToF and Doppler shift can be employed with both RF and sound signals; the choice of signal type depends on the application scope, the available hardware, and the environmental dynamics. All four modalities can be implemented with specialized hardware or with general-purpose devices such as smartphones, laptops, and tablets. To increase usability, recent research has focused more on implementing these modalities with general-purpose devices.
3.3.3 Vision-Based Approach

The vision-based approach has been a very popular and powerful tool for contactless activity recognition over the last decade owing to its usability and accessibility. Vision-based approaches can be classified into two types depending on the data they use: video data and skeleton data.
3.3.3.1
Video-Based Activity Recognition
A video is a sequence of frames, so to understand a video we first need to extract its frames. Different methods are then applied to the extracted frames to identify where and what is shown in them. Figure 3.7 shows a simple working diagram of video-based activity recognition. Vrigkas et al. [61] presented a comprehensive survey of research relevant to vision-based activity analysis and classified the approaches into two main categories: unimodal and multimodal. Unimodal methods, which use data from a single modality, are further divided into space-time-based, rule-based, shape-based, and stochastic methods. Multimodal approaches, on the other hand, use data from different sources and are grouped into three types: social networking, behavioral, and affective methods. The difficulty of video-based activity recognition also depends heavily on the viewpoint: single-viewpoint datasets are less challenging for classification because they involve less environmental noise, whereas multi-view datasets are more challenging but also more robust and realistic. In vision-based HAR, the first task is to segment out the background using object segmentation algorithms. Following segmentation, distinct characteristics of the scene are extracted as a feature set [62] and used as input for classification. Video object segmentation methods are of two types depending on the environment: background construction-based methods and foreground extraction-based methods.
Fig. 3.7 Workflow diagram of video-based HAR
In background construction-based methods, the camera remains static relative to the background, so all background information can be obtained in advance and the resulting model performs well for object segmentation. If videos are captured with a moving camera, however, the scene may contain variable and dynamic features; the background model then cannot be built in advance [63] and should instead be learned online. The experimental setup strongly influences how the methodology should be designed to obtain a proper classification technique: background, illumination, static and dynamic features of the environment, and camera view all matter. Figure 3.8 shows an ideal setup for multi-view video-based HAR. Light dependency is one of the challenges faced by traditional cameras, as most of them are not illumination invariant. The development of depth cameras has greatly improved night vision and has become one of the most popular options for surveillance. Liu et al. [64] reviewed different approaches to activity recognition using RGB-D (RGB + depth) data.
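A minimal sketch of the background construction-based pipeline for a static camera is given below, using OpenCV's MOG2 background subtractor. This is an illustrative choice rather than a method taken from the cited works, and the parameter values are assumptions.

```python
import cv2

def extract_foreground_masks(video_path, history=500, var_threshold=16):
    """Segment the moving foreground from a static-camera video, the typical
    first step of the background construction-based pipeline."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(
        history=history, varThreshold=var_threshold, detectShadows=True)
    masks = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = subtractor.apply(frame)    # 0 = background, 127 = shadow, 255 = foreground
        mask = cv2.medianBlur(mask, 5)    # suppress salt-and-pepper noise
        masks.append(mask)
    cap.release()
    return masks
```

The resulting masks isolate the moving person; scene features computed inside the mask can then feed the classifier, as described above.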
3.3.3.2
Skeleton Data-Based Activity Recognition
Skeleton data simply refers to the joint coordinates of the human body; every joint has a position described by three coordinate values. Existing skeleton-based action recognition approaches can be categorized into two types based on the feature extraction technique: joint-based approaches and body part-based approaches. Joint-based approaches treat the human skeleton simply as a set of coordinates; they were inspired by Johansson's classic moving lights display experiment [65].
Fig. 3.8 Experimental setup for multi-view video-based HAR
These approaches predict motion through a model built from individual joint coordinates or combinations of joints, using features such as joint angles together with positions, joint orientations with respect to a particular axis, and pairwise relative joint positions. Body part-based approaches, in contrast, consider the skeleton as a connected set of rigid segments; they either predict the temporal sequence of individual body parts or focus on connected pairs of rigid segments and predict the temporal sequence of joint angles. In Fig. 3.9, we have classified the methods for 3D action representation into three major categories: joint-based representations, which capture the relation between body joints by extracting alternative features; mined joint-based descriptors, which help distinguish characteristics among actions by learning which body parts are involved; and dynamics-based descriptors, which model the dynamic sequences of 3D trajectories. Joint-based representations can be further categorized into spatial descriptors, geometric descriptors, and key-pose-based descriptors, depending on the characteristics of the extracted skeleton sequences. The background setup and camera viewpoint are very important when collecting skeleton data. Generally, skeleton data is collected using a Microsoft Kinect sensor or another depth-measuring camera, and illumination is a major factor in skeleton tracking: too much light is undesirable when collecting skeleton data. In general, the experimental setup is kept free from background complexity, dynamic features, or other disturbances, an environment that can be created in a studio or laboratory. In real-world scenarios, however, there are problems such as background complexity, dynamic features, varying subjects, and interruptions from other objects.
Fig. 3.9 Experimental setup to collect skeleton data
These problems call for different solutions; in a laboratory setup, only ideal skeleton data is considered. Although open vision-based datasets are relatively scarce, there are many publicly available datasets for research purposes that have been recorded in constrained environments. For example, RGBD-HuDaAct [66], Hollywood [67], Hollywood-2 [68], and UCF Sports [69] are datasets consisting of interactive actions between humans, all of them RGB datasets. In contrast, Weizmann [70], MSRC-Gesture [71], and NTU-RGBD [72] are popular datasets of highly primitive daily activities. Furthermore, a few datasets of medical activities are accessible for research purposes, such as USC CRCNS [73], HMD [74], and UI-PRMD [75].
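To make the joint-based features mentioned above concrete, the sketch below computes relative joint positions and a few joint angles from one frame of skeleton data. The (J, 3) input layout and the joint indices are assumptions that depend on the sensor or dataset being used.

```python
import numpy as np

def relative_joint_positions(joints, root=0):
    """Express every joint relative to a root joint (e.g., the hip).
    `joints` is a (J, 3) array of 3D coordinates for one frame."""
    return (joints - joints[root]).ravel()

def joint_angle(a, b, c):
    """Angle at joint b (radians) formed by segments b->a and b->c,
    e.g., an elbow angle from shoulder, elbow and wrist coordinates."""
    v1, v2 = a - b, c - b
    cosang = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cosang, -1.0, 1.0))

def frame_features(joints, triplets=((4, 5, 6), (8, 9, 10))):
    """Concatenate relative positions with a few joint angles.
    The indices in `triplets` are placeholders; they depend on the
    skeleton layout of the capture device or dataset."""
    angles = [joint_angle(joints[i], joints[j], joints[k]) for i, j, k in triplets]
    return np.concatenate([relative_joint_positions(joints), np.array(angles)])
```

Stacking such per-frame vectors over time yields the temporal sequences that joint-based and body part-based models operate on.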
3.4 Frequent Challenges

Three of the most widely used contactless activity recognition modalities have been discussed in the previous sections. One modality may show the greatest accuracy in a controlled (i.e., laboratory) environment but lose performance in real-life scenarios; another may work best in real-life scenarios but at an infeasible computational cost.
Table 3.1 Challenges of different modalities

Challenges                           RF-based modalities   Sound-based modalities   Vision-based modalities
Line of sight                        Moderate              Crucial                  Crucial
Ambient condition                    Moderate              Moderate                 Crucial
Multipath effect                     Crucial               Crucial                  Not applicable
System robustness                    Variable              Variable                 Variable
Range constraint                     Crucial               Crucial                  Moderate
Noise cancellation                   Crucial               Crucial                  Moderate
Device dependency                    Moderate              Moderate                 Minimal
Intra and inter class variability    Present               Present                  Present
Dataset unavailability               Crucial               Crucial                  Moderate
As Table 3.1 shows, each modality has its own advantages and disadvantages over the others. The most frequent challenges of these approaches are discussed below.

Line-of-Sight (LoS) Requirement: Some contactless activity recognition methods require the sensing device and the subject to be in line of sight of each other to work properly. Vision- and ultrasound-based methods work only under LoS conditions. Although some RF-based approaches can operate in non-line-of-sight (NLoS) conditions, their accuracy drops drastically there [8].

Ambient Condition: Video-based approaches depend heavily on the background and illumination conditions of the environment. Below a threshold level of illumination, the whole framework may fail [76]. Variations in ambient conditions need to be handled algorithmically so that the system becomes illumination invariant.

Multipath Effect: RF- and sound-based human localization and activity recognition systems rely on analysis of the received signal, and the transmitted signal can take multiple paths to reach the receiver. Some of these paths are created by reflections from stationary objects such as walls or ceilings, and some by reflections from the actual human subject. Separating the signals reflected by the subject can be challenging, and the presence of multiple humans makes things even more complicated.

System Robustness: It is very difficult to detect an activity while simultaneously tracking a subject. The usability and accuracy of these systems largely depend on the distance and velocity of the subject with respect to the sensing module [59]. Some existing works on gesture recognition and vital sign measurement require the human subject to stay stationary during activity analysis [9, 43].

Range Constraint: Millimeter-wave-based systems are very successful at detecting small movements or gestures, but the range of these high-frequency signals is comparatively low. Existing works require the subject to be in close proximity to the transmitter.
Any obstacle between the transmitter and the subject may cause system failure [77]. WiFi-based frameworks usually work indoors and are limited by the range of the WiFi router. Sound-based approaches also face range issues: active ultrasound-based detection methods work up to about 4 m from the device [16], whereas for micro hand-gesture recognition the working range is below 30 cm [56].

Noise: Removing noise from the received signal is a key part of these activity recognition methods. The noise can be both internal and external, and it is impossible to remove it completely. Electromagnetic waves in the environment appear as noise to RF-based approaches, and selecting a proper noise threshold can be very challenging. For sound-based approaches, sound signals from other equipment can occupy the same frequency range and degrade performance [60].

Device Dependency: Some WiFi-based activity recognition methods relying on RSSI or CSI can be completely device-free, whereas Doppler- and ToF-based approaches require custom hardware such as the Universal Software Radio Peripheral (USRP) [42]. Sound-based approaches use both COTS devices such as laptops, smartphones, and tablets, and custom-made devices. The custom-made devices mostly use ultrasound signals around 40 kHz [45–49], while smartphone and other COTS device-based methods use the 18–22 kHz range [52, 53, 58, 60]; the device is chosen based on the application. Vision-based approaches need a camera, whether an RGB or a depth camera, depending on the application.

Intra- and Inter-Class Variability: In vision-based approaches, an action may show large variance within its own class. For example, in sign language data, some signs may appear in very different scenes or bear only slight resemblance to other instances of the same sign, and such variability is challenging to handle [78]. Conversely, different action classes may share similarities, which makes them difficult to distinguish.

Dataset Unavailability: One of the most common problems of this research domain is the lack of standardized datasets. In most cases, datasets are recorded with personal research instruments and are not open for everyone to use, especially for RF- and sound-based CHAA. Furthermore, there are privacy issues in giving access to video or image data, as they contain personal information about the participants; this challenge becomes most severe for medical activity analysis [79], so openly accessible medical data are very rare.

Given the diverse challenges of each modality, any researcher or developer must evaluate the application criteria carefully at the very beginning and choose the optimal modality accordingly.
3.5 Application Domain

The above-mentioned modalities can be applied for many different purposes.
With the development of smart appliances and the IoT, the field of Contactless Human Activity Analysis (CHAA) has widened considerably in the last few years, and application-oriented CHAA studies have come into focus. Different application fields of the three CHAA modalities are discussed in this section.

Safety Surveillance: Surveillance cameras can detect many types of unusual activity through vision-based activity analysis. Advances in computer vision make it possible to track walking patterns and eye movements, which are significant tools for surveillance; for example, someone carrying a gun in a crowd or running from a scene after a theft can be detected with vision-based approaches. Furthermore, Doppler radars are widely used for intrusion detection [6, 7]. Some WiFi frameworks do not require LoS conditions and can work through walls and obstacles, which is very useful for security surveillance systems [80–82]. RASID [83] uses standard WiFi hardware to detect intrusion and, combined with its ability to adapt to the environment, achieves an F-measure of 0.93. Ding et al. [84] proposed a device-free intrusion detection system using WiFi CSI with an average accuracy of 93%.

Daily Activity Monitoring: Daily activity monitoring is one of the basic applications of any CHAA modality. Ultrasound-based approaches have shown promise in monitoring daily activities such as sitting, walking, and sleep movements [85]. Kim et al. [40] used micro-Doppler spectrograms as input to a deep convolutional neural network to classify 4 activities with 90.9% accuracy. E-eyes [86] is a device-free activity recognition system that uses WiFi CSI to classify different daily activities, and WiReader [87] uses energy features from WiFi CSI with an LSTM network to recognize handwriting. Researchers have also shown the usability of ultrasound-based approaches for detecting talking faces [48], voice activity [47], and even gait [46]. On the other hand, advances in computer vision allow daily activities to be monitored with higher precision than other approaches: activities such as walking, running, standing, jumping, swimming, and hand-shaking [88, 89], as well as micro-activities such as cutting food and drinking water, can be monitored using computer vision [90].

Elderly and Patient Care: In elderly care, vision-based approaches play a significant role, as they can detect whether something unusual has happened to the subject; fall detection [91], for example, is one of the prime applications. Although static human subjects do not affect WiFi CSI in the time domain, a sudden fall results in a sharp change in CSI. WiFi-based fall detection techniques such as WiFall [92] and RT-Fall [93] have used this concept and applied different machine learning algorithms to distinguish falling from other activities with relatively high accuracy. Compressed features from Ultra-Wideband (UWB) radar have also been used for this purpose [94]. Moreover, vision-based approaches play a vital role in patient rehabilitation. Extracting body joints from RGB images using PoseNet has become popular for fitness monitoring and rehabilitation: the method extracts the coordinates of body joints with high accuracy, and those joint coordinates can be used to monitor exercise [95]. For example, if someone needs to move a hand or another body part through a certain angle, this approach can evaluate whether the exercise is being performed correctly.
Digital Well-being: This involves monitoring vital signs, i.e., heart rate and breathing rate, tracking sleep, and checking fitness activities. Some RF-based frameworks provide a contactless approach to heart rate and breathing rate estimation: some utilize RSSI measurements [96, 97], some use fine-grained WiFi CSI [98, 99], and another study uses Doppler radar [100]. Adib et al. [9] proposed an FMCW-based approach that measures chest motion during inhalation and exhalation to calculate breathing rate. EQ-Radio [101] infers human emotions from FMCW-based vital sign measurements using machine learning algorithms. SleepPoseNet [102] utilizes Ultra-Wideband (UWB) radar for Sleep Postural Transition (SPT) recognition, which can be used to detect sleep disorders. An ultrasound-based breathing rate monitoring study based on human thorax movement has shown the capability to monitor abnormal breathing activity [43]. Vision-based sleep or drowsiness detection is a significant application for monitoring working environments; for example, monitoring a driver's condition plays an important role in evaluating how fit that person is for duty. EZ-Sleep [103] provides a contactless insomnia tracking system that uses RF-based localization to find the bed position and an FMCW-based approach to keep track of the sleeping schedule. Day-to-day fitness activities such as bicycle exercises, toe-touches, and squats can be detected and monitored using an ultrasound-based system [51]. Moreover, fully contactless methods for acquiring the Electrocardiogram (ECG) signal and detecting Myocardial Infarction (MI) have also been explored [104, 105].

Infection Prevention and Control: Due to the outbreak of COVID-19, researchers have explored rigorous approaches to Infection Prevention and Control (IPC) [106]. For limiting the spread of the disease at a mass scale, vision-based modalities have shown the most potential. CCTV footage of public spaces, e.g., offices, markets, and streets, can be analyzed to monitor social distancing [107], detect face masks [108, 109], and detect elevated body temperature [110, 111]. Within a very short period, some of this research has been turned into commercial products: NVIDIA's Transfer Learning Toolkit (TLT) and DeepStream SDK have been used to develop a real-time face-mask detection system [112, 113], and Stereolabs has shown the usability of its 3D cameras for real-time social distancing monitoring [114]. Chiu et al. [115] presented a thermography technique that was used on a mass scale during the earlier SARS outbreak; the system screened 72,327 patients and visitors entering Taipei Medical University Wan Fang Hospital in Taiwan over one month. Negishi et al. [116] proposed a vision-based framework that can monitor respiration and heart rate along with skin temperature.

Gesture Control: With the widespread use of the Internet of Things (IoT) and the availability of smartphones and smartwatches, contactless device navigation and smart home appliance control have opened new dimensions for hand gesture recognition applications. Sound-based CHAA research has explored gesture control extensively using smartphones [53, 56] over the last decade, and systems such as AudioGest [52], SoundWave [54], and MultiWave [50] have explored the use of laptops and desktop computers. These gesture control methods include both static and dynamic hand gestures as well as small finger gestures.
Systems such as FingerIO [57] have demonstrated the air-writing recognition capability of CHAA, while UbiK [58] and Ipanel [59] have explored surface-writing detection using fingers. WiFinger [117] is a WiFi-based gesture recognition system that leverages fine-grained CSI to recognize 9 digits with an accuracy of 90.4% and can serve as numeric text input for devices such as smartphones and laptops. Google's Project Soli [12] uses 60 GHz millimeter-wave radar to detect finger gestures with very high accuracy. In the computer vision domain, the invention of the Kinect sensor brought drastic changes to the entertainment industry, making it easier to build 3D game environments with real-time interaction [118]. Thus, with increasing research on CHAA modalities, new application scopes are being discovered all the time.
3.6 Conclusion

Human activity recognition (HAR) is a vast research area with many branches. While contact-based methods have existed for a long time, they have considerable limitations from an application point of view. Contactless Human Activity Analysis (CHAA) has therefore become a very popular approach, leveraging the properties of wireless signals, ultrasound, video, and skeleton data, and it can be used in a wide range of application fields with relatively simple techniques and low complexity. These modalities have been discussed in the previous sections. Some of them are more effective than others depending on the requirements and the application. Video-based approaches require proper ambient lighting and LoS conditions to work well; if the environment does not satisfy these conditions, RF- or sound-based approaches might be more suitable. RF-based approaches do not need LoS, while sound-based approaches have greater usability in day-to-day life due to the widespread use of smartphones. On the other hand, in a suitable environment, video-based approaches can achieve higher accuracy than the other methods. Flexibility, feasibility, and hardware requirements should also be considered carefully before implementing any of these methods. In this chapter, we have given an overview of the evolution and current state of all these approaches, which we hope will be a useful guide for new researchers in this field.
References 1. Hussain, Z., Sheng, M., Zhang, W.E.: Different approaches for human activity recognition: a survey. arXiv preprint arXiv:1906.05074 (2019) 2. Ma, J., Wang, H., Zhang, D., Wang, Y., Wang, Y.: A survey on wi-fi based contactless activity recognition. In: Intl IEEE Conferences on Ubiquitous Intelligence & Computing, Advanced and Trusted Computing, Scalable Computing and Communica-
3 Contactless Human Activity Analysis: An Overview of Different Modalities
3.
4. 5. 6.
7.
8.
9.
10. 11.
12.
13.
14. 15. 16.
17.
18. 19. 20. 21.
22.
107
tions, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), pp. 1086–1091. IEEE (2016) Wang, Z., Hou, Y., Jiang, K., Zhang, C., Dou, W., Huang, Z., Guo, Y.: A survey on human behavior recognition using smartphone-based ultrasonic signal. IEEE Access 7, 100 581– 100 604 (2019) Foerster, F., Smeja, M., Fahrenberg, J.: Detection of posture and motion by accelerometry: a validation study in ambulatory monitoring. Comput. Human Behav. 15(5), 571–583 (1999) Watson-Watt, R.: Radar in war and in peace (1945) Frazier, L.M.: Radar surveillance through solid materials. In: Command, Control, Communications, and Intelligence Systems for Law Enforcement, vol. 2938. International Society for Optics and Photonics, pp. 139–146 (1997) Bahl, P., Padmanabhan, V.N.: Radar: an in-building rf-based user location and tracking system. In: Proceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies (Cat. No. 00CH37064), vol. 2, pp. 775–784. IEEE (2000) Adib, F., Kabelac, Z., Katabi, D., Miller, R.C.: 3d tracking via body radio reflections. In: 11th {USENIX} Symposium on Networked Systems Design and Implementation ({NSDI} 14), pp. 317–329 (2014) Adib, F., Mao, H., Kabelac, Z., Katabi, D., Miller, R.C.: Smart homes that monitor breathing and heart rate. In: Proceedings of the 33rd annual ACM Conference on Human Factors in Computing Systems, pp. 837–846 (2015) Halperin, D., Hu, W., Sheth, A., Wetherall, D.: Tool release: gathering 802.11 n traces with channel state information. ACM SIGCOMM Comput. Commun. Rev. 41(1), 53 (2011) Geisheimer, J.L., Greneker III, E.F., Marshall, W.S.: High-resolution doppler model of the human gait. In: Radar Sensor Technology and Data Visualization, vol. 4744. International Society for Optics and Photonics, pp. 8–18 (2002) Lien, J., Gillian, N., Karagozler, M.E., Amihood, P., Schwesig, C., Olson, E., Raja, H., Poupyrev, I.: Soli: Ubiquitous gesture sensing with millimeter wave radar. ACM Trans. Graph. (TOG) 35(4), 1–19 (2016) Anitha, U., Malarkkan, S., Premalatha, J., Prince, P.G.K.: Study of object detection in sonar image using image segmentation and edge detection methods. Indian J. Sci. Technol. 9(42) (2016) Katzir, S.: Who knew piezoelectricity? rutherford and langevin on submarine detection and the invention of sonar. Notes and Records of the Royal Society 66(2), 141–157 (2012) Elfes, A.: Sonar-based real-world mapping and navigation. IEEE J. Robot. Autom. 3(3), 249–265 (1987) Peng, C., Shen, G., Zhang, Y., Li, Y., Tan, K.: Beepbeep: a high accuracy acoustic ranging system using cots mobile devices. In: Proceedings of the 5th International Conference on Embedded Networked Sensor Systems, pp. 1–14 (2007) Filonenko, V., Cullen, C., Carswell, J.: Investigating ultrasonic positioning on mobile phones. In: 2010 International Conference on Indoor Positioning and Indoor Navigation, pp. 1–8.. IEEE (2010) Aggarwal, J.K., Cai, Q.: Human motion analysis: a review. Comput. Vis. Image Understan. 73(3), 428–440 (1999) Gavrila, D.M.: The visual analysis of human movement: a survey. Computer Vis. Image Understan. 73(1), 82–98 (1999) Krüger, V., Kragic, D., Ude, A., Geib, C.: The meaning of action: a review on action recognition and mapping. Adv. Robot. 21(13), 1473–1501 (2007) Liu, A.-A., Xu, N., Nie, W.-Z., Su, Y.-T., Wong, Y., Kankanhalli, M.: Benchmarking a multimodal and multiview and interactive dataset for human action recognition. 
IEEE Trans. Cybern. 47(7), 1781–1794 (2016) Liu, A.-A., Su, Y.-T., Nie, W.-Z., Kankanhalli, M.: Hierarchical clustering multi-task learning for joint human action grouping and recognition. IEEE Trans. Pattern Anal. Mach. Intell. 39(1), 102–114 (2016)
23. Yang, X., Tian, Y.: Super normal vector for activity recognition using depth sequences. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 804– 811 (2014) 24. Li, M., Leung, H., Shum, H.P.: Human action recognition via skeletal and depth based feature fusion. In: Proceedings of the 9th International Conference on Motion in Games, pp. 123–132 (2016) 25. Burghouts, G., Schutte, K., ten Hove, R.-M., van den Broek, S., Baan, J., Rajadell, O., van Huis, J., van Rest, J., Hanckmann, P., Bouma, H., et al.: Instantaneous threat detection based on a semantic representation of activities, zones and trajectories. Signal Image Video Process. 8(1), 191–200 (2014) 26. Dawn, D.D., Shaikh, S.H.: A comprehensive survey of human action recognition with spatiotemporal interest point (stip) detector. Visual Comput. 32(3), 289–306 (2016) 27. Nguyen, T.V., Song, Z., Yan, S.: Stap: Spatial-temporal attention-aware pooling for action recognition. IEEE Trans. Circuits Syst. Video Technol. 25(1), 77–86 (2014) 28. Xie, Y., Li, Z., Li, M.: Precise power delay profiling with commodity wi-fi. IEEE Trans. Mobile Comput. 18(6), 1342–1355 (2018) 29. Kim, Y., Ling, H.: Human activity classification based on micro-doppler signatures using a support vector machine. IEEE Trans. Geosci. Remote Sens. 47(5), 1328–1337 (2009) 30. Rappaport, T.S., et al.: Wireless communications: principles and practice 2 (1996) 31. Patwari, N., Wilson, J.: Spatial models for human motion-induced signal strength variance on static links. IEEE Trans. Inform. Forensics Secur. 6(3), 791–802 (2011) 32. Seidel, S.Y., Rappaport, T.S.: 914 mhz path loss prediction models for indoor wireless communications in multifloored buildings. IEEE Trans. Antennas Propagation 40(2), 207–217 (1992) 33. Yuan, Y., Zhao, J., Qiu, C., Xi, W.: Estimating crowd density in an rf-based dynamic environment. IEEE Sensors J. 13(10), 3837–3845 (2013) 34. Wu, K., Xiao, J., Yi, Y., Gao, M., Ni, L.M.: Fila: Fine-grained indoor localization. In: Proceedings IEEE INFOCOM, pp. 2210–2218. IEEE (2012) 35. Yang, Z., Zhou, Z., Liu, Y.: From rssi to csi: indoor localization via channel response. ACM Comput. Surv. (CSUR) 46(2), 1–32 (2013) 36. Tse, D., Viswanath, P.: Fundamentals of Wireless Communication. Cambridge University Press (2005) 37. Wang, W., Liu, A.X., Shahzad, M., Ling, K., Lu, S.: Understanding and modeling of wifi signal based human activity recognition. In: Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, pp. 65–76 (2015) 38. Pu, Q., Gupta, S., Gollakota, S., Patel, S.: Whole-home gesture recognition using wireless signals. In: Proceedings of the 19th Annual International Conference on Mobile Computing & Networking, pp. 27–38 (2013) 39. Soumekh, M.: Synthetic Aperture Radar Signal Processing. Wiley, New York, vol. 7 (1999) 40. Kim, Y., Moon, T.: Human detection and activity classification based on micro-doppler signatures using deep convolutional neural networks. IEEE Geosci. Remote Sens. Lett. 13(1), 8–12 (2015) 41. Griffiths, H.: New ideas in fm radar. Electron. Commun. Eng. J. 2(5), 185–194 (1990) 42. Liu, J., Liu, H., Chen, Y., Wang, Y., Wang, C.: Wireless sensing for human activity: a survey. IEEE Commun. Surv, Tutorials (2019) 43. Al-Naji, A., Al-Askery, A.J., Gharghan, S.K., Chahl, J.: A system for monitoring breathing activity using an ultrasonic radar detection with low power consumption. J. Sensor Actuator Netw. 8(2), 32 (2019) 44. 
Biswas, S., Harrington, B., Hajiaghajani, F., Wang, R.: Contact-less indoor activity analysis using first-reflection echolocation. In: 2016 IEEE International Conference on Communications (ICC), pp. 1–6. IEEE (2016) 45. Griffith, H., Hajiaghajani, F., Biswas, S.: Office activity classification using first-reflection ultrasonic echolocation. In: 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE 2017, 4451–4454 (2017)
46. Kalgaonkar,K., Raj, B.: Acoustic doppler sonar for gait recognition. In: 2007 IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 27–32. IEEE (2007) 47. Kalgaonkar, K., Hu, R., Raj, B.: Ultrasonic doppler sensor for voice activity detection. IEEE Signal Process. Lett. 14(10), 754–757 (2007) 48. Kalgaonkar, K., Raj, B.: Recognizing talking faces from acoustic doppler reflections. In: 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition, pp. 1–6.IEEE (2008) 49. Kalgaonkar, K., Raj, B.: One-handed gesture recognition using ultrasonic doppler sonar. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1889– 1892. IEEE (2009) 50. Pittman, C.R., LaViola, J.J.: Multiwave: Complex hand gesture recognition using the doppler effect. Graphics Interface, pp. 97–106 (2017) 51. Fu, B., Kirchbuchner, F., Kuijper, A., Braun, A., Vaithyalingam Gangatharan, D.: Fitness activity recognition on smartphones using doppler measurements. In: Informatics, vol. 5, no. 2. Multidisciplinary Digital Publishing Institute, p. 24 (2018) 52. Ruan, W., Sheng, Q.Z., Yang, L., .Gu, L., Xu, P., Shangguan, L.: Audiogest: enabling finegrained hand gesture detection by decoding echo signal. In: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 474–485 (2016) 53. Qifan, Y., Hao, T., Xuebing, Z., Yin, L., Sanfeng, Z.: Dolphin: ultrasonic-based gesture recognition on smartphone platform. In: 2014 IEEE 17th International Conference on Computational Science and Engineering, pp. 1461–1468. IEEE (2014) 54. Gupta, S., Morris, D., Patel, S., Tan, D.: Soundwave: using the doppler effect to sense gestures. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1911–1914 (2012) 55. Wang, T., Zhang, D., Wang, L., Zheng, Y., Gu, T., Dorizzi, B., Zhou, X.: Contactless respiration monitoring using ultrasound signal with off-the-shelf audio devices. IEEE Internet Things J. 6(2), 2959–2973 (2018) 56. Wang, W., Liu, A.X., Sun, K.: Device-free gesture tracking using acoustic signals. In: Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, pp. 82–94 (2016) 57. Nandakumar, R., Iyer, V., Tan, D., Gollakota, S.: Fingerio: using active sonar for fine-grained finger tracking. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 1515–1525 (2016) 58. Wang, J., Zhao, K., Zhang, X., Peng, C.: Ubiquitous keyboard for small mobile devices: harnessing multipath fading for fine-grained keystroke localization. In: Proceedings of the 12th Annual International Conference on Mobile Systems, Applications, and Services, pp. 14–27 (2014) 59. Chen, M., Yang, P., Xiong, J., Zhang, M., Lee, Y., Xiang, C., Tian, C.: Your table can be an input panel: Acoustic-based device-free interaction recognition. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 3(1), 1–21 (2019) 60. Du, H., Li, P., Zhou, H., Gong, W., Luo, G., Yang, P.: Wordrecorder: accurate acoustic-based handwriting recognition using deep learning. In: IEEE INFOCOM 2018-IEEE Conference on Computer Communications, pp. 1448–1456. IEEE (2018) 61. Vrigkas, M., Nikou, C., Kakadiaris, I.A.: A review of human activity recognition methods. Front. Robot. AI 2, 28 (2015) 62. Jalal, A., Kamal, S., Kim, D.: Shape and motion features approach for activity tracking and recognition from kinect video camera. 
In: 2015 IEEE 29th International Conference on Advanced Information Networking and Applications Workshops, pp. 445–450. IEEE (2015) 63. Lin, W., Sun, M.-T., Poovandran, R., Zhang, Z.: Human activity recognition for video surveillance. In: IEEE International Symposium on Circuits and Systems. IEEE 2008, 2737–2740 (2008) 64. Liu, B., Cai, H., Ju, Z., Liu, H.: Rgb-d sensing based human action and interaction analysis: a survey. Pattern Recogn. 94, 1–12 (2019)
65. Nie, Q., Wang, J., Wang, X., Liu, Y.: View-invariant human action recognition based on a 3d bio-constrained skeleton model. IEEE Trans. Image Process. 28(8), 3959–3972 (2019) 66. Ni, B., Wang, G., Moulin, P.: Rgbd-hudaact: a color-depth video database for human daily activity recognition. In: 2011 IEEE International Conference on Computer Vision Workshops (ICCV workshops). IEEE, pp. 1147–1153 (2011) 67. Laptev, I., Marszalek, M., Schmid, C., Rozenfeld, B.: Learning realistic human actions from movies. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008) 68. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2929–2936. IEEE (2009) 69. Rodriguez, M.D., Ahmed, J., Shah, M.: Action mach a spatio-temporal maximum average correlation height filter for action recognition. In: IEEE Conference on Computer Vision and Pattern Recognition. IEEE 2008, pp. 1–8 (2008) 70. Blank, M., Gorelick, L., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. In: Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, vol. 2, pp. 1395–1402. IEEE (2005) 71. Fothergill, S., Mentis, H., Kohli, P., Nowozin, S.: Instructing people for training gestural interactive systems. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1737–1746 (2012) 72. Liu, J., Shahroudy, A., Perez, M.L., Wang, G., Duan, L.-Y., Chichung, A.K.: Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding. IEEE Trans. Pattern Anal. Mach, Intell (2019) 73. Carmi, R., Itti, L.: The role of memory in guiding attention during natural vision. J. Vis. 6(9), 4 (2006) 74. Corbillon, X., De Simone, F., Simon, G.: 360-degree video head movement dataset. In: Proceedings of the 8th ACM on Multimedia Systems Conference, pp. 199–204 (2017) 75. Vakanski, A., Jun, H.-P., Paul, D., Baker, R.: A data set of human body movements for physical rehabilitation exercises. Data 3(1), 2 (2018) 76. Ramanathan, M., Yau, W.-Y., Teoh, E.K.: Human action recognition with video data: research and evaluation challenges. IEEE Trans. Human-Mach. Syst. 44(5), 650–663 (2014) 77. Wang, S., Song, J., Lien, J., Poupyrev, I., Hilliges, O.: Interacting with soli: exploring finegrained dynamic gesture recognition in the radio-frequency spectrum. In: Proceedings of the 29th Annual Symposium on User Interface Software and Technology, pp. 851–860 (2016) 78. Bulling, A., Blanke, U., Schiele, B.: A tutorial on human activity recognition using body-worn inertial sensors. ACM Comput. Surv. (CSUR) 46(3), 1–33 (2014) 79. Ahad, M.A.R., Antar, A.D., Shahid, O.: Vision-based action understanding for assistive healthcare: a short review. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops 2019, 1–11 (2019) 80. Wilson, J., Patwari, N.: See-through walls: Motion tracking using variance-based radio tomography networks. IEEE Trans. Mobile Comput. 10(5), 612–621 (2010) 81. Adib, F., Katabi, D.: See through walls with wifi!. In: Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, pp. 75–86 (2013) 82. Chetty, K., Smith, G.E., Woodbridge, K.: Through-the-wall sensing of personnel using passive bistatic wifi radar at standoff distances. IEEE Trans. Geosci. Remote Sens. 50(4), 1218–1226 (2011) 83. Kosba, A.E., Saeed, A., Youssef, M.: Rasid: a robust wlan device-free passive motion detection system. 
In: 2012 IEEE International Conference on Pervasive Computing and Communications, pp. 180–189. IEEE (2012) 84. Ding, E., Li, X., Zhao, T., Zhang, L., Hu, Y.: A robust passive intrusion detection system with commodity wifi devices. J. Sens. 2018, (2018) 85. Fu, B., Karolus, J., Grosse-Puppendahl, T., Hermann, J., Kuijper, A.: Opportunities for activity recognition using ultrasound doppler sensing on unmodified mobile phones. In: Proceedings of the 2nd International Workshop on Sensor-Based Activity Recognition and Interaction, pp. 1–10 (2015)
86. Wang, Y., Liu, J., Chen, Y., Gruteser, M., Yang, J., Liu, H.: E-eyes: device-free locationoriented activity identification using fine-grained wifi signatures. In: Proceedings of the 20th Annual International Conference on Mobile Computing and Networking, pp. 617–628 (2014) 87. Guo, Z., Xiao, F., Sheng, B., Fei, H., Yu, S.: Wireader: adaptive air handwriting recognition based on commercial wi-fi signal. IEEE Internet Things J. (2020) 88. Reddy, K.K., Shah, M.: Recognizing 50 human action categories of web videos. Mach. Vis. Appl. 24(5), 971–981 (2013) 89. Soomro, K., Zamir, A.R., Shah, M.: A dataset of 101 human action classes from videos in the wild. Center Res. Comput. Vis. 2 (2012) 90. Jhuang, H., Garrote, H., Poggio, E., Serre, T., Hmdb, T.: A large video database for human motion recognition. In: Proceedings of IEEE International Conference on Computer Vision, vol. 4, no. 5, 2011, p. 6 91. Mubashir, M., Shao, L., Seed, L.: A survey on fall detection: principles approaches. Neurocomputing 100, 144–152 (2013) 92. Wang, Y., Wu, K., Ni, L.M.: Wifall: Device-free fall detection by wireless networks. IEEE Trans. Mobile Comput. 16(2), 581–594 (2016) 93. Wang, H., Zhang, D., Wang, Y., Ma, J., Wang, Y., Li, S.: Rt-fall: A real-time and contactless fall detection system with commodity wifi devices. IEEE Trans. Mobile Comput. 16(2), 511–526 (2016) 94. Sadreazami, H., Mitra, D., Bolic, M., Rajan, S.: Compressed domain contactless fall incident detection using uwb radar signals. In: 18th IEEE International New Circuits and Systems Conference (NEWCAS). IEEE 2020, pp. 90–93 (2020) 95. Kendall, A., Grimes, M., Cipolla, R.: Posenet: a convolutional network for real-time 6-dof camera relocalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2938–2946 (2015) 96. Patwari, N., Brewer, L., Tate, Q., Kaltiokallio, O., Bocca, M.: Breathfinding: a wireless network that monitors and locates breathing in a home. IEEE J. Selected Topics Signal Process. 8(1), 30–42 (2013) 97. Abdelnasser, H. Harras, K.A., Youssef, M.: Ubibreathe: a ubiquitous non-invasive wifi-based breathing estimator. In: Proceedings of the 16th ACM International Symposium on Mobile Ad Hoc Networking and Computing, pp. 277–286 (2015) 98. Liu, J., Chen, Y., Wang, Y., Chen, X., Cheng, J., Yang, J.: Monitoring vital signs and postures during sleep using wifi signals. IEEE Internet Things J. 5(3), 2071–2084 (2018) 99. Wang, X., Yang, C., Mao, S.: Phasebeat: exploiting csi phase data for vital sign monitoring with commodity wifi devices. In: 2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), pp. 1230–1239. IEEE (2017) 100. Islam, S.M., Boric-Lubecke, O., Lubekce, V.M.: Concurrent respiration monitoring of multiple subjects by phase-comparison monopulse radar using independent component analysis (ica) with jade algorithm and direction of arrival (doa). IEEE Access 8, 73 558–73 569 (2020) 101. Zhao, M., Adib, F., Katabi, D.: Emotion recognition using wireless signals. In: Proceedings of the 22nd Annual International Conference on Mobile Computing and Networking, pp. 95–108 (2016) 102. Piriyajitakonkij, M., Warin, P., Lakhan, P., Leelaarporn, P., Kumchaiseemak, N., Suwajanakorn, S., Pianpanit, T., Niparnan, N., Mukhopadhyay, S.C., Wilaiprasitporn, T.: Sleepposenet: multi-view learning for sleep postural transition recognition using uwb. IEEE J, Biomedical Health Inform (2020) 103. 
Hsu, C.-Y., Ahuja, A., Yue, S., Hristov, R., Kabelac, Z., Katabi, D.: Zero-effort in-home sleep and insomnia monitoring using radio signals. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 1(3), 1–18 (2017) 104. Weeks, J., Elsaadany, M., Lessard-Tremblay, M., Targino, L., Liamini, M., Gagnon, G.: A novel sensor-array system for contactless electrocardiogram acquisition. In: 42nd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pp. 4122–4125. IEEE (2020)
105. Zhang, J., Chen, Y., Chen, T., et al.: Health-radio: towards contactless myocardial infarction detection using radio signals. IEEE Trans, Mobile Comput (2020) 106. Ulhaq, A., Khan, A., Gomes, D., Pau, M.: Computer vision for covid-19 control: a survey. arXiv preprint arXiv:2004.09420 (2020) 107. Yang, D., Yurtsever, E., Renganathan, V., Redmill, K., Özgüner, U.: a vision-based social distancing and critical density detection system for covid-19. Image video Process, DOI (2020) 108. Jiang, M., Fan, X.: Retinamask: a face mask detector. arXiv preprint arXiv:2005.03950 (2020) 109. Ge, S., Li, J., Ye, Q., Luo, Z.: Detecting masked faces in the wild with lle-cnns. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2682–2690 (2017) 110. Lahiri, B., Bagavathiappan, S., Jayakumar, T., Philip, J.: Medical applications of infrared thermography: a review. Infrared Phys. Technol. 55(4), 221–235 (2012) 111. Somboonkaew, A., Prempree, P., Vuttivong, S., Wetcharungsri, J., Porntheeraphat, S., Chanhorm, S., Pongsoon, P., Amarit, R., Intaravanne, Y., Chaitavon, K.: Mobile-platform for automatic fever screening system based on infrared forehead temperature. In: Opto-Electronics and Communications Conference (OECC) and Photonics Global Conference (PGC). IEEE 2017, pp. 1–4 (2017) 112. Github—nvidia-ai-iot/face-mask-detection: Face mask detection using nvidia transfer learning toolkit (tlt) and deepstream for covid-19. https://github.com/NVIDIA-AI-IOT/facemask-detection. Accessed 10 Oct 2020 113. Implementing a real-time, ai-based, face mask detector application for covid-19 | nvidia developer blog. https://developer.nvidia.com/blog/implementing-a-real-time-ai-based-facemask-detector-application-for-covid-19/. Accessed 10 Oct 2020 114. Using 3d cameras to monitor social distancing stereolabs. https://www.stereolabs.com/blog/ using-3d-cameras-to-monitor-social-distancing/. Accessed 10 Oct 2020 115. Chiu, W., Lin, P., Chiou, H., Lee, W., Lee, C., Yang, Y., Lee, H., Hsieh, M., Hu, C., Ho, Y., et al.: Infrared thermography to mass-screen suspected sars patients with fever. Asia Pacific J. Public Health 17(1), 26–28 (2005) 116. Negishi, T., Sun, G., Sato, S., Liu, H., Matsui, T., Abe, S., Nishimura, H., Kirimoto, T., "Infection screening system using thermography and ccd camera with good stability and swiftness for non-contact vital-signs measurement by feature matching and music algorithm. In: 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 3183–3186. IEEE (2019) 117. Li, H., Yang, W., Wang, J., Xu, Y., Huang, L.: Wifinger: talk to your smart devices with finger-grained gesture. In: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 250–261 (2016) 118. Altanis, G., Boloudakis, M., Retalis, S., Nikou, N.: Children with motor impairments play a kinect learning game: first findings from a pilot case in an authentic classroom environment. Interaction Design and Architecture (s) J.-IxD&A, vol. 19, no. 19, pp. 91–104 (2013)
Chapter 4
Signal Processing for Contactless Monitoring
Mohammad Saad Billah, Md Atiqur Rahman Ahad, and Upal Mahbub
Abstract Monitoring human activities from a distance, without actively interacting with the subjects, is a fascinating research domain given the associated challenges and the prospect of building more robust artificial intelligence systems. In recent years, with the advancement of deep learning and high-performance computing systems, contactless human activity monitoring systems are becoming more realizable every day. However, when looked at closely, the basic building blocks of any such system still rely strongly on the fundamentals of various signal processing techniques. The choice of a signal processing method depends on the type of signal, the formulation of the problem, and the higher-level machine learning components. In this chapter, a comprehensive review of the most popular signal processing methods used for contactless monitoring is provided, highlighting their use across different activity signals and tasks. Keywords Signal processing · Contactless monitoring · Activity signals
4.1 Introduction

In recent times, contactless human monitoring has gained a lot of traction.
M. S. Billah (B) Nauto, Inc., Palo Alto, CA, USA e-mail: [email protected] M. A. R. Ahad Department of Electrical and Electronic Engineering, University of Dhaka, Dhaka, Bangladesh e-mail: [email protected] Department of Media Intelligent, Osaka University, Osaka, Japan U. Mahbub Qualcomm Technologies Inc., San Diego, CA, USA e-mail: [email protected]
Application areas of such systems include monitoring breathing patterns, respiratory rate, and other vital signs [1–3], event recognition [4], human motion classification [5], and analysing crowded scenes [6]. The vast proliferation of contactless sensors has enabled contactless activity monitoring with different activity signals. Among these, audio-based, light-based, and radio-frequency-based sensors are the most widely used. Different types of sensors have different strengths and weaknesses. For example, radio-frequency and proximity sensors provide cheap contactless monitoring but suffer from low accuracy and high environmental interference. Light-based sensors such as cameras, depth sensors, and LIDARs offer high accuracy and resolution but are also expensive and require high computational power for processing. Table 4.1 lists different aspects of the most popular signal sources for contactless monitoring.
Signal processing is one of the most critical and fundamental blocks of contactless monitoring. From sensing the physical world to making a decision, e.g., recognizing, modeling, and understanding, signal processing techniques are used in every step in between. Figure 4.1 shows the typical lifetime of a signal in contactless monitoring, which starts with activity signal acquisition by sensing the world using different contactless
Table 4.1 Different aspects of the most popular signal sources for contactless monitoring

Audio-based (Speech, Acoustic, Ultrasound)
  Advantages: Moderate to high accuracy; moderate to low cost; ultrasound is precise for determining distances, highly sensitive to motion, and has a long operating range
  Disadvantages: Easily influenced by other audio signals/noise; prone to false detection; range limited; privacy issues; ultrasound is unidirectional, sensitive to temperature and to the angle of the target, and its performance drops at very close proximity
  Notable use-cases: Intelligent personal assistants (IPAs) [7]; audio-based context/scene recognition [8, 9]; human activity recognition [10]; heart and respiration rate monitoring [2]; office and indoor activity analysis [11, 12]; rehabilitation support [13]

Radio frequency-based (RF, WiFi)
  Advantages: Low cost; simple computation
  Disadvantages: Environmental interference
  Notable use-cases: Measuring vital signs [3]; indoor/outdoor localization and tracking

Light-based (Infrared sensors, Thermal imaging sensors, 2D cameras, Depth sensors and hybrid sensors)
  Advantages: High accuracy
  Disadvantages: High cost; privacy issues; influenced by illumination, pose, occlusion, and noise
  Notable use-cases: Activity recognition from thermal videos [14]; facial expression analysis; action recognition for robotics and HCI; crowded scene analysis and anomaly detection; pose prediction; pedestrian detection for autonomous driving; technology for assisted living such as fall detection; 3D body shape, face and hand modeling for augmented and virtual reality applications; precise tracking of face and eye for autonomous driving scenarios; liveliness detection for anti-spoofing of authentication systems

Other sensors (Passive infrared sensors (PIR), Proximity sensors)
  Advantages: Low cost; simple computation
  Disadvantages: Low accuracy; limited usability
  Notable use-cases: Activity recognition and tracking [15]; collision avoidance technology for blind people and wheelchairs; motion-based automatic control of switches for smart home systems
Fig. 4.1 Any contactless human activity analysis system usually follows this pipeline from left to right
sensors such as a microphone, camera, LIDAR, infra-red, or ultrasonic sensors. Different sampling and windowing techniques are used to acquire discrete signals from the continuous real world. The signal then goes through different pre-processing steps such as denoising and other filtering methods to enhance its quality. Features are then extracted from the signals to be used by different activity analysis algorithms. This chapter briefly discusses the signal processing steps and their applications.
The chapter is organized as follows: Sect. 4.2 gives an overview of different sampling and windowing techniques. Section 4.3 discusses time and frequency domain processing techniques and their applications. In Sect. 4.4, some widely used feature descriptors and their extraction techniques are described. Different dimensionality reduction methods and their applications are discussed in Sect. 4.5. Activity analysis algorithms are beyond the scope of this chapter and, therefore, not discussed.
4.2 Activity Signals Sampling and Windowing Techniques

Signal sampling and windowing are two essential steps of signal processing that are applied during or right after the signal acquisition and play a critical role in the system’s performance. This section discusses applications of signal sampling and windowing methods and the impact of windowing in activity analysis.
4.2.1 Applications of Signal Sampling

Sampling is the process of converting a continuous-time signal from the real, analog world to a discrete-time signal in the digital domain. The analog signal value is measured at certain time intervals to obtain ‘samples’ for the digital domain. Analog signals are continuous in both amplitude and time, while the sampled digital signals are discrete in both. If a continuous signal is sampled at a frequency f_s, the frequency components of the analog signal are repeated at the sample rate, resulting in a discrete frequency response repeated at the origin, ±f_s, ±2f_s, and so on. According to the Nyquist–Shannon sampling theorem [16], sampling needs to be at least at the Nyquist rate (2 × the maximum frequency f_max of the signal) or more for exact reproduction. Sampling below the Nyquist rate (f_Ny) causes information loss and aliasing. Unwanted components are introduced in the reconstructed signal during aliasing when signal frequencies overlap due to a low sampling rate, and some frequencies of the original signal get lost in the process. Results of sampling a simple sine wave at different rates are shown in Fig. 4.2. In many real-life applications, noise represents the highest frequency components of a signal, and aliasing of those frequencies is undesired. Hence, low-pass filtering is performed before sampling to prevent aliasing of the noise components. While ‘temporal aliasing’ occurs in signals sampled in the time domain (such as audio signals), it can also happen for spatially sampled signals, such as an image - a phenomenon referred to as ‘spatial aliasing’. Spatial sampling can cause jaggies on the edges, as commonly seen on low-resolution versions of an image (example shown
Fig. 4.2 Effect of sampling frequency. A 240 Hz sine wave is sampled using sampling frequencies of 2400 Hz, 1600 Hz, 800 Hz, 400 Hz, 200 Hz and 100 Hz (from top-left to bottom-right image). The aliasing cases are shown in the bottom row, where f_s < 2 × f_max. Clearly, the sampled signals in the bottom figures have unwanted components due to aliasing
in Fig. 4.3). Other artifacts of aliasing include the wagon wheel effect¹ for temporal sampling, temporal strobing when sampling in space-time, the Moiré effect [17] when sampling texture coordinates, and sparkling highlights. For spatio-temporal data like videos collected for surveillance, the temporal sampling rate needs to be high to prevent the strobing effect and to ensure no critical information is lost due to a lower sampling rate, which would otherwise defeat the purpose of a surveillance system. In Fig. 4.4, the histograms of the average frame-per-second rate of video surveillance systems for two different years are shown.² According to IPVM statistics,³ the average frame rate for video surveillance systems increased from 6–8 fps in 2011 to ≈10 fps in 2016 and then to 15 fps in 2019. Understandably, commercial video surveillance systems are inclining towards higher frame rates to ensure high-quality, seamless video streams for the customers. While an increased
1 https://michaelbach.de/ot/mot-wagonWheel/index.html.
2 Based on https://ipvm.com/reports/frame-rate-surveillance-guide.
3 https://ipvm.com/reports/avg-frame-rate-2019.
Fig. 4.3 a Original 1365 × 1365 pixel image obtained and modified from the Open Image Dataset V6 [18]. b Image down-sampled to 64 × 64 pixels by sampling every fourth sample and applying a box filter. The jagged patterns and high dimensional noise introduced by aliasing and the box filtering are clearly visible. c Down-sampled to 64 × 64 pixels using an anti-aliasing Lanczos filter [19]
Fig. 4.4 Histogram of average FPS for video surveillance. There is a clear trend of increasing FPS over the years
frame rate can lead to higher bandwidth requirements for such a system, depending on the compression methods used, bandwidth does not increase linearly with frame rate.⁴ Table 4.2 lists the specifications of some cameras used in the industry. As can be seen in the table, the newer models have higher spatio-temporal sampling rates. Table 4.3 lists the specifications of some LIDAR sensors.
4 https://ipvm.com/reports/frame-rate-surveillance-guide.
Table 4.2 Specifications of some surveillance cameras used in the industry

Model                        Release   Resolution   FPS
Axis M3004                   2012      1.0 MP       30
Sony SNC-EM600               2013      1.3 MP       30
Reolink RLC-423              2015      5 MP         25
Reolink RLC-410              2017      5 MP         25
Hanwha (Samsung) PNO-9080R   2016      12 MP        20

Table 4.3 Specifications of some LIDAR sensors used in the industry

Model                 Release   Range   Resolution   Scan rate   Accuracy   Weight
Velodyne HDL 64       2007      120 m   0.08/0.4     2.2 M       2 cm       12.7 kg
Velodyne Puck Ultra   2016      200 m   0.1/0.33     1.2 M       3 cm       0.925 kg
Quanergy M8           2016      150 m   0.03         1.26 M      3 cm       0.900 kg
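As an illustration of the Nyquist criterion discussed above, a minimal NumPy sketch (with arbitrary duration and rates, following the 240 Hz example of Fig. 4.2) samples a sine wave above and below the Nyquist rate and estimates the apparent frequency from the FFT peak:

```python
import numpy as np

def sample_sine(f_signal, f_sample, duration=0.05):
    """Sample a sine wave of frequency f_signal (Hz) at rate f_sample (Hz)."""
    t = np.arange(0.0, duration, 1.0 / f_sample)
    return np.sin(2 * np.pi * f_signal * t)

f_max = 240.0                        # signal frequency used in Fig. 4.2
for f_s in (2400, 800, 200):         # the last rate is below the Nyquist rate 2 * f_max
    x = sample_sine(f_max, f_s)
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / f_s)
    f_apparent = freqs[np.argmax(spectrum)]
    print(f"f_s = {f_s:5d} Hz -> apparent frequency ~ {f_apparent:.0f} Hz")
```

At the 200 Hz sampling rate the peak appears near 40 Hz, the aliased image of the 240 Hz tone.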
4.2.2 Impact of Signal Windowing on Activity Analysis

Windowing plays a vital role in activity analysis performance. Given the type of signal and application, the method and duration of windowing can vary widely. For example, in [20], the authors demonstrated that the size of the window plays a significant role in determining speech intelligibility and that the optimum Hamming window duration for speech reconstruction from the short-term magnitude spectrum is 15–32 ms. When choosing a window for a 1-D signal, the following factors can be considered:
– width of the main lobe,
– spectral leakage from the attenuation of the side lobes, and
– rate of attenuation of the side lobes.
In Fig. 4.5, five time domain window functions, namely rectangular, Bartlett, Hamming, Hanning and Blackman [21, 22], are shown with their respective frequency domain responses. The values of the window functions at the n-th sample for a window length of N, where 0 ≤ n ≤ N, are defined as follows:
Fig. 4.5 Time (left) and Frequency (right) domain responses of five different window functions
Rectangular: w[n] = 1,    (4.1)

Bartlett: w[n] = 1 − |n − N/2| / (N/2),    (4.2)

Hamming: w[n] = 0.54 − 0.46 cos(2πn/N),    (4.3)

Hanning: w[n] = 0.5 − 0.5 cos(2πn/N),    (4.4)

Blackman: w[n] = 0.42 − 0.5 cos(2πn/N) + 0.08 cos(4πn/N).    (4.5)
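As a quick check of these definitions, a minimal NumPy sketch (window length chosen arbitrarily) evaluates Eqs. (4.1)–(4.5) and verifies them against NumPy’s built-in window functions:

```python
import numpy as np

N = 64
n = np.arange(N + 1)  # 0 <= n <= N, as in Eqs. (4.1)-(4.5)

windows = {
    "rectangular": np.ones(N + 1),
    "bartlett":    1.0 - np.abs(n - N / 2) / (N / 2),
    "hamming":     0.54 - 0.46 * np.cos(2 * np.pi * n / N),
    "hanning":     0.50 - 0.50 * np.cos(2 * np.pi * n / N),
    "blackman":    0.42 - 0.50 * np.cos(2 * np.pi * n / N)
                   + 0.08 * np.cos(4 * np.pi * n / N),
}

# NumPy's built-ins use the same formulas for an (N + 1)-point window.
assert np.allclose(windows["bartlett"], np.bartlett(N + 1))
assert np.allclose(windows["hamming"],  np.hamming(N + 1))
assert np.allclose(windows["hanning"],  np.hanning(N + 1))
assert np.allclose(windows["blackman"], np.blackman(N + 1))

# Frequency-domain magnitude responses (cf. Fig. 4.5) via a zero-padded FFT.
responses = {name: np.abs(np.fft.rfft(w, 4096)) for name, w in windows.items()}
```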
As can be seen in Fig. 4.5, the rectangular window has the narrowest main lobe but higher side lobe strength, while the other windows have wider main lobes but lower side lobes. Hence, a rectangular window would be a better choice for separating two signals with similar frequency and strength, but a worse choice for identifying two signals with different frequencies and strengths due to the spectral leakage and the lower rate of attenuation of the side lobes [22, 23]. The 1-D signal windowing techniques are extended to 2-D spatial windows, also known as kernels. The choice of a kernel depends on the type of the image processing task. A simple example is the Gaussian kernel, which is widely used for image smoothing and de-noising [24]. An isotropic 2D Gaussian kernel of unit magnitude has the following form:

G(x, y) = (1 / (2πσ²)) e^{−(x² + y²) / (2σ²)}    (4.6)
where x and y are the pixel indexes from the center and σ is the standard deviation. Temporal windowing, a.k.a. temporal segmentation, is an integral part of action recognition systems for real-time applications. Sliding windows are the most common windowing technique for such scenarios [25]. However, based on the specific use-case, the length of the temporal window might or might not change dynamically. The temporal overlap between consecutive windows is also considered. Moreover, the
size of the windows can be dynamically expanded or shrunk based on activity inference in some systems [25]. In a macro-level view, the design choices are as follows:

1. Fixed-length window
   • Non-overlapping windows
     – No dynamic shrinking and/or expansion
     – Dynamic shrinking and/or expansion
   • Overlapping windows
     – No dynamic shrinking and/or expansion
     – Dynamic shrinking and/or expansion
2. Dynamic-length window
   • Non-overlapping windows
     – No dynamic shrinking and/or expansion
     – Dynamic shrinking and/or expansion
   • Overlapping windows
     – No dynamic shrinking and/or expansion
     – Dynamic shrinking and/or expansion

When training machine learning systems, windowing plays an implicit yet vital role for most applications when creating mini-batches (a sliding-window segmentation of this kind is sketched below). In a recent work, the authors proposed a framework that uses a sliding-window data scheduler to achieve state-of-the-art performance on, for instance, classification tasks [26]. More example use-cases of windowing associated with deep learning include object localization [27, 28], autonomous navigation [29], window slicing and pooling techniques in deep neural networks [30] and modeling temporal patterns [31]. Now that we have established the importance of signal sampling and windowing techniques for conveniently acquiring the sensor data for digital processing, we move forward to discuss, in the next section, how time and frequency domain signal processing approaches are being utilized to extract meaningful information from those data.
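As a simple illustration of the first design choice, a minimal sketch (the function, signal and parameters are arbitrary stand-ins) segments a 1-D activity stream into fixed-length, overlapping windows of the kind used to build mini-batches:

```python
import numpy as np

def sliding_windows(signal, win_len, overlap=0.5):
    """Split a 1-D activity signal into fixed-length, overlapping windows."""
    hop = max(1, int(win_len * (1.0 - overlap)))
    starts = range(0, len(signal) - win_len + 1, hop)
    return np.stack([signal[s:s + win_len] for s in starts])

# Example: 10 s of a 50 Hz sensor stream cut into 2 s windows with 50% overlap.
fs = 50
stream = np.random.randn(10 * fs)
batches = sliding_windows(stream, win_len=2 * fs, overlap=0.5)
print(batches.shape)  # (9, 100): 9 windows of 100 samples each
```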
4.3 Time and Frequency Domain Processing for Contactless Monitoring

Time and frequency domain techniques are applied to activity signals to analyze and enhance the signal. Different frequency domain transform techniques are frequently used in activity analysis. Time and frequency domain filtering is another important and widely used technique for signal enhancement. This section first discusses the applications of frequency-domain transforms. The latter part of the section provides a brief introduction to filtering and some notable use cases.
4.3.1 Applications of Frequency Domain Transforms

Frequency domain transforms are commonly applied to activity signals to analyze and leverage the periodicity information for decision-making purposes. A very practical use-case is remote photoplethysmography (rPPG) for monitoring heart-rate from surveillance videos [32, 33]. For example, in [34] the authors extracted the pixels of interest from the face images in consecutive video frames, took the average pixel values for each of the RGB channels, filtered out low-frequency components and investigated the frequency-domain representation to find the frequency with maximum power, which is a close approximation of the heart-rate. The most popular frequency-domain representation for such applications is the power spectral density (PSD), a measure of signal power at different frequencies. For speech analysis, such concentration of acoustic energy around a particular frequency, known as formants, has been used for a wide range of applications, including automatic speech recognition [35], voice activity detection [36] and speech enhancement [37]. When dealing with 1D temporal signals such as speech or ultrasound, one of the most popular analysis tools is the short-time Fourier transform (STFT), which is a frame-level frequency domain representation [38–40]. A visual extension of the STFT is the spectrogram (also known as sonographs/voicegrams/voiceprints), which is commonly plotted as a frequency vs. time 2D image where the pixel intensities represent the magnitude of the frequency component [41]. Spectrograms are calculated for short-time, overlapped, sliding windows of T time samples x = (x_1, x_2, . . . , x_T), where the temporal duration of the window is chosen to be small (typically 25–35 ms) to ensure that the speech within that frame is stationary. The value of the spectrogram at the k-th frequency bin is defined as

Spec_k(x) = | Σ_{t=1}^{T} e^{ikt} x_t |² = ( Σ_{t=1}^{T} cos(kt) x_t )² + ( Σ_{t=1}^{T} sin(kt) x_t )².    (4.7)
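As an illustration, a minimal SciPy sketch (assuming a 16 kHz sampling rate and using a random placeholder in place of real audio) computes such a spectrogram with 25 ms Hamming windows and a 10 ms hop, i.e., the magnitude-squared quantity of Eq. (4.7) per frame:

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                              # assumed sampling rate of a speech recording
x = np.random.randn(fs)                 # placeholder signal; replace with real audio
# 25 ms Hamming windows with a 10 ms hop.
f, t, Sxx = spectrogram(x, fs=fs, window="hamming",
                        nperseg=int(0.025 * fs), noverlap=int(0.015 * fs),
                        mode="magnitude")
Sxx = Sxx ** 2                          # power in each time-frequency bin
print(Sxx.shape)                        # (frequency bins, time frames)
```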
Spectrograms are convenient for visualizing the effects of speech enhancement, as can be seen in Fig. 4.6, obtained with permission from [42]. In [42], the authors addressed the problem of acoustic echo cancellation from speech under noisy conditions. Apart from the spectrogram to visualize the results, the authors also applied spectral subtraction [43] for noise reduction, which involves transforming the noisy signal into the frequency domain using the Fast Fourier Transform (FFT) [44] on short-term windows of the discretized speech signal and subtracting a frequency-domain estimate of the noise spectrum (usually obtained and updated from speech pauses) before reverting the signal to time domain samples using the inverse FFT (IFFT). Typically, spectrograms use linear frequency scaling. Mel-frequency scales were developed, inspired by the properties of the human auditory system, to follow a quasi-logarithmic spacing. Mel-frequency filters are non-uniformly spaced in the frequency domain, with more filters in the low-frequency region than in higher frequency regions. Cepstral coefficients obtained for the Mel-spectrum are popularly known as MFCC (Mel-
Fig. 4.6 Spectrogram for a original, b echo and noise corrupted, and c enhanced signal—reproduced with permission from [42]
frequency Cepstral Coefficients) features, which can be considered ‘biologically inspired’ speech features [45–47]. The following are notable use-cases of different variations of frequency-domain transforms in the contactless human activity analysis domain:
– Wavelet transform [48]: data compression such as the JPEG2000 image compression standard [49]; video-based human activity recognition [50, 51]; Doppler range control radar sensor-based fall detection [52]; WiFi signal-based human activity recognition [53]; audio compression [54].
– Discrete Cosine Transform [55]: 3D motion analysis [56]; audio compression [54].
– Laplace Transform [57–59]: non-articulatory sound recognition [60].
– Z-transform [57, 58]: speech recognition [61]; speech modeling and analysis [62]; pole-zero representation of linear physical systems for analysis and filter design [63].
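As an illustration of the spectral subtraction idea described earlier in this section (a rough sketch, not the method of [42] or [43]; the frame length, window and noise estimate are arbitrary assumptions), short windowed frames are transformed with the FFT, a noise magnitude estimate is subtracted, and the signal is resynthesized by overlap-add with the inverse FFT:

```python
import numpy as np

def spectral_subtraction(noisy, noise_mag, frame=512, hop=256):
    """Subtract a noise magnitude estimate in the frequency domain and
    resynthesize by overlap-add (a simplified sketch)."""
    window = np.hanning(frame)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame + 1, hop):
        seg = noisy[start:start + frame] * window
        spec = np.fft.rfft(seg)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # floor at zero
        clean = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
        out[start:start + frame] += clean * window
    return out

noisy = np.random.randn(8000)
# Noise magnitude spectrum, e.g. averaged over frames known to contain no speech.
noise_mag = np.abs(np.fft.rfft(np.random.randn(512) * np.hanning(512)))
print(spectral_subtraction(noisy, noise_mag).shape)
```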
4.3.2 Time and Frequency Domain Filtering

A filter is a function or operator that modifies a signal by performing mathematical operations to enhance or reduce certain aspects of the signal. If an n-dimensional signal is represented as an n-dimensional function, then, mathematically, a linearly-filtered 2D signal can be represented as
Fig. 4.7 Left—original image, middle—5 × 5 box filter kernel, right—filtered image
g(x, y) = Σ_{(m,n) ∈ W} f(x + m, y + n) h(m, n).    (4.8)
Here, h is known as the filter kernel and h(m, n) is known as a kernel weight or filter coefficient. A simple filter kernel is the moving average or box filter that computes the average over a neighborhood or window. Figure 4.7 shows an example application of such a box filter, which is also a form of low-pass/blurring filter. Applications of signal filtering include enhancement such as denoising and resizing, information extraction such as texture and edge extraction, pattern detection such as template matching, etc. Figure 4.8 shows such examples, where filtering is used to extract vertical and horizontal edges. An extension of the basic filters is adaptive filters, whose coefficients change based on an objective or cost function (Eq. 4.8). These filters are used to modify input signals so that their output is a reasonable estimate of the desired signal. Examples include Least Mean Square (LMS) adaptive filters, Recursive Least Square (RLS) adaptive filters, adaptive Wiener filters, and adaptive anisotropic filters. Adaptive filtering has applications in active noise control [64–66], echo cancellation [67], biomedical signal enhancement [68], tracking [69], and equalization of communication channels. Some notable use cases of filtering, such as contrast stretching and histogram equalization, denoising, and convolutional filters, are briefly discussed next.

Contrast Stretching and Histogram Equalization
A large number of pixels occupy only a small portion of the available range of intensities in a poorly contrasted image. The problem can efficiently be handled by histogram modification, thereby reassigning each pixel a new intensity value. That way, the dynamic range of gray levels is increased. Contrast stretching and histogram equalization are two such contrast enhancement techniques.
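As an illustration of the linear filtering in Eq. (4.8), a minimal SciPy sketch (with a random stand-in image) applies a 5 × 5 box kernel and simple Sobel-style kernels that emphasize vertical and horizontal edges, in the spirit of Figs. 4.7 and 4.8:

```python
import numpy as np
from scipy.ndimage import convolve

img = np.random.rand(128, 128)            # stand-in grayscale image

# 5 x 5 box (moving-average) kernel, as in Fig. 4.7.
box = np.ones((5, 5)) / 25.0
smoothed = convolve(img, box, mode="reflect")

# Simple Sobel-style kernels emphasizing vertical and horizontal edges (cf. Fig. 4.8).
k_vert = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=float)
vertical_edges = convolve(img, k_vert, mode="reflect")
horizontal_edges = convolve(img, k_vert.T, mode="reflect")
print(smoothed.shape, vertical_edges.shape, horizontal_edges.shape)
```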
Fig. 4.8 Left—original image. Middle:top—kernel that emphasizes vertical edges, bottom—kernel that emphasizes horizontal edges. Right—output feature map corresponding to the kernel on the left
The idea behind contrast stretching is to increase the dynamic range of the gray levels in the image being processed [24]. Contrast stretching is a simple image enhancement technique that attempts to improve the contrast in an image by ‘stretching’ the range of intensity values it contains to span a desired range of values, e.g., the full range of pixel values that the image type concerned allows. Histogram equalization is a method that increases the contrast of an image by increasing the dynamic range of intensity given to pixels with the most probable intensity values. Histogram equalization is a basic procedure that allows us to obtain a processed image with a specified intensity distribution. Sometimes, the distribution of the intensities of a scene is known to be non-uniform. The goal of histogram equalization is to map each pixel’s luminance to a new value such that the output image has an approximately uniform distribution of gray levels. To find the appropriate mapping, the cumulative distribution function (CDF) of the original image’s pixel values is matched with a uniform CDF [70] (Fig. 4.9).
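As an illustration of the CDF-matching step, a minimal sketch (assuming an 8-bit grayscale image; the helper name is arbitrary) builds a lookup table from the normalized cumulative histogram and remaps every pixel through it:

```python
import numpy as np

def equalize_histogram(img):
    """Map pixel intensities through the normalized CDF so that the output
    histogram is approximately uniform (8-bit grayscale assumed)."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(float)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())   # normalize to [0, 1]
    lut = np.round(cdf * 255).astype(np.uint8)          # lookup table
    return lut[img]

img = np.random.randint(80, 140, size=(64, 64), dtype=np.uint8)  # low-contrast image
print(img.min(), img.max(), equalize_histogram(img).min(), equalize_histogram(img).max())
```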
Fig. 4.9 Top: left—original image, middle—image enhanced by contrast stretching, right— enhanced by histogram equalisation. Bottom: histogram of pixel values for the corresponding top row image
Denoising
Denoising is the process of removing noise from a signal. Noise reduction techniques exist for both 1D signals such as speech and 2D signals such as images. Denoising is generally a pre-processing step used before extracting features from a signal. If we have a signal f that is corrupted with noise η as f̄(x, y) = f(x, y) + η(x, y), then a denoising filter h is a filter designed to estimate f such that

f(x, y) = Σ_{(m,n) ∈ W} f̄(x + m, y + n) h(m, n)    (4.9)
For example, a median filter is a denoising filter that performs very well on images containing binary noise such as salt and pepper noise. The median filter considers each pixel in the image in turn and looks at its nearby neighbors; to make sure that the pixel
Fig. 4.10 Left to right: Median filter of sizes 3 × 3, 5 × 5 and 7 × 7, respectively, are applied on a noisy image (left-most) for denoising
represents its surroundings, it is replaced with the median of those neighboring values. It is a non-linear filter, and its output is

f(x, y) = median{ f̄(x + m, y + n) : (m, n) ∈ W }    (4.10)
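As an illustration of Eq. (4.10), a minimal SciPy sketch (with a synthetic smooth image and roughly 5% salt-and-pepper corruption) applies a 3 × 3 median filter:

```python
import numpy as np
from scipy.ndimage import median_filter

# Smooth synthetic grayscale image in [0, 1].
x, y = np.meshgrid(np.linspace(0, 1, 128), np.linspace(0, 1, 128))
img = 0.5 * (x + y)

# Corrupt roughly 5% of the pixels with salt-and-pepper noise.
noisy = img.copy()
mask = np.random.rand(*img.shape) < 0.05
noisy[mask] = np.random.choice([0.0, 1.0], size=mask.sum())

# 3 x 3 median filtering, as in the left-most case of Fig. 4.10.
denoised = median_filter(noisy, size=3)
print(np.abs(noisy - img).mean(), np.abs(denoised - img).mean())  # error usually drops
```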
In general, the median filter allows a great deal of high spatial frequency detail to pass while remaining very effective at removing noise on images where less than half of the pixels in a smoothing neighborhood have been affected. One of the major problems with the median filter is that it is relatively expensive and slow to compute, since finding the median requires sorting all the neighborhood values into numerical order. A common enhancement technique is to reuse the relative sorting information from the previous neighborhood window in the next (Fig. 4.10).

Convolutional Filters
Another application where this kind of filtering is central is the convolutional neural network, or CNN [71]. Convolutional neural networks use multiple filters in parallel, where each kernel extracts a specific feature of the input. The convolutional layers are not only applied to the input, but they are also applied to the output of other layers. The outputs of these layers are called feature maps, as they contain valuable information extracted from the input that helps the network perform its task. Unlike traditional computer vision, where the kernels are generally hand-crafted, a CNN learns the kernels’ weights during the training of the network. For example, in [72] an excellent demo for visualizing the output of each convolution layer of a convolutional neural network trained to perform handwritten digit classification is presented. The input (a handwritten digit ‘4’), the intermediate convolutional and fully connected layer output features, as well as the final predicted class for a convolutional neural network trained on the MNIST dataset [73] are shown in Fig. 4.11. The network used is the famous LeNet-5 proposed in [74]. It can be observed that the outputs of the lower-level convolution layers (second and third rows from the bottom) are visually interpretable, such as edges and corners of the input image. In contrast, the visual information is abstracted out in the higher-level features produced by the fully connected layers (third and second rows from the top) to compress and convert the data into the output classification domain.
Fig. 4.11 Input, intermediate features and classification output (bottom to top) of a CNN produced using the web tool provided by [72]
The time and frequency domain filtering techniques discussed in this section are heavily utilized for signal pre-processing and meaningful feature extraction. In the next two sections, we discuss the low and high-level feature extraction methods that directly apply different signal processing methods.
4.4 Feature Extraction

A feature vector or descriptor encodes a signal in a way that allows it to be compared with another signal. A local descriptor describes or encodes a patch within the signal. Multiple local descriptors are used to encode or compare signals. Local descriptors are used in applications like activity recognition. A global descriptor describes the whole image. Global descriptors are generally used for applications like activity detection and classification.
4.4.1 Local Descriptors

Local descriptors describe a feature based on unique patterns present in the neighborhood of the feature location. Some feature descriptor algorithms have their own feature detectors. However, individual detectors can also be paired with different
descriptors. For convenience, this section is organized into two subsections. Section 4.4.1.1 discusses the time/spatial domain features and Sect. 4.4.1.2 discusses the frequency domain features.
4.4.1.1 Time/Spatial Domain Features
Time or spatial domain features are extracted from the time or spatial domain representation of the signal. Some of the widely used low-level features and their applications are briefly discussed next. They are generally easy to define and extract and have weaker requirements for invariant extraction [75]. The latter part of the section discusses some of the widely used high-level local feature descriptors. Zero-Crossing Rate (ZCR) is a time-domain feature that measures the signal’s noisiness. It is the rate of sign-changes of the signal. For the i-th frame of length N with samples x_i(n), where n = 0, 1, . . . , (N − 1), the ZCR is defined as

Z_i = (1 / (2N)) Σ_{n=1}^{N−1} | sgn[x_i(n)] − sgn[x_i(n − 1)] |,    (4.11)

where sgn[x] is the sign function defined as

sgn[x] = −1 if x ≤ 0, and 1 if x > 0.    (4.12)
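As an illustration of Eqs. (4.11) and (4.12), a minimal NumPy sketch (with arbitrary test signals) shows that a noise-like frame has a much higher ZCR than a clean low-frequency tone:

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR of one frame, following Eq. (4.11) with sgn as in Eq. (4.12)."""
    sgn = np.where(frame > 0, 1, -1)
    return np.abs(np.diff(sgn)).sum() / (2.0 * len(frame))

fs = 8000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 100 * t)        # clean 100 Hz tone
noise = np.random.randn(fs)               # white noise
print(zero_crossing_rate(tone), zero_crossing_rate(noise))  # noise has a much higher ZCR
```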
Kim et al. [76] proposed a new model for speech recognition in noisy environments that uses the ZCR. It is also used in speech-music discrimination [77], music genre classification [78], and several other applications. The signal envelope of an oscillating signal is the smooth curve outlining its extremes. The speech signal envelope and its changes are used in speech recognition applications [79]. The short-term energy of a signal is another simple time-domain feature. If a signal window contains N samples, then the short-term power is computed according to the equation:

E = (1 / N) Σ_{n=1}^{N} |x_i(n)|²    (4.13)

The short-term power exhibits high variation over successive speech windows, i.e., the power envelope rapidly alternates between high and low power states. Therefore, an alternative statistic that is independent of the signal intensity, the ratio of the standard deviation to the mean value, is also used. Signal power-based features are used in speech activity detection applications [80]. Edges are an important feature used in computer vision. The edges in an image are associated with discontinuities in the image intensity, which generally correspond to
Fig. 4.12 Output of Canny edge detector for an image
Fig. 4.13 Harris corner detection algorithm applied on an image
discontinuities in depth, variations in material properties or scene illumination, etc. Canny [81], Sobel [82] and Prewitt [83] are some of the most notable edge detectors (Fig. 4.12). Corner features are frequently used in motion detection, video tracking, and object recognition. A corner is defined as the intersection of two edges. In the region around a corner, the image gradient has two or more dominant directions. Corners are easily recognizable in an image when looking through a small window, as shifting the window in any direction gives a large change in intensity. The Shi-Tomasi detector [84] and the Harris detector [85] are two popular corner detectors (Fig. 4.13). Scale Invariant Feature Transform (SIFT) [86] is one of the most popular feature descriptors for images among the high-level local descriptors. SIFT has the scale invariance property. The feature extracted by the SIFT algorithm is called a feature descriptor, which consists of a normalized 128-dimensional vector that describes a feature point in terms of location, scale and orientation. SIFT features are used in activity analysis tasks such as behaviour detection [87], activity recognition [88], etc.
Another edge/gradient-based feature detector inspired by SIFT is speeded up robust features (SURF) [89]. The SURF approach’s main interest lies in its fast computation of operators using box filters, thus enabling real-time applications such as tracking and object recognition [90–93]. Despite their excellent performance, both SIFT and SURF are quite memory-intensive algorithms (512 bytes and 256 bytes per feature point, respectively), which makes them infeasible for resource-constrained applications. Binary Robust Independent Elementary Features (BRIEF) provides a shortcut to obtain binary strings from the floating-point feature descriptors [94]. One crucial point is that BRIEF is a feature descriptor; it does not offer any method to find the features, so a feature detector like SIFT, SURF, or FAST [95] has to be used to locate the keypoints. Gunduz et al. extracted crowd dynamics using BRIEF features in [96]. An efficient alternative to SIFT and SURF that provides better performance than BRIEF is the Oriented FAST and Rotated BRIEF (ORB) descriptor. BRIEF performs poorly with rotation, so ORB steers BRIEF according to the orientation of the keypoints. ORB features have been used in activity forecasting [97] and motion detection [98] applications. Binary Robust Invariant Scalable Keypoints (BRISK) [99], Fast Retina Keypoint (FREAK) [100], KAZE [101], and Accelerated-KAZE (AKAZE) [102] are some other widely used feature descriptors. Figure 4.14 shows the performance of different low level feature detectors.
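Ready-made implementations of several of these detectors and descriptors are available in OpenCV; a minimal sketch (with a random stand-in for a real grayscale frame) detects Shi-Tomasi corners and computes ORB keypoints with their 32-byte binary descriptors:

```python
import cv2
import numpy as np

# Stand-in grayscale image; in practice load a frame, e.g. cv2.imread(path, 0).
img = (np.random.rand(240, 320) * 255).astype(np.uint8)

# Shi-Tomasi corners (cf. Fig. 4.13).
corners = cv2.goodFeaturesToTrack(img, maxCorners=100, qualityLevel=0.01, minDistance=5)

# ORB keypoints and binary descriptors (32 bytes per keypoint).
orb = cv2.ORB_create(nfeatures=500)
keypoints, descriptors = orb.detectAndCompute(img, None)

n_corners = 0 if corners is None else len(corners)
n_desc = 0 if descriptors is None else descriptors.shape[0]
print(n_corners, len(keypoints), n_desc)
```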
4.4.1.2 Frequency Domain Features
The spectral centroid and the spectral spread are two measures of the spectral position and shape of a signal. The spectral centroid is the center of gravity of the spectrum, and the spectral spread is the second central moment of the spectrum. These features are useful in audio analysis tasks such as audio brightness prediction [103], audio timbre measurement [104], etc. Spectral entropy is another frequency domain feature. To compute the spectral entropy, the signal spectrum is first divided into L sub-bands, the energy e_l of each sub-band is normalized by the total spectral energy, and the entropy is finally computed as

H = − Σ_{l=0}^{L−1} (e_l / Σ_{k=0}^{L−1} e_k) log(e_l / Σ_{k=0}^{L−1} e_k).    (4.14)

The standard deviation of sequences of spectral entropy is used to classify sound classes [105, 106]. Other applications include music fingerprinting [107], encoding [108], signal monitoring [109], etc. A variant of spectral entropy called chromatic entropy has also been used to efficiently discriminate between speech and music.
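As an illustration of Eq. (4.14), a minimal NumPy sketch (sub-band count and test frames chosen arbitrarily) splits the power spectrum of a frame into L sub-bands and computes the entropy of the normalized sub-band energies; the value is low for a tonal frame and high for a noise-like one:

```python
import numpy as np

def spectral_entropy(frame, n_subbands=8):
    """Spectral entropy of one frame, following Eq. (4.14)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    # Split the spectrum into L sub-bands and sum the energy in each.
    subbands = np.array_split(spectrum, n_subbands)
    e = np.array([sb.sum() for sb in subbands])
    p = e / (e.sum() + 1e-12)              # normalize by total spectral energy
    return -np.sum(p * np.log(p + 1e-12))

noise_frame = np.random.randn(1024)                          # noise-like -> high entropy
tone_frame = np.sin(2 * np.pi * 50 * np.arange(1024) / 1024) # tonal -> low entropy
print(spectral_entropy(noise_frame), spectral_entropy(tone_frame))
```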
Fig. 4.14 Low level feature detection, from top left to bottom right, original, Shi-Tomasi, SIFT, SURF, FAST, and ORB
Other examples of 1-D low-level features include spectral flux [110], spectral roll-off, etc. Frequency domain techniques can be used on images in the same way as on one-dimensional speech signals. However, images do not have their information encoded in the frequency domain, making this technique much less useful for understanding the information encoded in images [111].
4.4.2 Global Descriptors

Among the different available global descriptors, the Motion History Image (MHI) and its variants are widely explored for various human action recognition applications [112–115]. The MHI template or image can incorporate the full motion information of a motion sequence or video in a compact manner [114]. It has been a popular template, especially for representing a single person’s action or motion information. A binarized representation of the MHI is called a Motion Energy Image (MEI) [114]. The MEI retains the entire motion area, or the locations where there was any motion information, over the entire video sequence. The MHI provides the history of the motion information and the direction or flow of the motion. On the other hand, the MEI retains the motion region or area—thereby, it provides the energy or the points of motion areas. A smarter silhouette sequence can allow us to get a better MHI template. The MHI also provides the temporal changes and directions of the motion. For example, suppose a video has sitting-to-standing sequences. In that case, the produced MHI can give a final image where past or initial motion information becomes less bright than the later or final motion regions, which have brighter pixel values. From these, we can infer that the motion is from a lower to an upper direction. The MHI representation is less sensitive to shadows, silhouette noise, or minor missing parts. Figure 4.15 (top row) depicts five Motion History Images for an action for the first 10 frames (as shown in the 1st column), until 15 frames, until 34 frames, until 36 frames, and until the end of 46 frames from the beginning [112]. The respective Motion Energy Images are demonstrated in Fig. 4.15 (bottom row) for the same action. These are computed from a gesture from the Kaggle Gesture Challenge ‘ChaLearn’ Database. The MHI can be used for action recognition and analysis, gait recognition, gesture recognition, video analysis, surveillance, face-based depressive symptomatology [116] analysis, fall detection [117], visualization of hypoperfusion (decreased blood flow) in a mouse brain [118], depth image-based action recognition
Fig. 4.15 Examples of the computation of the MHI (top row) and the MEI (bottom row) images for a gesture at different temporal states from the beginning of the action [112]
and removal of self-occlusion [119], body movement trajectory recognition [120], biospeckle assessment of growing bacteria [121], and emotion recognition [122]. It has also been explored for gaming and other interactive and real-time applications, as the computational cost is minimal.

There have been a number of variants on top of the MHI. For example, Average Motion Energy (AME) [123], Mean Motion Shape (MMS) [123], the Motion-shape Model, modified-MHI, Silhouette History Image (SHI) [124], Silhouette Energy Image (SEI) [124], Hierarchical Motion History Histogram (HMHH) [125], Directional Motion History Image (DMHI) [126, 127], Multi-level Motion History Image (MMHI) [128], Edge MHI [129], Hierarchical Filtered Motion (HFM) [130], Landmark MHI [131], Gabor MHI [116], Enhanced-MHI [117], Local Enhanced MHI (LEMHI) [122, 132], etc. are exploited for human action recognition. For gait recognition with the MHI/MEI, Dominant Energy Image (DEI) [133], Motion Energy Histogram (MEH) [134], Gait Moment Energy Image (GMI) [135] and Moment Deviation Image (MDI) [135] are explored along with the most widely explored approach for gait recognition, called Gait Energy Image (GEI) [136]. To date, the GEI remains the unparalleled leader among gait recognition methods. Motion Color Image (MCI) [137], Volume Motion Template (VMT) [138], Silhouette History Image (SHI), Silhouette Energy Image (SEI), etc. are exploited for gesture recognition. Motion History Volume (MHV) [139, 140] and Motion Energy Volume (MEV) are explored to detect unusual behavior for video surveillance applications. Volumetric Motion History Image (VMHI) [141, 142] is another model, similar to the VMT [138] or the MHV [140], that serves as a 3D model of the MHI template for other applications.

The MHI and its variants have also seen some applications in the deep learning domain. A recent work explored the MHI in deep learning [143] for gesture recognition, feeding the MHI into a 2D CNN-based VGGNet, in parallel with a 3D DenseNet model, to recognize gestures. Depressive symptomatology is assessed by using a variant of the MHI called Gabor MHI [116], where deep learning is also explored in the method. In another approach, the MHI is used with a ResNet classifier to detect cyclists’ early-start intention [144]. For emotion recognition, a Local Enhanced MHI (LEMHI) is fed into a CNN network in [122, 132]. In the future, the MHI or its variants can be explored further along with deep learning approaches by researchers.

Convolutional Neural Networks (CNNs) are also successfully being used to generate global descriptors. Autoencoder networks [145] learn a compact representation/descriptor of the input data, which is used in dimensionality reduction [146] and clustering [147]. State-of-the-art classifier models such as ResNet [148], InceptionNet [149] and RetinaNet [150] are also used in global descriptor learning [151] and have demonstrated superior performance over traditional embeddings [152].
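As an illustration of the basic MHI/MEI computation described above, a minimal sketch (assuming binary silhouettes are already available, e.g., from frame differencing, and with arbitrary decay parameters) sets moving pixels to a maximum value τ, decays the rest, and binarizes the result to obtain the MEI:

```python
import numpy as np

def update_mhi(mhi, silhouette, tau=30, delta=1):
    """One MHI update step: pixels with motion are set to tau, others decay."""
    return np.where(silhouette > 0, float(tau), np.maximum(mhi - delta, 0.0))

# Toy sequence of binary silhouettes: a blob moving from left to right.
frames = [np.zeros((64, 64), dtype=np.uint8) for _ in range(10)]
for i, f in enumerate(frames):
    f[20:40, 5 + 5 * i:15 + 5 * i] = 1

mhi = np.zeros((64, 64))
for sil in frames:
    mhi = update_mhi(mhi, sil, tau=len(frames))

# Brighter values mark the most recent motion; binarizing the MHI gives the MEI.
mei = (mhi > 0).astype(np.uint8)
print(mhi.max(), mei.sum())
```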
4.5 Dimensionality Reduction Methods

Working with high-dimensional data is not always feasible due to high computational requirements and raw data sparsity. Dimensionality reduction is the process of transforming data from a high-dimensional space to a low-dimensional space while retaining some of its meaningful properties, typically when the data with dimensionality D lying on a space S has an intrinsic dimensionality d, where d < D and often d ≪ D.
Table 8.5 Publicly available datasets on contactless fall detection approaches [37, 120–125] (columns: References, Dataset, # Falls, Real/Acted, # Subjects, Age, Environment, Sensing unit). Legend: Ref.—Reference; N/A—Not Available; IMU—Inertial Measurement Unit; IR—Infrared
• Vision-based systems are ineffective in the dark, as surveillance cameras struggle in low-light and varied illumination conditions [33, 128].
• Subjects are required to remain in the line of sight while being monitored.
• The occlusion problem [58] is another important constraint, although many methods claim to overcome it.
• Vision-based fall detection systems always use complex computer vision and image processing techniques to track an individual, requiring substantial computing and storage capacity to operate a real-time algorithm, so in those cases the computational cost is a bit higher [33, 50].
• Moreover, vision-based methods are case-specific and depend on different scenarios [14].
• Their performance varies from situation to situation [129].
• Camera-based sensors may raise privacy concerns in private rooms [128].

While considering acoustic sensor-based methods, there are some limitations to work on, e.g.:
• These approaches work only in directional states (ultrasonic).
• Other audio signals/noise influence the performance of these approaches.
• Acoustic sensors are very temperature-sensitive.
• These sensors demand that the angle of the target be precise (ultrasonic).
• Sound-based approaches are very susceptible to false detection.
• Acoustic receivers can easily be interrupted by noise.
• These systems are costly to retrofit in indoor environments.
RFID-based fall detection systems have several challenges too. Some of the important issues are:
• Transmission of the RFID signal may result in interference with other electromagnetic signals present in the home environment. This probable interruption has an impact on overall performance [130].
• If any intruder gets access to personal unencrypted data through RFID tags, then privacy and security become matters of important concern.
• Previously, RFID technology was very expensive to implement, but these days it is very cost-efficient and secure. However, in developing countries, regulatory policies and social exclusion are two of the biggest challenges to face.
• The multipath effect (e.g., ghost targets) is an issue for radar-based detection systems. Clutter in home settings helps generate those effects.
• People or any other interference in the target area have a strong probability of impacting performance, creating complexities and producing false alarms.
Some recommendations are made for the advancement of contactless fall detection approaches in the future:
• A general evaluation guideline is required, which will serve as the standard for every fall detection system.
• Large real-life datasets considering different fall events in diverse scenarios should be made publicly available to the scientific community for development and research purposes, as the best results require the best datasets for training and evaluation. Simulated falls (94% of fall datasets) result in false detections [131]. It is difficult to gather real fall data, but long-term experiments need to be conducted in nursing homes using wearable and contactless sensor-based methods [132–136].
• Redesigning current systems may be a new direction for future research. Novel sensors may yield state-of-the-art performance. If novel sensors are difficult to develop soon, sensor fusion techniques can be used to develop hybrid systems; for example, many researchers are working on combined audio- and video-based fall detection. Combining wearable and contactless sensing might be a future research area to focus on [137].
• Taking a constructive approach by recognizing fall risk factors and improving fall prevention mechanisms might be a better option than only identifying and responding to critical events at the correct time.
• Self-learning capabilities need to be adopted in order to decrease false alarms. Hence, an adaptive fall detection and prevention system with such adaptive capabilities is a new development to detect or predict falls.
• More and more complex scenarios need to be considered, as in [31], in vision-based approaches. Dependency on light needs to be reduced, so depth images can be explored [33]. Preprocessing steps need to be reduced; this will help in sorting out issues like occlusion, shadow and cluttering [54].
• In RF-based approaches, software-defined radios can be used to retrieve the same information, where the number of sub-carriers can be changed as per the particular environment's requirements. In acoustic-based approaches, analyzing surrounding noise and learning about its features will help to classify more accurately [88].
• Multiple radar sensors can be utilized in indoor settings in radar-based systems, and this may be examined as future work, including improving the compatibility of different systems operating in tandem. A method incorporating multi-static radar systems with different nodes, with a space-separated transmitter and receiver capable of covering a large indoor area, i.e., extended range, provides favorable performance in outdoor settings while classifying micro-Dopplers [138–142].
8.6 Conclusion

For elderly people aged 65 and above, the risk of falling increases with age. Injury-related deaths among elderly people can mostly be attributed to falling down. That is why the scientific community has been giving particular attention to detecting elderly falls in recent years. Fall detection is an active research topic for monitoring seniors who are at home alone. Many methods have been proposed by different researchers but, unfortunately, the targeted goal has not been achieved yet in terms of
accuracy in classification. Among the two main types of fall detection approaches, in this chapter we discussed contactless fall detection approaches, which include vision-, radar-, acoustic-, floor sensor- and radio frequency-based fall detection. More than 50 research articles on these contactless approaches have been analyzed in this chapter. By analyzing those papers, we came to learn about the current research practices in this field of fall detection. Observations on the use of AI methods and the availability of public datasets have been noted and mentioned in the chapter. This chapter also points out many limitations of contactless approaches, which need to be addressed soon. Most importantly, this chapter analyzes the ongoing approaches to provide potential future research directions for contactless elderly fall detection. Acknowledgements This research was supported by the Information and Communication Technology Division of the Government of the People’s Republic of Bangladesh in 2018–2019.
References 1. Tinetti, M.E., Kumar, C.: The patient who falls:“it’s always a trade-off”. Jama 303(3), 258–266 (2010) 2. Islam, Z.Z., Tazwar, S.M., Islam, Z. Md., Serikawa, S., Ahad A.R. Md.: Automatic fall detection system of unsupervised elderly people using smartphone. In: 5th IIAE International Conference on Intelligent Systems and Image Processing, Hawaii, USA, (2017) 3. Hossain, T., Ahad, M.A.R., Inoue, S.: A method for sensor-based activity recognition in missing data scenario. Sens. 20(14), 3811 (2020) 4. Beard, J., Biggs, S., Bloom, D.E., Fried, L.P., Hogan, P.R., Kalache, A., Olshansky, S.J., et al.: Global population ageing: peril or promise? Technical report, Program on the Global Demography of Aging (2012) 5. Ayta, I.A., McKinlay, J.B., Krane, R.J.: The likely worldwide increase in erectile dysfunction between 1995 and 2025 and some possible policy consequences. BJU Int. 84(1), 50–56 (1999) 6. Luque, R., Casilari, E., Morón, M-J., Redondo, G.: Comparison and characterization of android-based fall detection systems. Sensors 14(10), 18543–18574 (2014) 7. Day, L.: Falls in older people: risk factors and strategies for prevention, by sr lord, c sherrington, and hb menz, (pp. 249; a $85.00). cambridge university press (private bag 31, port melbourne, vic 3207, australia) (2001). ISBN: 0-521-58964-9 (2003) 8. Stevens, J.A., Corso, P.S., Finkelstein, E.A., Miller, T.R.: The costs of fatal and non-fatal falls among older adults. Injury Prevent. 12(5):290–295 (2006) 9. Baraff, L.A., Penna, R.D., Williams, N., Sanders, A.: Practice guideline for the ed management of falls in community-dwelling elderly persons. Annals Emerg. Med. 30(4), 480–492 (1997) 10. Stevens , J.A., Sogolow, E.D.: Gender differences for non-fatal unintentional fall related injuries among older adults. Injury Prevent. 11(2), 115–119 (2005) 11. Baig, M.M., Afifi, S., GholamHosseini, H., Mirza, F.: A systematic review of wearable sensors and IoT-based monitoring applications for older adults—a focus on ageing population and independent living. 8 (2019) 12. Baig, M.M., Gholamhosseini, H., Connolly, M.J.: Falls risk assessment for hospitalised older adults: a combination of motion data and vital signs. Aging Clinical Exper. Res. 28(6), 1159– 1168 (2016) 13. Al Nahian, M.J., Ghosh, T., Uddin, M.N., Islam, M.M., Mahmud, M., Kaiser, M.S.: Towards artificial intelligence driven emotion aware fall monitoring framework suitable for elderly
people with neurological disorder. In: International Conference on Brain Informatics, pp. 275–286. Springer (2020)
14. Mubashir, M., Shao, L., Seed, L.: A survey on fall detection: principles and approaches. Neurocomputing 100, 144–152 (2013)
15. Zhang, C., Tian, Y.: Rgb-d camera-based daily living activity recognition. J. Comput. Vision Image Proc. 2(4), 12 (2012)
16. Zitouni, M., Pan, Q., Brulin, D., Campo, E., et al.: Design of a smart sole with advanced fall detection algorithm. J. Sensor Technol. 9(04), 71 (2019)
17. Thomas, S.S., Nathan, V., Zong, C., Soundarapandian, K., Shi, X., Jafari, R.: Biowatch: a noninvasive wrist-based blood pressure monitor that incorporates training techniques for posture and subject variability. IEEE J. Biomed. Health Inform. 20(5), 1291–1300 (2015)
18. Wu, J., Li, H., Cheng, S., Lin, Z.: The promising future of healthcare services: when big data analytics meets wearable technology. Inform. Manag. 53(8), 1020–1033 (2016)
19. Narendrakumar, A.: Reliable energy efficient trust based data transmission for dynamic wireless sensor networks. In: 2017 International Conference on Intelligent Computing, Instrumentation and Control Technologies (ICICICT), pp. 190–194. IEEE (2017)
20. Kamoi, H., Toyoda, K., Ohtsuki, T.: Fall detection using uhf passive rfid based on the neighborhood preservation principle. In: 2018 IEEE Global Communications Conference (GLOBECOM), pp. 1–6. IEEE (2018)
21. Kyriazakos, S., Mihaylov, M., Anggorojati, B., Mihovska, A., Craciunescu, R., Fratu, O., Prasad, R.: ewall: an intelligent caring home environment offering personalized context-aware applications based on advanced sensing. Wireless Personal Comm. 87(3), 1093–1111 (2016)
22. Lee, W.K., Yoon, H., Park, K.S.: Smart ecg monitoring patch with built-in r-peak detection for long-term hrv analysis. Annals Biomed. Eng. 44(7), 2292–2301 (2016)
23. Etemadi, M., Inan, O.T., Heller, J.A., Hersek, S., Klein, L., Roy, S.: A wearable patch to enable long-term monitoring of environmental, activity and hemodynamics variables. IEEE Trans. Biomed. Circuits Syst. 10(2), 280–288 (2015)
24. Chen, M., Ma, Y., Song, J., Lai, C.-F., Bin, H.: Smart clothing: connecting human with clouds and big data for sustainable health monitoring. Mobile Netw. Appl. 21(5), 825–845 (2016)
25. Ghosh, A.M., Halder, D., Hossain, S.K.A.: Remote health monitoring system through iot. In: 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV), pp. 921–926. IEEE (2016)
26. Balamurugan, S., Madhukanth, R., Prabhakaran, V.M., Shanker, R.G.K.: Internet of health: applying iot and big data to manage healthcare systems. Int. Res. J. Eng. Technol. 310, 732–735 (2016)
27. Anaya, L.H.S., Alsadoon, A., Costadopoulos, N., Prasad, P.W.C.: Ethical implications of user perceptions of wearable devices. Sci. Eng. Ethics 24(1), 1–28 (2018)
28. Sivathanu, B.: Adoption of internet of things (iot) based wearables for healthcare of older adults–a behavioural reasoning theory (brt) approach. J. Enabl. Technol. (2018)
29. Adapa, A., Nah, F.F.H., Hall, R.H., Siau, K., Smith, S.N.: Factors influencing the adoption of smart wearable devices. Int. J. Human Comput. Inter. 34(5), 399–409 (2018)
30. Habibipour, A., Padyab, A., Ståhlbröst, A.: Social, ethical and ecological issues in wearable technologies. Twenty Fifth Am. Confer. Inform. Syst. Cancun 2019, 1–10 (2019)
31. Feng, Q., Gao, C., Wang, L., Zhao, Y., Song, T., Li, Q.: Spatio-temporal fall event detection in complex scenes using attention guided lstm. Pattern Recogn. Lett. 130, 242–249 (2020)
32. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep association metric. In: 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. IEEE (2017)
33. Kong, Y., Huang, J., Huang, S., Wei, Z., Wang, S.: Learning spatiotemporal representations for human fall detection in surveillance video. J. Visual Comm. Image Represent. 59, 215–230 (2019)
34. Ahad, M.A.R.: Motion history images for action recognition and understanding. Springer Science & Business Media (2012)
35. Ahad, M.A.R.: Computer vision and action recognition: a guide for image processing and computer vision community for action understanding, vol. 5. Springer Science & Business Media (2011) 36. Ahad, M.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Motion history image: its variants and applications. Machine Vision Appl. 23(2), 255–281 (2012) 37. Ma, X., Wang, H., Xue, B., Zhou, M., Ji, B., Li, Y.: Depth-based human fall detection via shape features and improved extreme learning machine. IEEE J. Biomed. Health Inform. 18(6), 1915–1922 (2014) 38. Kwolek, B., Kepski, M.: Improving fall detection by the use of depth sensor and accelerometer. Neurocomputing 168, 637–645 (2015) 39. Noury, N., Rumeau, P., Bourke, A.K., ÓLaighin, G., Lundy, J.E.: A proposal for the classification and evaluation of fall detectors. Irbm 29(6), 340–349 (2008) 40. Fan, Y., Levine, M.D., Wen, G., Qiu, S.: A deep neural network for real-time detection of falling humans in naturally occurring scenes. Neurocomputing 260, 43–58 (2017) 41. Goudelis, G., Tsatiris, G., Karpouzis, K., Kollias, S.: Fall detection using history triple features. In: Proceedings of the 8th ACM International Conference on PErvasive Technologies Related to Assistive Environments, pp. 1–7 (2015) 42. Yun, Y., Gu, I.Y.H.: Human fall detection in videos via boosting and fusing statistical features of appearance, shape and motion dynamics on riemannian manifolds with applications to assisted living. Comput. Vision Image Underst. 148, 111–122 (2016) 43. Cippitelli, E., Fioranelli, F., Gambi, E., Spinsante, S.: Radar and rgb-depth sensors for fall detection: a review. IEEE Sens. J. 17(12), 3585–3604 (2017) 44. Kong, X., Meng, Z., Nojiri, N., Iwahori, Y., Meng, L., Tomiyama, H.: A hog-svm based fall detection iot system for elderly persons using deep sensor. Procedia Comput. Sci. 147, 276–282 (2019) 45. Khan, M.S., Yu, M., Feng, P., Wang, L., Chambers, J.: An unsupervised acoustic fall detection system using source separation for sound interference suppression. Signal Proc.110, 199–210 (2015) 46. Popescu, M., Li, Y., Skubic, M., Rantz, M.: An acoustic fall detector system that uses sound height information to reduce the false alarm rate. In: 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 4628–4631. IEEE (2008) 47. Li, Y., Ho, K.C., Popescu., M.: A microphone array system for automatic fall detection. IEEE Trans. Biomed. Eng. 59(5), 1291–1301 (2012) 48. Töreyin, B.U., Dedeo˘glu, Y., Çetin, A.E.: Hmm based falling person detection using both audio and video. In: International Workshop on Human-Computer Interaction, pp. 211–220. Springer (2005) 49. Geertsema, E.E., Visser, G.H., Viergever, M.A., Kalitzin, S.N.: Automated remote fall detection using impact features from video and audio. J. Biomech. 88, 25–32 (2019) 50. Qingzhen, X., Huang, G., Mengjing, Y., Guo, Y.: Fall prediction based on key points of human bones. Physica A Stat. Mech. Appl. 540, 123205 (2020) 51. Bradski, G., Kaehler, A.: Learning OpenCV: computer vision with the OpenCV library. " O’Reilly Media, Inc. (2008) 52. Bradski, G.: The opencv library. Dr Dobb’s J. Softw Tools 25, 120–125 (2000) 53. Chua, J.L., Chang, Y.C., Lim, W.K.: A simple vision-based fall detection technique for indoor video surveillance. Signal Image Video Proc. 9(3), 623–633 (2015) 54. Tran, T.H., Le, T.L., Hoang, V.N., Hai, V.: Continuous detection of human fall using multimodal features from kinect sensors in scalable environment. Comput. Methods Progr. Biomed. 
146, 151–165 (2017) 55. Atrey, P.K., Kankanhalli, M.S., Cavallaro, A.: Intelligent multimedia surveillance: current trends and research. Springer (2013) 56. Ma, C., Shimada, A., Uchiyama, H., Nagahara, H., Taniguchi, R.: Fall detection using optical level anonymous image sensing system. Optics Laser Technol. 110, 44–61 (2019)
57. El Kaid, A., Baïna, K., Baïna, J.: Reduce false positive alerts for elderly person fall videodetection algorithm by convolutional neural network model. Procedia Comput. Sci. 148, 2–11 (2019) 58. Iuga, C., Dr˘agan, P., Bu¸soniu, L.: Fall monitoring and detection for at-risk persons using a uav. IFAC Papers OnLine 51(10), 199–204 (2018) 59. Redmon, J., Farhadi, A.: Yolo9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271 (2017) 60. Fang, W., Zhong, B., Zhao, N., Love, E.D., Luo, H., Xue, J., Xu, S.: A deep learning-based approach for mitigating falls from height with computer vision: convolutional neural network. Adv. Eng. Inform. 39, 170–177 (2019) 61. Fang, Q., Li, H., Luo, X., Ding, L., Luo, H., Li, C.: Computer vision aided inspection on falling prevention measures for steeplejacks in an aerial environment. Autom. Constr. 93, 148–164 (2018) 62. Li, W., Tan, B., Piechocki, R.: Passive radar for opportunistic monitoring in e-health applications. IEEE J. Trans. Eng. Health Med. 6, 1–10 (2018) 63. He, M., Nian, Y., Zhang, Z., Liu, X., Hu, H.: Human fall detection based on machine learning using a thz radar system. In: 2019 IEEE Radar Conference (RadarConf), pp. 1–5. IEEE (2019) 64. Ding, C., Zou, Y., Sun, L., Hong, H., Zhu, X., Li, C.: Fall detection with multi-domain features by a portable fmcw radar. In: 2019 IEEE MTT-S International Wireless Symposium (IWS), pp. 1–3. IEEE (2019) 65. Li, H., Shrestha, A., Heidari, H., Kernec, J.L., Fioranelli, F.: Activities recognition and fall detection in continuous data streams using radar sensor. In: 2019 IEEE MTT-S International Microwave Biomedical Conference (IMBioC), vol. 1, pp. 1–4. IEEE (2019) 66. Su, B.Y., Ho, K.C., Rantz, M.J., Skubic, M.: Doppler radar fall activity detection using the wavelet transform. IEEE Trans. Biomed. Eng. 62(3), 865–875 (2014) 67. Yoshino, H., Moshnyaga, V.G., Hashimoto, K.: Fall detection on a single doppler radar sensor by using convolutional neural networks. In: 2019 IEEE International Conference on Systems, Man and Cybernetics (SMC), pp. 2889–2892. IEEE (2019) 68. Erol, B., Amin, M.: Effects of range spread and aspect angle on radar fall detection. In: 2016 IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM), pp. 1–5. IEEE (2016) 69. Sadreazami, H., Bolic, M., Rajan, S.: Fall detection using standoff radar-based sensing and deep convolutional neural network. Express Briefs IEEE Trans. Circuits Syst. II (2019) 70. Chen, S., Fan, C., Huang, X., Cao, C.: Low prf low frequency radar sensor for fall detection by using deep learning. In: 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), pp. 400–404. IEEE (2019) 71. Erol, B., Amin, M.G.: Radar data cube analysis for fall detection. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2446–2450. IEEE (2018) 72. Dremina, M.K., Anishchenko, L.N.: Contactless fall detection by means of cw bioradar. In: 2016 Progress in Electromagnetic Research Symposium (PIERS), pp. 2912–2915. IEEE (2016) 73. Lu, H., Plataniotis, K.N., Venetsanopoulos, A.N.: Mpca: Multilinear principal component analysis of tensor objects. IEEE Trans. Neural Netw. 19(1), 18–39 (2008) 74. Högbom, J.A.: Aperture synthesis with a non-regular distribution of interferometer baselines. Astron. Astrophys. Suppl. Series 15, 417 (1974) 75. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. 
In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012) 76. Wang, H., Zhang, D., Wang, Y., Ma, J., Wang, Y., Li, S.: Rt-fall: a real-time and contactless fall detection system with commodity wifi devices. IEEE Trans. Mobile Comput. 16(2), 511–526 (2016) 77. Yang, X., Xiong, F., Shao, Y., Niu, Q.: Wmfall: Wifi-based multistage fall detection with channel state information. Int. J. Distr. Sens. Netw. 14(10), 1550147718805718 (2018)
232
M. J. A. Nahian et al.
78. Narui, H., Shu, R., Gonzalez-Navarro, F.F., Ermon, S.: Domain adaptation for human fall detection using wifi channel state information. In: International Workshop on Health Intelligence, pp. 177–181. Springer (2019) 79. Toda, K., Shinomiya, N.: Machine learning-based fall detection system for the elderly using passive rfid sensor tags. In: 2019 13th International Conference on Sensing Technology (ICST), pp. 1–6. IEEE (2019) 80. Kaudki, B., Surve, A.: Human fall detection using rfid technology. In: 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), pp. 1–5. IEEE (2018) 81. Borhani, A., Pätzold, M.: A non-stationary channel model for the development of nonwearable radio fall detection systems. IEEE Trans. Wireless Comm. 17(11), 7718–7730 (2018) 82. Ruan, W., Yao, L., Sheng, Q.Z., Falkner, N., Li, X., Gu., T.: Tagfall: towards unobstructive fine-grained fall detection based on uhf passive rfid tags. In: proceedings of the 12th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services on 12th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services, pp. 140–149 (2015) 83. Kriegel, H.P., Schubert, M., Zimek, A.: Angle-based outlier detection in high-dimensional data. In: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 444–452 (2008) 84. Madhubala, J.S., Umamakeswari, A., Rani, B.J.A.: A survey on technical approaches in fall detection system. National J. Physiol. Pharmacy Pharmacol. 5(4), 275 (2015) 85. Droghini, D., Ferretti, D., Principi, E., Squartini, S., Piazza, F.: An end-to-end unsupervised approach employing convolutional neural network autoencoders for human fall detection. In: Italian Workshop on Neural Nets, pp. 185–196. Springer (2017) 86. Popescu, M., Mahnot, A.: Acoustic fall detection using one-class classifiers. In: 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 3505–3508. IEEE (2009) 87. Droghini, D., Ferretti, D., Principi, E., Squartini, S., Piazza, F.: A combined one-class svm and template-matching approach for user-aided human fall detection by means of floor acoustic features. Comput. Intell. Neurosci. 2017 (2017) 88. Droghini, D., Vesperini, F., Principi, E., Squartini, S., Piazza, F.: Few-shot siamese neural networks employing audio features for human-fall detection. In: Proceedings of the International Conference on Pattern Recognition and Artificial Intelligence, pp. 63–69 (2018) 89. Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a siamese time delay neural network. In: Advances in neural information processing systems, pp. 737–744 (1994) 90. Droghini, D., Principi, E., Squartini, S., Olivetti, P., Piazza, F.: Human fall detection by using an innovative floor acoustic sensor. In: Multidisciplinary Approaches to Neural Computing, pp. 97–107. Springer (2018) 91. Principi, E., Droghini, D., Squartini, S., Olivetti, P., Piazza, F.: Acoustic cues from the floor: a new approach for fall classification. Expert Syst. Appl. 60, 51–61 (2016) 92. Principi, E., Olivetti, P., Squartini, S., Bonfigli, R., Piazza, F.: A floor acoustic sensor for fall classification. In: Audio Engineering Society Convention 138. Audio Engineering Society (2015) 93. Li, Y., Banerjee, T., Popescu, M., Skubic, M.: Improvement of acoustic fall detection using kinect depth sensing. 
In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 6736–6739. IEEE (2013) 94. Adnan, S.M., Irtaza, A., Aziz, S., Obaid Ullah, M.O., Javed, A., Mahmood, M.T.: Fall detection through acoustic local ternary patterns. Applied Acoustics 140, 296–300 (2018) 95. Irtaza, A., Adnan, S.M., Aziz, S., Javed, A., Obaid Ullah, M., Mahmood, M.T.: A framework for fall detection of elderly people by analyzing environmental sounds through acoustic local ternary patterns. In: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1558–1563. IEEE (2017)
8 Contactless Fall Detection for the Elderly
233
96. Popescu, M., Coupland, S., Date, S.: A fuzzy logic system for acoustic fall detection. In: AAAI Fall Symposium: AI in Eldercare: New Solutions to Old Problems, pp. 78–83 (2008) 97. Buerano, J., Zalameda, J., Ruiz, R.S.: Microphone system optimization for free fall impact acoustic method in detection of rice kernel damage. Comput. Electr. Agri. 85, 140–148 (2012) 98. Zhuang, X., Huang, J., Potamianos, G., Hasegawa-Johnson, M.: Acoustic fall detection using gaussian mixture models and gmm supervectors. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 69–72. IEEE (2009) 99. Kumar, V., Yeo, B.C., Lim, W.S., Raja, J.E., Koh, K.B.: Development of electronic floor mat for fall detection and elderly care. Asian J. Scient. Res. 11, 344–356 (2018) 100. Clemente, J., Song, W., Valero, M., Li, F., Liy, X.: Indoor person identification and fall detection through non-intrusive floor seismic sensing. In: 2019 IEEE International Conference on Smart Computing (SMARTCOMP), pp. 417–424. IEEE (2019) 101. Feng, G., Mai, J., Ban, Z., Guo, X., Wang, G.: Floor pressure imaging for fall detection with fiber-optic sensors. IEEE Pervasive Comput. 15(2), 40–47 (2016) 102. Litvak, D., Zigel, Y., Gannot, I.: Fall detection of elderly through floor vibrations and sound. In: 2008 30th annual international conference of the IEEE engineering in medicine and biology society, pp. 4632–4635. IEEE (2008) 103. Minvielle, L., Atiq, M., Serra, R., Mougeot, M., Vayatis, N.: Fall detection using smart floor sensor and supervised learning. In: 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 3445–3448. IEEE (2017) 104. Liu, C., Jiang, Z., Xiangxiang, S., Benzoni, S., Maxwell, A.: Detection of human fall using floor vibration and multi-features semi-supervised svm. Sensors 19(17), 3720 (2019) 105. Chaccour, K., Darazi, R., el Hassans, A.H., Andres, E.: Smart carpet using differential piezoresistive pressure sensors for elderly fall detection. In: 2015 IEEE 11th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob), pp. 225– 229. IEEE (2015) 106. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, vol. 1, pp. I–I. IEEE (2001) 107. Noble, W.S.: What is a support vector machine? p. 12 (2006) 108. Peterson, L.: K-nearest neighbor. Scholarpedia 4(2), 1883 (2009) 109. Safavian, S.R., Landgreb, D.:. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 21(3), 660–674 (1991) 110. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 10 (2001) 111. Stratonovich, R.L.: Conditional markov processes. In: Non-linear transformations of stochastic processes, pp. 427–453. Elsevier (1965) 112. Rish, I., et al.: An empirical study of the naive bayes classifier. In: IJCAI 2001 workshop on empirical methods in artificial intelligence, pp. 41–46 (2001) 113. Banna, M.H.A., Haider, M.A., Nahian, M.J.A., Islam, M.M., Taher, K.A., Kaiser, M.S.: Camera model identification using deep cnn and transfer learning approach. In: 2019 International Conference on Robotics, Electrical and Signal Processing Techniques (ICREST), pp. 626– 630. IEEE (2019) 114. Jain, A.K., Mao, J., Mohiuddin, K.M.: Artificial neural networks: a tutorial. Computer 29(3), 31–44 3 (1996) 115. 
Ahad, M.A.R., Antar, A.D., Ahmed, M.: Sensor-based benchmark datasets: comparison and analysis. In: IoT Sensor-Based Activity Recognition, pp. 95–121. Springer (2020) 116. Martínez-Villaseñor, L., Ponce, H., Brieva, J., Moya-Albor, E., Núñez-Martínez, J., PeñafortAsturiano, C.: Up-fall detection dataset: a multimodal approach. Sensors 19(9), 1988 (2019) 117. Tran, T.H., Le, T.L., Pham, D.T., Hoang, V.N., Khong, V.M., Tran, Q.T., Nguyen, T.S., Pham, C.: A multi-modal multi-view dataset for human fall analysis and preliminary investigation on modality. In: 2018 24th International Conference on Pattern Recognition (ICPR), pp. 1947–1952. IEEE (2018) 118. Shahroudy, A., Liu, J., Ng, T.T., Wang, G.: NTU RGB+D: a large scale dataset for 3d human activity analysis. In: ICVPR (2016)
234
M. J. A. Nahian et al.
119. Baldewijns, G., Debard, G., Mertes, G., Vanrumste, B., Croonenborghs, T.: Bridging the gap between real-life data and simulated data by providing a highly realistic fall dataset for evaluating camera-based fall detection algorithms. Healthcare Technol. Lett. 3(1), 6–11 (2016) 120. Vadivelu, S., Ganesan, S., Murthy, O.V.R., Dhall, A.:Thermal imaging based elderly fall detection. In: Asian Conference on Computer Vision, pp. 541–553. Springer (2016) 121. Kwolek, B., Kepski, M.: Human fall detection on embedded platform using depth maps and wireless accelerometer. Comput. Methods Progr. Biomed. 117(3), 489–501 (2014) 122. Charfi, I., Miteran, J., Dubois, J., Atri, M., Tourki, R.: Optimized spatio-temporal descriptors for real-time fall detection: comparison of support vector machine and adaboost-based classification. J. Electr. Imaging 22(4), 041106 (2013) 123. Zhang, Z., Liu, W., Metsis, V., Athitsos, V.: A viewpoint-independent statistical method for fall detection. In: Proceedings of the 21st International Conference on Pattern Recognition (ICPR2012), pp. 3626–3630. IEEE (2012) 124. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: Processings IEEE International Conferences Computer Vision (2011) 125. Auvinet, E., Multon, F., Saint-Arnaud, A., Rousseau, J., Meunier, J.: Fall detection with multiple cameras: an occlusion-resistant method based on 3-d silhouette vertical distribution. IEEE Trans. Inform. Technol. Biomed. 15(2), 290–300 (2010) 126. Hnat, T.W., Srinivasan, V., Lu, J., Sookoor, T.I., Dawson, R., Stankovic, J., Whitehouse, K.: The hitchhiker’s guide to successful residential sensing deployments. In: Proceedings of the 9th ACM Conference on Embedded Networked Sensor Systems, pp. 232–245 (2011) 127. Ozcan, K., Velipasalar, S., Varshney, P.K.: Autonomous fall detection with wearable cameras by using relative entropy distance measure. IEEE Trans. Human Machine Syst. 47(1), 31–39 (2016) 128. Zhang, Z., Conly, C., Athitsos, V.: A survey on vision-based fall detection. In: Proceedings of the 8th ACM international conference on PErvasive technologies related to assistive environments, pp. 1–7 (2015) 129. Ahad, M.A.R., Antar, A.D., Ahmed, M.: Sensor-based human activity recognition: challenges ahead. In: IoT Sensor-Based Activity Recognition, pp. 175–189. Springer (2020) 130. Kyriacou, E., Christofides, S., Pattichis, C.S.: Erratum to: Xiv mediterranean conference on medical and biological engineering and computing 2016. In: XIV Mediterranean Conference on Medical and Biological Engineering and Computing 2016, pp. E1–E14. Springer (2017) 131. Schwickert, L., Becker, C., Lindemann, U., Maréchal, C., Bourke, A., Chiari, L., Helbostad, J.L., Zijlstra, W., Aminian, K., Todd, C., et al.: Fall detection with body-worn sensors. Zeitschrift für Gerontologie und Geriatrie 46(8), 706–719 (2013) 132. Khan, S.S., Hoey, J.: Review of fall detection techniques: a data availability perspective. Med. Eng. Phys. 39, 12–22 (2017) 133. Basak, P., Sheikh, M.M., Tasin, S.M., Sakib, A.H.M.N., Tapotee, M.I., Baray, S.B., Ahad, M.A.R.: Complex Nurse Care Activity Recognition Using Statistical Features, 2020 ACM International Symposium on Wearable Computers (UbiComp/ISWC’20Adjunct), Mexico (2020) 134. Rahman, A., Nahid, N., Hasan, I., Ahad, M.A.R.: Nurse Care Activity Recognition: Using Random Forest to Handle Imbalanced Class Problem, 2020 ACM International Symposium on Wearable Computers (UbiComp/ISWC’20Adjunct), Mexico (2020) 135. 
Faisal M.A.A., Siraj, M.S., Abdullah, M.T., Shahid, O., Abir, F.F., Ahad, M.A.R.: A Pragmatic Signal Processing Approach for Nurse Care Activity Recognition using Classical Machine Learning, 2020 ACM International Symposium on Wearable Computers (UbiComp/ISWC’20Adjunct), Mexico (2020) 136. Islam M.S., Hossain, T., Ahad, M.A.R., Inoue S.: Exploring Human Activity by Using eSense Earable Device, 2nd Int. Conf. on Activity and Bheavior Computing (ABC), Japan (2020) 137. Koshmak, G., Loutfi, A., Linden, M.: Challenges and issues in multisensor fusion approach for fall detection. J. Sens. 2016 (2016)
8 Contactless Fall Detection for the Elderly
235
138. Fioranelli, F., Ritchie, M., Griffiths, H.: Multistatic human micro-doppler classification of armed/unarmed personnel. IET Radar Sonar Navigat. 9(7), 857–865 (2015) 139. Fioranelli, F., Ritchie, M., Griffiths, H.: Aspect angle dependence and multistatic data fusion for micro-doppler classification of armed/unarmed personnel. IET Radar Sonar Navigat. 9(9), 1231–1239 (2015) 140. Fioranelli, F., Ritchie, M., Griffiths, H.: Classification of unarmed/armed personnel using the netrad multistatic radar for micro-doppler and singular value decomposition features. IEEE Geosci. Remote Sens. Lett. 12(9), 1933–1937 (2015) 141. Fioranelli, F., Ritchie, M., Griffiths, H.: Centroid features for classification of armed/unarmed multiple personnel using multistatic human micro-doppler. IET Radar Sonar Navigat. 10(9), 1702–1710 (2016) 142. Fioranelli, F., Ritchie, M., Griffiths, H.: Analysis of polarimetric multistatic human microdoppler classification of armed/unarmed personnel. In: 2015 IEEE Radar Conference (RadarCon), pp. 0432–0437. IEEE (2015)
Chapter 9
Contactless Human Emotion Analysis Across Different Modalities
Nazmun Nahid, Arafat Rahman, and Md Atiqur Rahman Ahad
Abstract Emotion recognition and analysis is an essential part of affective computing, which now plays a vital role in healthcare, security systems, education, and other domains. Numerous scientific studies have been conducted to develop strategies, drawing on methods from different areas, to identify human emotions automatically. Different types of emotions are distinguished by combining data from facial expressions, speech, and gestures. Physiological signals, e.g., EEG (electroencephalogram), EMG (electromyogram), EOG (electrooculogram), and blood volume pulse, also provide information on emotions. The main aim of this chapter is to identify various emotion recognition techniques, point to relevant benchmark datasets, and specify algorithms with state-of-the-art results. We also review multimodal emotion analysis, which deals with fusion techniques for the available emotion recognition modalities. The existing literature shows that emotion recognition works best and gives satisfactory accuracy when it uses multiple modalities in context. Finally, a survey of the remaining problems, challenges, and corresponding research opportunities in this field is given.
9.1 Introduction

Emotion is an instinctive or intuitive feeling derived from one's mood, surrounding environment, or relationships with others. They are biological states associated with
the nervous system. According to psychologist Paul Ekman, there are six basic discrete emotions: happiness, sadness, disgust, fear, surprise, and anger [1, 2]. Besides these simple emotions, various complex emotions can be observed in daily life. To express these complex emotions, dimensional models are used, based on factors such as valence, arousal, and intensity. Wilhelm Max Wundt, the pioneer of modern psychology, described emotions using a model of three dimensions: valence, arousal, and intensity. This model can be reduced to a two-dimensional model in which valence represents the positive-to-negative axis and arousal represents the active-to-passive scale. Happiness with excitement, for example, can be regarded as a pleasant state with high activity and is therefore expressed with high arousal and high valence in this model [3]. Emotions such as "pleasurable versus unpleasurable", "arousing or subduing", and "strain or relaxation" can thus be classified using this model [4].

The physiological and psychological state of an individual is influenced by emotions, so emotion recognition can be applied in various aspects of daily life, such as safe driving [5], health care focusing on mental health [6], social security [7], and so on. Emotions are typically communicated through different channels such as facial reactions, language, conduct, body signals, posture, and so on. Additionally, physiological processes such as respiration, pulse, temperature, skin conductivity, and muscle and neural membrane potentials can also carry information about emotions.

Based on the acquisition method, emotion recognition techniques can be grouped into two major classes. One is the contactless approach, in which human physical signs are observed without any physical intervention, for example through audio-visual signals; the other uses body-contact methods to observe changes in physiological signs. The right method depends on the application area as well as on the emotions to be analyzed. In many cases, however, body-contact emotion recognition methods fail to obtain the desired result because of inadequate information and the susceptibility of bio-signals to noise. Multimodal systems, combining both contactless and body-contact methods, are therefore used to compensate for low accuracy rates for certain emotions. Table 9.1 shows an overview of common methods for emotion recognition.

Although several surveys have previously been conducted on individual emotion recognition modalities, very few of them dealt with the combination of different modalities or with multimodal emotion recognition from the perspective of contact versus contactless methods. A comprehensive review of existing benchmark datasets, methods, results, and challenges is lacking in this research area. Marechal et al. [8] reviewed existing projects and methods for detecting emotion in text, sound, image, video, and physiological signals. Sebe et al. [9] presented existing methods for facial, voice, and physiological emotion recognition, each utilized independently. Mukeshimana et al. [10] provided a brief survey of Human-Computer Interaction research using multimodal emotion recognition. Xu et al. [11] surveyed EEG-based Brain-Computer Interaction for general and learning-related emotion recognition. Corneanu et al. [12] gave an outline of automatic multimodal facial expression analysis with RGB, 3D, and thermal approaches. Samadiani et al.
[13] gave a review of multimodal facial expression recognition systems combining three categories of sensors to improve accuracy. The categories are detailed-face sensors (for example, eye trackers), non-visual sensors (for example, EEG, audio, and depth sensors), and target-focused sensors (for example, infrared thermal sensors). Shu et al. [14] discussed various emotion recognition models based on physiological signals, along with published datasets, features, and classifiers. Ko [15] gave a brief survey of Facial Emotion Recognition (FER) techniques, covering conventional FER approaches and deep learning-based FER approaches. Sailunaz et al. [16] wrote a review article focusing on emotion detection models, datasets, and techniques using text and speech. For micro-expression spotting and recognition, Oh et al. [17] reviewed state-of-the-art databases and algorithms.

Table 9.1 Overview of the benefits and limitations of common modalities: facial emotion recognition, speech emotion recognition, electrocardiography, electroencephalography, electromyography, electrodermal activity, respiration, skin temperature, heart rate variability

Facial Emotion Recognition (FER). Benefits: contactless; multi-person tracking. Limitations: affected by appearance and environment; does not allow free movement.
Speech Emotion Recognition (SER). Benefits: contactless; unobtrusive; casual. Limitations: requires communication; prone to background noise.
Electrocardiography (ECG). Benefits: contact; measurement with a smart watch possible. Limitations: prone to movement artifacts.
Electroencephalography (EEG). Benefits: contact; measurement of impaired patients. Limitations: complex installation; high maintenance; lab conditions.
Electromyography (EMG). Benefits: contact; measurement of impaired patients. Limitations: measures only valence; difficult setup; lab conditions.
Electrodermal Activity (EDA). Benefits: contact; good indicator for stress; distinguishes conflict from no conflict. Limitations: measures only arousal; influenced by temperature and sweat.
Respiration (RSP). Benefits: both contact and contactless; simple setup; indicates panic, fear, depression. Limitations: difficult to distinguish a broader emotional spectrum; only used in multimodal setups.
Skin Temperature (SKT). Benefits: both contact and contactless; versatile data acquisition (IR, video, sensors). Limitations: measures only arousal; slow-reacting bio-signal; influenced by external temperature.
Heart Rate Variability (HRV). Benefits: both contact and contactless; versatile data acquisition; more comfortable; cheap. Limitations: highly dependent on sensor position.

Emotion recognition is a very challenging problem to solve. Even humans sometimes fail to recognize emotion. Previous studies show that the same emotion can be interpreted differently by different persons, and the task is even more difficult for AI. There are many reasons that make emotion recognition a complex problem. Figure 9.1 shows the flow diagram for the general emotion recognition pipeline. Conventional machine learning and deep learning methods are the two main groups of methods applied to emotion recognition problems. For both of these
Fig. 9.1 Basic flow diagram of the general emotion recognition pipeline
methods, emotion recognition systems need huge datasets. These data can come from different modalities and environments. Video is one modality, and it can have various backgrounds, frame rates, and angles. Audio is another modality, and it can have various pitches, echoes, and noises from the surrounding environment. Data can also come from different nationalities, genders, and races. However, most public datasets are not adequate with respect to race and gender diversity and contain a very small number of emotions. Moreover, the emotions in these datasets are not natural responses; rather, they are stimulated or acted. Another problem is that widely used algorithms are created for high-intensity emotions rather than low-intensity emotions, which gives inaccurate results for people who suppress their emotions.

Building an emotion recognition system that is unbiased towards race and culture is therefore a big challenge. For example, in 2015, Google Photo algorithms failed to recognize people with dark skin. The root of the problem is less diverse data and a biased research team: the research team consists of the people who build the dataset and also test it, so during data collection they focus on a specific race or group and then test the solutions on themselves. In spite of feeling the same emotions, people in different parts of the world express emotions differently. Western people show seven basic emotions, and they express each of them with a unique combination of body movement, facial movement, and speech. However, Japanese and Chinese people are very reticent; they express their emotions with eye activity and a low voice. Emotional expression also alters with gender and age. Researchers have studied signs of the seven key emotions thoroughly, but there are large contradictions regarding the meaning of expressions. For instance, many studies considered a scowl an angry expression, but another study shows that 70% of the time a scowl represents other emotions.

The emotion recognition problem is thus as much a problem of psychology as of programming. One has to be careful about the small number of indicators, which can increase false-positive results. Training data should be collected bearing these things in mind, and a dataset should be chosen that is diverse, with various subjects. Another issue is the data acquisition system, which is very important
regarding the real-life applications of the emotion recognition system. As previously mentioned, one type of data acquisition system is contact based. These systems are not appropriate for real-life situations, as they are uncomfortable and not user-friendly. Moreover, they cannot produce satisfactory results individually, without being combined with other modalities. This survey therefore focuses on contactless methods only.

The main purpose of this survey is to give a brief overall review of different emotion recognition systems, including various datasets and procedures. For that purpose, appropriate emotion recognition modalities are first identified, and then previous notable works and research gaps are presented. The rest of the chapter is organized as follows. Section 9.2 gives an overview of datasets and notable works for various emotion recognition modalities; it is divided into several subsections. Section 9.2.1 contains a brief description of conventional and new Facial Emotion Recognition (FER) techniques, state-of-the-art datasets, and the accuracies of different algorithms applied to them. Similarly, Sects. 9.2.2, 9.2.3, and 9.2.4 describe the datasets, methods, and results of Speech Emotion Recognition (SER), physiological-signal-based emotion recognition, and multimodal emotion analysis, respectively. Finally, Sects. 9.3 and 9.4 give the challenges and the conclusion for emotion recognition in different modalities, respectively.
9.2 Modalities Used for Emotion Analysis

In this section, different emotion recognition modalities are discussed along with their state-of-the-art datasets, algorithms, methods, and results. Both conventional and newly proposed algorithms are discussed, and a comparison of different algorithms using accuracy and precision is shown. This section therefore gives an overall picture of emotion recognition across different modalities.
9.2.1 Facial Emotion Recognition (FER)

Among all the methods, facial expressions are the most widely used for emotion analysis. Facial expressions incorporate anger, fear, disgust, smiling, sadness, and surprise. A smile represents happiness and is communicated with bent, crescent-shaped eyes. Sorrow or sadness is generally expressed by raised, slanted eyebrows and a frown. Anger appears as pressed eyebrows and slim, extended eyelids. Pulled-down eyebrows and a wrinkled nose represent disgust. Surprise and shock are communicated by widened eyes and an open mouth. Fear is expressed by raised, skewed eyebrows. Various other complex emotions can also be identified by analyzing skin texture and the movement of other facial muscles.

Normally, three types of signals are seen in the face: slow, rapid, and static signals. The static signs include skin shading, face color, oily deposits, face contours, bone
Fig. 9.2 An overview of facial emotion recognition systems
formation, and the area and size of facial features, e.g., eyebrows, mouth, eyes, and nose. The slow signs are changes that happen gradually over time, such as permanent wrinkles; these incorporate changes in muscle tone and skin texture. The rapid signals contain facial muscle movement, ephemeral changes in the facial expression, temporary wrinkles, and changes in the area and shape of facial features. These movements stay on the face for only a few moments.

According to image-based feature representation, FER systems can be separated into two classes: static image FER and dynamic sequence FER. In the static class, the features are extracted from one single picture, while dynamics-based techniques consider contiguous frames. Automatic FER approaches can be divided into two classes based on feature extraction and classification strategies. One is the classical or conventional FER approach, which comprises three major steps:
(1) Face, facial landmark, and component detection
(2) Feature extraction
(3) Classification of expression
In this procedure, the face is first identified in a picture with the help of various facial landmarks and components, e.g., eyes, nose, and ears. Then various features are extracted from the important facial landmark positions and components. Finally, using the extracted features as input data, classification algorithms such as AdaBoost, support vector machine (SVM), and random forest recognize the emotions (Fig. 9.2).

The main difference between deep learning-based methods and conventional machine learning is that deep learning can automatically learn the features from raw input data and produce the intended results. This is also known as "end-to-end" learning. This strategy does not require any prior knowledge or domain expertise.
9.2.1.1 Conventional FER System
This process includes three major steps: pre-processing of the face image, feature extraction from the pre-processed data, and lastly classification. Each of these techniques is discussed in detail below.

(a) Pre-processing. To effectively recognize emotion, the face image should be clear, appropriately scaled to the right size, and of appropriate brightness. Pre-processing is done to fulfill all of these conditions. To bring the image to the right size, cropping and scaling are applied. For this, a midpoint must first be selected and cropping done with that point as reference; the nose is usually selected, with other important parts also considered. There are many methods for size reduction, and Bessel downsampling is one of them; it is usually chosen because it can preserve image quality. The Gaussian filter is another method for resizing the image, usually used because it can smooth the picture. There may be a lot of illumination variation across different parts of the face image. To obtain uniform illumination in all parts, normalization is applied, and median filters are generally used for this purpose. Median filters are also used to find the eye position. Eye position measurement is an important step towards building a FER system that is indifferent to personality differences. Face detection is a mandatory step before recognizing emotions. The Viola-Jones algorithm is a widely used algorithm for face detection, and the AdaBoost algorithm can detect the size and location of the face simultaneously. After face detection comes another pre-processing step known as face alignment; Scale Invariant Feature Transform (SIFT) is a popular algorithm for face alignment. If we can localize or segment the important parts of the face that are responsible for different emotions, we can build an effective FER system; the Region of Interest (ROI) approach is very appropriate for this purpose. The histogram equalization method is also good for reducing illumination variance across different parts of the picture, and it improves classification accuracy.

(b) Feature extraction. A facial image contains a lot of information, much of it unnecessary and redundant. Feature extraction algorithms isolate the most important points and features of an image to distinguish between different emotions. Many feature extraction algorithms have been developed over the years, but the most important groups of feature extraction methods are: global and local methods, geometric methods, texture-based methods, edge-based methods, and patch-based methods. Some of the texture-based methods are Weber Local Descriptor (WLD), Weighted Projection-based LBP (WPLBP), Local Directional Number (LDN), Gabor filter, Local Binary Pattern (LBP), Vertical Time Backward (VTB), Local Directional Ternary Pattern (LDTP), KL-transform Extended LBP (KELBP), Discrete Wavelet Transform (DWT), etc. Histogram of Oriented Gradients (HOG), Line Edge Map (LEM), Graphics-processing-unit-based Active Shape Model (GASM), etc. are some of the edge-based methods. Independent Component Analysis (ICA), Principal Component Analysis (PCA), Stepwise Linear Discriminant Analysis (SWLDA), etc. are
some widely used global and local methods. Local Curvelet Transform (LCT), steerable pyramid representation, etc. use different geometric components to extract features. Some important features can be identified from different facial movements, and patch-based methods are mainly used for this reason. Patch-based methods usually work in two steps: the first is patch extraction and the second is patch matching. The patches are extracted and then translated to distance characteristics to perform patch matching.

(c) Classification. After feature extraction, the extracted feature sequences go through classification algorithms, which finally give the decision about the emotion. Some of the classification algorithms are KNN (K-Nearest Neighbors), HMM (Hidden Markov Model), HCRF (Hidden Conditional Random Field), pairwise classifiers, CART (Classification and Regression Tree), the ID3 (Iterative Dichotomiser 3) decision tree, LVQ (Learning Vector Quantization), SVM (Support Vector Machine), Euclidean distance, MDC, the Chi-square test, Fisher discrimination dictionary, etc. Conventional machine learning methods can perform well even with low computing power and memory; for that reason, classical algorithms are still used [18].
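To make the three conventional FER steps concrete, the following is a minimal sketch. It assumes OpenCV, scikit-image, and scikit-learn are installed and that grayscale images and emotion labels are already loaded in memory (the `images` and `labels` variables are hypothetical placeholders); the Haar-cascade detector, the HOG descriptor, and the SVM settings are illustrative stand-ins for the detectors, features, and classifiers listed above, not the exact configuration of any cited work.

```python
import cv2
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

# Step 1: face detection with the Viola-Jones (Haar cascade) detector shipped with OpenCV.
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_features(gray_img, size=(48, 48)):
    """Detect the largest face, normalize illumination, and return HOG features."""
    faces = detector.detectMultiScale(gray_img, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep the largest detection
    face = cv2.resize(gray_img[y:y + h, x:x + w], size)  # crop and rescale
    face = cv2.equalizeHist(face)                        # histogram equalization
    # Step 2: hand-crafted (edge-based) feature extraction with HOG.
    return hog(face, orientations=8, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

# Step 3: classification of the extracted features with an SVM.
# `images` is a list of grayscale face images and `labels` the emotion labels
# (both hypothetical placeholders for a real dataset).
def train(images, labels):
    feats, kept = [], []
    for img, lab in zip(images, labels):
        f = extract_face_features(img)
        if f is not None:
            feats.append(f)
            kept.append(lab)
    clf = SVC(kernel="rbf", C=10.0)
    clf.fit(np.array(feats), np.array(kept))
    return clf
```

An LBP or Gabor descriptor could be substituted for HOG at the same point in the pipeline, and the SVM could be replaced by KNN, a random forest, or AdaBoost without changing the surrounding steps.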
9.2.1.2 Deep Learning-Based FER System
The main difference between the deep learning-based approach and the conventional approach is that deep learning-based approaches use deep neural networks to automatically learn features from raw input data instead of hand-crafted feature extraction. They do not require domain knowledge, which makes them easy to apply in many scenarios. This approach also includes pre-processing, as in conventional FER, which is described in the following paragraph.

(a) Pre-processing. Pre-processing of the face image is extremely important for good accuracy before it goes through the deep neural network. Among the many techniques, alignment and normalization are two important pre-processing steps. The Active Appearance Model (AAM) is a popular model for facial landmark detection and alignment; it requires some parameters that are extracted from global shape patterns and the holistic facial appearance [25]. Various important parts of the image are separated, and the discriminative response map fitting (DRMF) [27] and mixtures of trees (MoT) [26] structured models are utilized to determine whether the desired parts exist there. The supervised descent method (SDM) [28], face alignment [29], and incremental face alignment [30] use cascaded regression. Recently, some deep neural networks have become very popular, namely the Tasks-Constrained Deep Convolutional Network (TCDCN) [31] and the Multi-task CNN (MTCNN) [32]. Cascaded regression has gained much popularity for face alignment since the advent of deep learning and is producing good results.

Overfitting is one of the biggest concerns with deep learning algorithms. To reduce overfitting, data augmentation is introduced. Data augmentation can be done before training or during training. Random cropping
from the corners and horizontal flipping are two augmentations that are done during training. The augmentations that are done before training are called offline data augmentation; some widely used ones are scaling, noise, contrast, rotation, shifting, skew, and color jittering. Deep learning algorithms such as 3D CNNs (Convolutional Neural Networks) and GANs (Generative Adversarial Networks) can also perform data augmentation. Illumination and head pose are important factors to consider, as they produce large changes in images. To reduce these problems, two typical face normalization methods are used: illumination normalization and pose normalization (frontalization). Homomorphic-filtering-based normalization, Difference of Gaussian (DoG), isotropic diffusion (IS), discrete cosine transform (DCT), etc. are some widely acclaimed algorithms. Among them, homomorphic-filtering-based normalization yields the most consistent results, and the combination of illumination normalization and homomorphic filtering has achieved great results. There are, however, some issues with directly applying histogram equalization, as it gives unnecessary importance to local contrast; the authors of [33] proposed a weighted summation approach to tackle this problem. The authors of [34] showed that three methods, local normalization, global contrast normalization (GCN), and histogram equalization, give satisfactory results. Another problem in this area is pose variation in the wild: in an uncontrolled natural setup, large pose variation is very common, and various pose normalization techniques have been developed to produce a frontal facial view.

(b) Feature extraction and classification. Unlike conventional machine learning algorithms, deep learning-based approaches exploit deep neural networks to extract the desired features directly from the data. Commonly used deep learning-based feature extraction techniques are mainly the Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), Deep Belief Network (DBN), Deep Autoencoder (DAE), and Generative Adversarial Network (GAN). For CNNs, AlexNet [21], VGGNet [22], GoogleNet [23], and ResNet [24] are some commonly used pre-trained networks in FER. In classical methods, feature extraction and classification are done separately, but in deep learning they are done simultaneously, which is why this method is called "end-to-end". At first, the weights of the network are initialized randomly. Data propagates through the network and gives a classification result with a very large error. At the last stage of the deep neural network, a loss layer is added which measures the classification error, and this error is backpropagated through the network to adjust the weights using different optimization methods. Then, with appropriate weights, the network produces the correct result, as sketched in the example below. Another approach is to use deep neural networks to extract features and feed the extracted feature sequence to conventional algorithms such as random forests or a support vector machine (SVM) to classify emotions.

In Table 9.2, some notable FER databases are listed along with information on their size, type, emotions, and some eminent works on them.
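Illustrating the augmentation and end-to-end fine-tuning described above is this minimal sketch. It assumes a recent PyTorch/torchvision installation, an ImageFolder-style directory of facial-expression crops (the path "fer_train/" is a hypothetical placeholder), and seven expression classes; the pre-trained ResNet-18 backbone, the transform settings, and the optimizer are illustrative choices rather than the configuration of any work cited in this section.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Data augmentation: random cropping, horizontal flipping, color jittering.
train_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Hypothetical dataset folder with one sub-directory per expression class.
train_set = datasets.ImageFolder("fer_train/", transform=train_tf)
loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

# Pre-trained CNN backbone; the final layer is replaced for 7 expression classes.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 7)

criterion = nn.CrossEntropyLoss()  # loss layer measuring the classification error
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# End-to-end training: forward pass, loss, backpropagation, weight update.
for epoch in range(10):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

Alternatively, as noted above, the activations of the penultimate layer of such a network can be extracted and passed to a conventional classifier such as an SVM or a random forest.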
Table 9.2 List of notable datasets for facial emotion recognition

MMI [93, 94]. Size: still images with a resolution of 720 × 576 and 2900 videos of 75 subjects; 238 video sequences of 28 male and female subjects. Type: posed, taken inside a lab. Emotions: 6 basic expressions plus neutral. Authors: Liu et al. 13 [40], Liu et al. 15 [41], Mollahosseini et al. 16 [64], Liu et al. 17 [47], Li et al. 17 [65], Yang et al. 18 [48], Poursaberi et al. 12 [53], Ji and Idrissi 12 [54], Siddiqi et al. 15 [60], Kumar et al. 16 [63].

CK [91] and CK+ [92]. Size: CK: 210 adults between the ages of 18 and 50 years; CK+: still images of 123 subjects with resolutions of 640 × 480 and 640 × 490. Type: posed and stimulated, taken inside a lab. Emotions: 6 basic expressions plus contempt and neutral. Authors: Ouellet 14 [37], Li et al. 15 [38], Liu et al. 14 [39], Liu et al. 13 [40], Liu et al. 15 [41], Khorrami et al. 15 [42], Ding et al. 17 [43], Zeng et al. 18 [44], Cai et al. 17 [45], Meng et al. 17 [46], Liu et al. 17 [47], Yang et al. 18 [48], Zhang et al. 18 [49], Zhao and Pietikinen 09 [50], Song et al. 10 [51], Zhang et al. 11 [52], Poursaberi et al. 12 [53], Ji and Idrissi 12 [54], Ucar et al. 14 [55], Zhang et al. 14 [56], Mahersia and Hamrouni 15 [57], Happy et al. 15 [58], Biswas 15 [59], Siddiqi et al. 15 [60], Cossetin et al. 16 [61], Salmam et al. 16 [62], Kumar et al. 16 [63].

TFD [95]. Size: 112,234 images. Type: posed, taken inside a lab. Emotions: 6 basic expressions plus neutral. Authors: Reed et al. 14 [66], Devries et al. 14 [67], Khorrami et al. 15 [42], Ding et al. 17 [43].

FER-2013 [96]. Size: 35,887 images. Type: posed and stimulated, taken from the web. Emotions: 6 basic expressions plus neutral. Authors: Tang 13 [68], Devries et al. 14 [67], Zhang et al. 15 [69], Guo et al. 16 [70], Kim et al. 16 [71], Pramerdorfer et al. 16 [72].

SFEW 2.0 [98]. Size: 1766 images. Type: posed and stimulated, taken from movies. Emotions: 6 basic expressions plus neutral. Authors: Levi et al. 15 [80], Ng et al. 15 [81], Li et al. 17 [65], Ding et al. 17 [43], Liu et al. 17 [47], Cai et al. 17 [45], Meng et al. 17 [46], Kim et al. 15 [82], Yu et al. 15 [83].

JAFFE [97]. Size: 213 images of ten different female Japanese models with an image resolution of 256 × 256. Type: posed, taken inside a lab. Emotions: 6 basic expressions plus neutral. Authors: Liu et al. 14 [39], Hamester et al. 15 [73], Noh et al. 07 [74], Bashyal et al. 08 [75], Song et al. 10 [51], Wang et al. 10 [76], Zhang et al. 11 [52], Poursaberi et al. 12 [53], Owusu et al. 14 [77], Ucar et al. 14 [55], Zhang et al. 14 [56], Dahmane and Meunier 14 [78], Mahersia and Hamrouni 15 [57], Happy et al. 15 [58], Biswas 15 [59], Siddiqi et al. 15 [60], Cossetin et al. 16 [61], Salmam et al. 16 [62], Kumar et al. 16 [63], Hegde et al. 16 [79].

BU-3DFE [99]. Size: 3D facial emotions of 100 subjects, image resolution of 1040 × 1329. Type: posed, taken inside a lab. Emotions: 6 basic expressions plus neutral. Authors: Mandal et al. [84], Mohammed et al. [85], Lee et al. [86], Zheng [87].

DISFA [100]. Size: 27 subjects participated in 130,000 videos, image resolution of 1024 × 768. Type: stimulated, taken inside a lab. Emotions: 12 AUs. Authors: EmotioNet [88], 3D Inception-ResNet [89], DRML [90].
The methods applied to these datasets are discussed in the following paragraphs. MMI: Liu et al. 13 [40] used V&J for pre-processing and CNN, RBM and Cascaded Network, and an additional classifier SVM and achieved 74.76% accuracy. Liu et al. 15 [41] used V&J for pre-processing and CNN, RBM and Cascaded Network, and an additional classifier SVM and achieved 75.85% accuracy. Mollahosseini et al. 16 [64] used IntraFace for pre-processing and CNN (Inception) and achieved 77.9% accuracy. Liu et al. 17 [47] used IntraFace for pre-processing and CNN and achieved 78.53% accuracy. Li et al. 17 [65] used IntraFace for pre-processing and CNN and an additional network SVM and achieved 78.46% accuracy. Yang et al. 18 [48] used MoT 3 for pre-processing and GAN (cGAN) and achieved 73.23% accuracy. Poursaberi et al. 12 [53] used GL Wavelet and KNN and achieved 91.9% accuracy. Ji and Idrissi [54] used LBP, VTB, Moments, and SVM and achieved 95.84% accuracy. Siddiqi et al. 15 [60] used SWLDA and HCRF and achieved 96.37% accuracy. Kumar et al. 16 [63] used WPLBP and SVM and achieved 98.15% accuracy. CK and CK+: Ouellet 14 [37] used CNN (AlexNet) and an additional classifier SVM and achieved 94.4% accuracy. Li et al. 15 [38] used RBM 4—V&J and achieved 96.8% accuracy. Liu et al. 14 [39] used DBN and CN and an additional network AdaBoost and achieved 96.7% accuracy. Liu et al. 13 [40] used V&J for pre-processing and CNN, RBM, and CN and achieved 92.05% accuracy. Liu et al. 15 [41] used V&J for pre-processing and CNN, RBM, and CN and achieved 93.70% accuracy. Khorrami et al. 15 [42] used zero-bias CNN and achieved 6 classes: 95.7% and 8 classes: 95.1% accuracy. Ding et al. 17 [43] used IntraFace for pre-processing and CNN and achieved 98.6% accuracy in 6 classes and 96.8% accuracy in 8 classes. Zeng et al. 18 [44] used AAM for pre-processing and DAE (DSAE) and achieved 95.79% accuracy. Cai et al. 17 [45] used DRMF for pre-processing and CNN and achieved 94.39% accuracy. Meng et al. 17 [46] used DRMF for pre-processing and CNN and MN and achieved 95.37% accuracy. Liu et al. 17 [47] used IntraFace for pre-processing and CNN and achieved 97.1% accuracy. Yang et al. 18 [48] used MoT 3 for pre-processing and GAN (cGAN) and achieved 97.30% accuracy. Zhang et al. 18 [49] used CNN and MN and achieved 98.9% accuracy. Song et al. 10 [51] used LBP-TOP and SVM and achieved 86.85% accuracy. Zhang et al. 11 [52] used Patch-based SVM and achieved 82.5% accuracy. Poursaberi et al. 12 [53] used GL Wavelet and KNN and achieved 91.9% accuracy. Ucar et al. 14 [55] used LCT and OSLEM and achieved 94.41% accuracy. Zhang et al. 14 [56] used GF and SVM and achieved 82.5% accuracy. Mahersia and Hamrouni 15 [57] used the steerable pyramid and Bayesian NN and achieved 95.73% accuracy. Happy et al. 15 [58] used LBP and SVM and achieved 93.3% accuracy. Biswas 15 [59] used DCT and SVM and achieved 98.63% accuracy. Siddiqi et al. 15 [60] used SWLDA and HCRF and achieved 96.37% accuracy. Cossetin et al. 16 [61] used LBP and WLD, the Pairwise classifier, and achieved 98.91% accuracy. Salmam et al. 16 [62] used SDM and CART and achieved 89.9% accuracy. Kumar et al. 16 [63] used WPLBP and SVM and achieved 98.15% accuracy. TFD: Reed et al. 14 [66] used RBM, MN and an additional SVM and achieved 85.43% accuracy. Devries et al. 14 [67] used MoT 3 for pre-processing and CNN
Fig. 9.3 Average accuracy comparison of the notable FER systems in the mentioned datasets
and MN and achieved 85.13% accuracy. Khorrami et al. 15 [42] used zero-bias CNN and achieved 88.6% accuracy. Ding et al. 17 [43] used IntraFace for pre-processing and CNN and achieved 88.9% accuracy. Figure 9.3 shows the average accuracy comparison of the notable FER systems in the mentioned datasets. FER-2013: Tang 13 [68] used CNN and achieved 71.2% accuracy. Devries et al. 14 [67] used MoT 3 for pre-processing and CNN and MN and achieved 67.21% accuracy. Zhang et al. 15 [69] used SDM for pre-processing and CNN and MN and achieved 75.10% accuracy. Guo et al. 16 [70] used SDM for pre-processing and CNN and achieved 71.33% accuracy. Kim et al. 16 [71] used IntraFace for pre-processing and CNN and achieved 73.73% accuracy. Pramerdorfer et al. 16 [72] used CNN and achieved 75.2% accuracy. JAFFE: Liu et al. 14 [39] used DBN and CN and an additional classifier AdaBoost and achieved 91.8% accuracy. Hamester et al. 15 [73] used CNN and achieved 95.8% accuracy. Noh et al. 07 [74] used an action-based ID3 decision tree and achieved 75% accuracy. Bashyal et al. 08 [75] used GF and LVQ and achieved 88.86% accuracy. Song et al. 10 [51] used LBP-TOP and SVM and achieved 86.85% accuracy. Wang et al. 10 [76] used SVM and achieved 87.5% accuracy. Zhang et al. 11 [52] used patch-based SVM and achieved 82.5% accuracy. Poursaberi et al. 12 [53] used GL Wavelet and KNN and achieved 91.9% accuracy. Owusu et al. 14 [77] used GF and MFFNN and achieved 94.16% accuracy. Ucar et al. 14 [55] used LCT and OSLEM and achieved 94.41% accuracy. Zhang et al. 14 [56] used GF and SVM and achieved 82.5% accuracy. Dahmane and Meunier 14 [78] used HOG and SVM and achieved 85% accuracy. Mahersia and Hamrouni 15 [57] used the steerable pyramid and Bayesian NN and achieved 95.73% accuracy. Happy et al. 15 [58] used LBP and SVM and achieved 93.3% accuracy. Biswas 15 [59] used DCT and SVM and achieved 98.63% accuracy. Siddiqi et al. 15 [60] used SWLDA and HCRF and achieved 96.37% accuracy. Cossetin et al. 16 [61] used LBP and WLD, Pairwise classifier, and achieved 98.91% accuracy. Salmam et al. 16 [62] used SDM and CART and achieved 89.9% accuracy. Kumar et al. 16 [63] used WPLBP and SVM
and achieved 98.15% accuracy. Hegde et al. 16 [79] used GF, ED, and SVM and achieved 88.58% accuracy. SFEW 2.0: Levi et al. 15 [80] used MoT 3 for pre-processing and CNN and achieved 54.56% accuracy. Ng et al. 15 [81] used IntraFace for pre-processing and CNN and achieved 55.6% accuracy. Li et al. 17 [65] used IntraFace for pre-processing and CNN and an additional SVM and achieved 51.05% accuracy. Ding et al. 17 [43] used IntraFace for pre-processing and CNN and achieved 55.15% accuracy. Liu et al. 17 [47] used IntraFace for pre-processing and CNN and achieved 54.19% accuracy. Cai et al. 17 [45] used DRMF for pre-processing and CNN and achieved 59.41% accuracy. Meng et al. 17 [46] used DRMF for pre-processing and CNN and MN and achieved 54.30 % accuracy. Kim et al. 15 [82] used CNN and achieved 61.6% accuracy. Yu et al. 15 [83] used CNN and achieved 61.29% accuracy. BU-3DFE: Mandal et al. [84] and Mohammed et al. [85] used the feature extraction technique based on curvelet, Patched geodesic texture technique, and gradient feature matching and achieved 99.52% accuracy. Lee et al. [86] used Sparse Representation and achieved 87.85% accuracy. Zheng [87] used ALM and Group sparse reducedrank regression (GSRRR) and achieved 66.0% accuracy. DISFA: EmotioNet [88] first detected the landmarks using various algorithms. Then they utilized Euclidean distances, angles between them. They also used Kernel subclass discriminant analysis, and Gabor filters centered at the landmark points. DRML [90] used a simple feed-forward network to learn the important features of the face. 3D Inception-ResNet [89] used the LSTM network to capture temporal features along with spatial features and used those features to classify emotions.
9.2.2 Speech Emotion Recognition (SER)

A speech emotion recognition (SER) system is based on methods that process and classify speech signals to recognize emotions; such systems have been around for over two decades. A bird's eye view of several distinct areas of SER is shown in Fig. 9.4. The first step towards building a good speech emotion recognition system is to build a good-quality dataset, meaning one that is less noisy, correctly labeled, and has little missing data. Speech emotion recognition datasets can be collected in three ways: acted, induced, and natural. For acted speech databases, professional or semi-professional actors record their voices in quiet studios for certain emotions. Acted speech datasets are comparatively easier to create, but they perform poorly in real-life situations, as some complex emotions cannot be recreated by acting. For induced speech datasets, an emotional situation is simulated to record utterances of speakers; these are comparatively closer to real-life situations. Extemporaneous or spontaneous utterances from real-life situations, such as radio shows, recordings from call centers, and speeches from daily lives,
Fig. 9.4 A bird’s eye view of speech emotion recognition systems
are recorded as natural speech datasets. Like FER, SER includes the same algorithmic steps, and these steps are described in detail in the following paragraphs.

(a) Pre-processing. Pre-processing is the very first step after collecting data. It comprises steps such as framing, windowing, voice activity detection, normalization, noise reduction and filtering, dimensionality reduction, and appropriate feature selection. An instantaneous portion of the speech signal usually does not contain any useful information, so a short interval of 20–30 milliseconds is taken as a window and useful features are extracted from it. Relational information between consecutive windows can be obtained by overlapping 30–50% of these windows. As the speech signal is continuous, some processing techniques such as the Discrete Fourier Transform (DFT) have limitations in these processing steps; for artificial neural networks, fixed-size windowing is appropriate. Voice activity detection is the next step in this algorithmic pipeline. When a person is speaking, three situations can occur: voiced speech, unvoiced speech, and silence. Voice activity detection is the process of distinguishing voiced speech from unvoiced speech and silence; the methods used in this detection process are auto-correlation, zero-crossing, and short-time energy. Generally, there is some difference between the speaker's voice and the recorded voice: the signal amplitude of the speaker's voice may vary on a larger scale than the recorded voice. To reduce this scale difference, normalization is introduced. This step ensures that the training algorithm can progress without being biased towards any specific feature and converge quickly to the desired accuracy. The next step is noise reduction and filtering. For noise reduction, Minimum Mean Square Error (MMSE) and Log-spectral Amplitude MMSE (LogMMSE) are the most frequently used methods [19]; for filtering, different types of digital filters can be used. Finally, for dimensionality reduction, various algorithms
like—Principal Component Analysis (PCA), Autoencoders, Linear Discriminant Analysis (LDA) can be used. (b) Feature extraction. After pre-processing comes the step feature extraction. Though there are many commonly used features, there is no widely acknowledged feature set for classifying standard emotions. There are mainly four feature sets widely used in this field. They are—Prosodic features, Spectral features, Voice Quality features, Teager Energy Operator (TEO) based features. Prosodic features include—pitch, energy, and duration characteristics of the speech signal. TEO features have shown good performance in recognizing stress and anger emotions. Generally, these feature sets are used individually but they can be combined to achieve good accuracy. In Table 9.3, some notable databases are mentioned along with information on their language and access type, size, emotion, and some eminent works on them. The methods applied to these datasets are discussed in the following paragraphs. Berlin Emotional Database (EmoDB): Shen et al. [113] used the SVM classifier in this data set. They used different features like—energy, pitch, and the combination of linear prediction cepstrum coefficients (LPCC) and Mel Frequency cepstrum coefficients (MFCC), and this combination is known as LPCMCC. For the combination of energy and pitch, their system gave 66.02% accuracy and for LPCMCC it gave 70.7% accuracy. Their algorithm achieved the highest accuracy of 82.5% for the total combination of energy, pitch, and LPCMCC. Pan et al. [144] also applied SVM in this dataset using all the previously stated features including Mel Energy Spectrum Dynamic Coefficients (MEDC) features. Among many other combinations, they achieved the highest accuracy of 95.087% using the combination of MFCC, MEDC, and energy features. Xiao et al. [145] presented harmonic and Zipf feature-based two-stage emotion classifier to classify 6 emotions. At stage 1, classifier 1 is used for classifying active and non-active emotion and classifier 2 is used for classifying non-active emotions into median and passive classes. At stage 2, three classifiers are used to classify 6 emotions—anger, gladness, fear, neutral, sadness, boredom taking input from the previous stage. Their method achieved 68.6% accuracy in the EmoDB dataset without gender classification and 71.52% accuracy with gender classification. Albornoz et al. [108] also designed a two-stage method and achieved good results by detecting less number of classes using widely used standard classifiers. Specific emotion class can be detected using specific features and classifiers and that is why their method showed such a good result. They designed two kinds of 2 stage classifier. In variant 1, stage 1, three sets of emotions known as BNS, JAS, and Disgust are classified and then at stage 2, those three sets of emotions are further classified into 7 emotions (Boredom, Neutral, Sadness, Joy, Anger, Fear, Disgust). The same procedure is followed in variant 2 with the only difference in 2 sets of emotions (BNS, JAFD) in stage 1. Several combinations of Hidden Markov Model (HMM), Gaussian Mixture Models (GMM), and Multilayer Perceptron (MLP) are used. After several experiments, it was evident that for stage 1, HMM with 30 Gaussians gave the highest accuracy. At stage 2, Boredom, Neutral, and Sadness can be classified
Table 9.3 List of notable datasets for speech emotion recognition

Database | Language | Access type | Size | Emotions | Authors
EmoDB [101] | German | Open access | 10 speakers × 7 emotions × 10 utterances | Disgust, fear, anger, boredom, happiness, sadness, neutral | Albornoz et al. [108], Bitouk et al. [109], Borchert et al. [110], Mao et al. [111], Schuller et al. [112], Shen et al. [113], Deng et al. [114], Wu et al. [115], Wang et al. [116], Yang et al. [117]
CASIA [102] | Mandarin | Commercially available | 6 emotions × 4 speakers × 500 utterances | Sadness, anger, surprise, happiness, fear, neutral | Wang et al. [116]
DES [103] | Danish | Free | 2 male and 2 female talked for 10 min | Anger, happiness, neutral, surprise, sadness | Mao et al. [111], Ververidis et al. [118], Zhang et al. [119]
EESDB [104] | Mandarin | Free to research use | 400 utterances × 16 speakers (8 male, 8 female) | Fear, happiness, anger, disgust, neutral, sadness, surprise | Wang et al. [116]
EMA [105] | English | Free to research use | 1 male gave 14 sentences and 2 female gave 10 sentences | Sadness, neutral, anger, happiness | Busso et al. [120], Grim et al. [121]
LDC emotional speech database [106] | English | Commercially available | 470 utterances × 7 speakers (4 male, 3 female) | Disgust, fear, hot anger, cold anger, contempt, happiness, despair, elation, sadness, neutral, panic, pride, interest, shame, boredom | Bitouk et al. [109]
FAU Aibo emotion corpus [107] | German | Commercially available | Conversation between robot dog Aibo and 51 children for 9 h | Emphatic, helpless, joyful, motherese, anger, bored, neutral, reprimanding, rest, surprised, touchy | Deng et al. [122], Kwon et al. [123], Lee et al. [124]
with the highest accuracy using the HMM, while the MLP performed better for recognizing joy, anger, and fear. Kim et al. [146] presented a Multi-Task Learning (MTL) method evaluated in within-corpus and cross-corpus setups. With MTL they tackled two tasks: the main task is emotion recognition and the auxiliary task is detecting gender and naturalness. They tested two kinds of MTL, one based on Long Short-Term Memory (LSTM) and the other on a vanilla Deep Neural Network (DNN), and showed that MTL with the gender and naturalness tasks outperformed Single-Task Learning (STL). Chen et al. [147] proposed a 3-D CNN-RNN (convolutional neural network and recurrent neural network) based attention model. A 3-D log Mel spectrogram is passed through a CNN for feature extraction, the resulting feature sequence is passed through an LSTM to capture the temporal structure of the speech signal, and an attention model then focuses on the important parts of the LSTM output. Finally, the attention output is passed through a fully connected layer and a softmax layer to recognize the emotion. This method outperformed a vanilla DNN by 11.26% and a 2-D CNN-RNN by 3.44%. Deng et al. [114] proposed a sparse autoencoder to build an effective feature representation from target data and then transfer the learned features to reconstruct source data; the learned features are fed to an SVM to classify emotions, and experiments showed that this boosts performance significantly. Latif et al. [148] used a Deep Belief Network (DBN) because of its high generalization ability and satisfactory performance; built from universal approximators as basic blocks, their method produced better results than the sparse autoencoder and SVM. Mao et al. [111] used features learned automatically by a CNN together with a CNN classifier to achieve 85.2% accuracy. Schuller et al. [112] used ZCR raw contours, pitch, the first seven formants, strength, spectral growth, HNR, and linguistic features, obtaining a 76.23% recognition rate with all 276 features and 80.53% with the top 75 features selected by SVM SFFS. Bitouk et al. [109] used spectral features in an SVM and obtained 81.3% recognition accuracy. Wu et al. [115] used prosodic, speaking-rate, ZCR, and TEO features with an SVM and achieved 91.3% accuracy by combining spectral modulation features with prosodic features. Using a Bayesian classifier, Yang et al. [117] combined prosodic, spectral, and speech consistency features and obtained an overall identification score of 73.5%. Chinese Emotional Speech Corpus (CASIA): Pan et al. [144] used features such as MFCC, LPCC, and MEDC with an SVM classifier to classify 6 emotions and achieved a highest accuracy of 91.30%. Wang et al. [116] used Fourier parameters and MFCC features in an SVM to attain a 79% recognition rate. Danish Emotional Speech Database (DES): Xiao et al. [145] presented a harmonic and Zipf feature-based two-stage emotion classifier for 6 emotions. At stage 1, classifier 1 separates active from non-active emotions and classifier 2 is
used for classifying the non-active emotions into median and passive classes. At stage 2, three classifiers take the output of the previous stage and classify the 6 emotions: anger, gladness, fear, neutral, sadness, and boredom. An 81% recognition rate is obtained in this process. Mao et al. [111] used features learned automatically by a CNN, with a CNN classifier, to obtain 79.9% accuracy. Ververidis et al. [118] used features such as formants, pitch, and the energy contour to obtain a 48.5% recognition rate with a Gaussian Mixture Model (GMM); testing males and females separately, they achieved 56% and 50.9% accuracy respectively. Zhang et al. [119] used unsupervised learning with energy, pitch, voice quality, and MFCC features and achieved 66.8% and 58.2% accuracy for arousal and valence classification respectively. Chinese Elderly Emotional Speech Database (EESDB): On this dataset, Wang et al. [116] used four feature groups in an SVM and obtained a 76% recognition rate. The four feature groups are MFCC (Mel Frequency Cepstral Coefficients), FP (Fourier Parameters), MFCC+FP, and the openSMILE feature set, evaluated on the EmoDB and EESDB corpora. They also investigated the heuristic Harmony Search (HS) algorithm for feature selection, and their method showed a good recognition rate for all emotions except happiness. Figure 9.5 shows the average accuracy comparison of the SER systems on the mentioned datasets. Electromagnetic Articulography (EMA): Sahu et al. [149] proposed an adversarial training method to ensure that the output probability distribution changes smoothly with the input, using manifold regularization to achieve this smoothness. They used two training procedures: adversarial training, which is based on the labels of the training data, and virtual adversarial training, which uses the output distribution of the training data. They achieved 61.65% accuracy with an adversarially trained DNN, outperforming a vanilla DNN. In another work on this dataset, Grim et al. [121] used a fuzzy estimator and an SVM that take pitch-related, speaking-rate-related, and spectral features as input, achieving a mean error of 0.19.
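Several of the systems discussed above (for example the MFCC-plus-SVM classifiers reported on EmoDB, CASIA, and EESDB) share the same basic recipe: summarize frame-level spectral features per utterance and train an SVM on the result. The sketch below is one hedged rendering of that recipe using librosa and scikit-learn; wav_files, labels, the mean and standard-deviation statistics, and the RBF kernel with C = 10 are illustrative assumptions, not the exact setup of any cited paper.

```python
import numpy as np
import librosa                         # assumed available for MFCC extraction
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_features(wav_path, sr=16000, n_mfcc=13):
    """Summarize one utterance as the mean and std of its frame-level MFCCs."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def train_ser_svm(wav_files, labels):
    """Train an SVM emotion classifier on utterance-level MFCC statistics.
    wav_files and labels are placeholders for an emotion corpus such as EmoDB."""
    X = np.vstack([utterance_features(p) for p in wav_files])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
    clf.fit(X, labels)
    return clf
```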
LDC Emotional Speech Database: Kim et al. [146] also evaluated their Multi-Task Learning (MTL) method (described above for EmoDB) on this corpus, again finding that both the LSTM-based and the vanilla DNN-based MTL variants, with gender and naturalness as auxiliary tasks, outperformed Single-Task Learning (STL) in within-corpus and cross-corpus setups. Sahu et al. [149] applied their adversarial and virtual adversarial training procedures to this dataset as well and achieved 43.18% accuracy with an adversarially trained DNN. FAU AIBO Emotion Corpus: Lee et al. [124] proposed a hierarchical binary decision tree approach for distinguishing 11 emotion classes, in which the input passes through multiple Bayesian Logistic Regression binary classification layers; this method achieved an improvement of 7.44% over a vanilla SVM. Kim et al. [146] also reported results of their MTL method on this corpus, with the same conclusion that MTL outperforms STL. Latif et al. [148] likewise applied their Deep Belief Network here, and it again outperformed the sparse autoencoder and SVM. Kwon et al. [123] used a GSVM and an HMM with prosodic and spectral features and achieved 42.3% and 40.9% accuracy respectively.
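As one way to picture the multi-task idea described above (a shared sequence encoder with an emotion head and an auxiliary gender/naturalness head), the following PyTorch sketch defines a small LSTM-based multi-task model. The layer sizes, the 40-dimensional log-Mel-like input, and the 0.5 auxiliary-loss weight are assumptions made for illustration; the sketch is not a reconstruction of the cited models.

```python
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    """Shared LSTM encoder with one output head per task."""
    def __init__(self, n_features=40, hidden=128, n_emotions=7, n_aux=2):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.emotion_head = nn.Linear(hidden, n_emotions)   # main task
        self.aux_head = nn.Linear(hidden, n_aux)             # auxiliary task, e.g. gender

    def forward(self, x):                 # x: (batch, time, n_features)
        _, (h, _) = self.encoder(x)       # last hidden state summarizes the utterance
        h = h.squeeze(0)
        return self.emotion_head(h), self.aux_head(h)

def multitask_loss(emo_logits, aux_logits, emo_y, aux_y, aux_weight=0.5):
    """Weighted sum of the main-task and auxiliary-task cross-entropy losses."""
    ce = nn.CrossEntropyLoss()
    return ce(emo_logits, emo_y) + aux_weight * ce(aux_logits, aux_y)

# Toy forward/backward pass on random log-Mel-like features
model = MultiTaskSER()
x = torch.randn(8, 100, 40)                       # 8 utterances, 100 frames, 40 features
emo_y, aux_y = torch.randint(0, 7, (8,)), torch.randint(0, 2, (8,))
loss = multitask_loss(*model(x), emo_y, aux_y)
loss.backward()
```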
9.2.3 Contactless Physiological Signal Based Emotion Recognition
The main drawbacks of contactless physiological emotion recognition systems are their comparatively low accuracy and their latency. However, they are user friendly and do not restrict human activity, which makes them well suited to day-to-day life. In this chapter, we consider signals such as Heart Rate Variability (HRV), Skin Temperature (SKT), and Respiration (RSP) as physiological signals for which both contact and non-contact measurement methods exist. Various studies and methods have been developed to model the relationship between
physiological signals and emotions; some of these are described in the following paragraphs. The time interval between successive heartbeats varies in a characteristic pattern, and Heart Rate Variability (HRV) measures this variation. The autonomic nervous system has two branches, the sympathetic and the parasympathetic nervous system, which control heart rate variability through their combined action: parasympathetic nerves slow the heart rate, whereas sympathetic nerves speed it up. The net effect of these processes is reflected in the heart rate, and factors such as stress and physical exercise can influence these changes [153, 154]. Remote photoplethysmography (rPPG) provides a contactless way of measuring HRV. Emotional states can also be monitored through the respiration (RSP) rate, which is related to the respiratory system and thoracic activity. Different emotions are associated with different respiration depths and rates: deep and fast breathing indicates excitement with happiness, shallow and fast breathing indicates anxiety, deep and slow breathing shows relaxation, and slow and shallow breathing indicates calmness. A 2017 study developed deep learning algorithms on the DEAP dataset, in which respiration is recorded alongside other modalities, and achieved 73.06% and 80.78% accuracy for valence and arousal respectively [35]. Skin temperature (SKT) is a particularly suitable biosignal for automatic emotion recognition because it reflects the reaction of the autonomic nervous system, which is not under voluntary control, and it is also a good indicator of heart activity. Blood vessels contract or expand in different emotional states, and this can be observed by measuring the thermal radiation of the skin surface; whether a person is relaxed or not can thus be identified through SKT. Studies have shown that skin temperature is higher during the expression of low-intensity negative emotions than during the expression of low-intensity positive emotions. Park et al. [36] developed a method to detect sadness and happiness using SKT and achieved an 89.29% recognition rate.
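As a concrete illustration of the HRV measures mentioned above, the sketch below computes standard time-domain HRV statistics from a series of inter-beat intervals. It assumes the beat-to-beat intervals have already been extracted (for example by an rPPG pipeline), and the example interval values are synthetic.

```python
import numpy as np

def hrv_time_features(ibi_ms):
    """Basic time-domain HRV features from inter-beat intervals given in milliseconds."""
    ibi = np.asarray(ibi_ms, dtype=float)
    diffs = np.diff(ibi)
    return {
        "mean_hr_bpm": 60000.0 / ibi.mean(),          # average heart rate
        "sdnn_ms": ibi.std(ddof=1),                   # overall variability
        "rmssd_ms": np.sqrt(np.mean(diffs ** 2)),     # short-term (parasympathetic) variability
        "pnn50": np.mean(np.abs(diffs) > 50) * 100.0, # % of successive differences > 50 ms
    }

# Example: a slightly irregular beat series around 75 bpm
print(hrv_time_features([800, 790, 810, 820, 780, 805, 795]))
```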
9.2.4 Multimodal Emotion Analysis
Recent research in emotion recognition shows that no single method is ideal for all cases and that the best solution is multimodal analysis. When several methods are used together they complement each other, which yields more reliable results; therefore, multimodal approaches are now most commonly used for emotion recognition. In addition, many datasets include audio, video, and other data, which helps to better capture emotional changes. Table 9.4 lists some notable multimodal databases along with their type, modalities, target emotions, and some eminent works on them.
Table 9.4 List of eminent datasets for multimodal emotion recognition

Database | Type | Modalities | Emotions | Authors
The Interactive Emotional Dyadic Motion Capture Database (IEMOCAP) [125] | Acted | Audio/Visual | Happiness, anger, sadness, frustration, neutral | Han et al. [126], Lee et al. [124], Mirsamadi et al. [127]
Surrey Audio-Visual Expressed Emotion (SAVEE) [128] | Acted | Audio/Visual | Happiness, sadness, anger, disgust, fear, surprise, neutral | Mao et al. [111]
eNTERFACE'05 audio-visual emotion database [129] | Natural | Audio/Visual | Fear, happiness, anger, disgust, sadness, surprise | Deng et al. [114], Zhang et al. [119]
RECOLA speech database [130] | Natural | Audio/Visual | Engagement, dominance, agreement, performance, rapport | Trigeorgis et al. [131]
Vera Am Mittag Database (VAM) [132] | Natural | Audio/Visual | Dominance, activation, and valence | Deng et al. [114], Grim et al. [121], Schuller et al. [133], Wu et al. [115], Zhang et al. [119]
TUM AVIC database [134] | Natural | Audio/Visual | Consent, breathing, garbage, laughter, hesitation | Deng et al. [114], Zhang et al. [119]
SEMAINE database [135] | Natural | Audio/Visual | Power, expectation, valence, activation | Kaya et al. [136]
DEAP [20] | Induced | EEG, EMG, RSP, GSR, EOG and face video | Valence, arousal, dominance of movie-watching people | Chen et al. [139], Tong et al. [140]
ASCERTAIN [137] | Induced | GSR, EEG, ECG, HRV, facial expressions | High/low valence and arousal | Subramanian et al. [143]
HUMAINE [152] | Natural, Induced | Face, speech, gesture | Pervasive emotion | Wollmer et al. [141], Caridakis et al. [142]
MAHNOB database [138] | Induced | Face, audio, eye gaze, physiological | Arousal, valence, dominance and predictability | Liu et al. [150], Koelstra et al. [151]
9.3 Challenges
This chapter has covered much of the landscape of emotion recognition systems: a great deal of work has gone into the various methods of emotion identification, their datasets, and the different approaches for studying them. However, emotion identification still poses many difficulties and unresolved problems, some of which are discussed in the following subsections.
9.3.1 Challenges of FER
Conventional FER frameworks have various issues, particularly regarding accuracy. Although FER systems based on deep learning have made extraordinary progress, a few issues remain that require further examination. First, an enormous dataset and high-end processors are required to train a complex deep neural network with many layers. Second, these enormous datasets are created through human annotation, which requires huge time and effort. Third, a significant amount of memory is required, and without it rapid training and testing become impossible; the large memory and computational costs make deep learning unsuitable for portable devices with limited computational power. Fourth, considerable expertise and experience are required to choose appropriate hyperparameters, such as the learning rate, the filter sizes of the convolutional layers, and the number of layers; these hyperparameters may depend on other factors, which makes tuning especially costly. Lastly, it is often difficult to understand why or how deep neural networks achieve such good performance.
9.3.2 Challenges of SER
Although SER has developed a great deal over the previous decade, several obstacles remain to be removed. One of the most significant issues is dataset generation. The majority of the datasets used for SER are acted or induced and are recorded in controlled environments and quiet rooms. Real-life situations, by contrast, are very challenging, so realistic data become noise prone and difficult to handle. Natural datasets exist, but their number and size are not sufficient for analysis or for applying machine learning and deep learning algorithms. Ethical and legal issues are major obstacles to creating natural datasets: a large portion of them are gathered from talk shows or TV and radio recordings in which the participants know in advance that their speech is being recorded, so they do not talk naturally and their speech does not reflect their true emotions. The data are then labeled by human annotators, and a few important issues remain there as well. Sometimes
a huge difference arises between the labeled emotion and the emotion actually felt by the participants, so incorrectly labeled datasets are produced, adding further complexity. Another issue is that there is no standard SER system that works irrespective of language and culture: a specific SER system has to be developed for each language and culture, and no general system tackles this problem.
9.3.3 Challenges of Contactless Physiological Signal Based Emotion Recognition
Even though physiological signals are progressively yielding more accurate results, various issues remain in this field. First of all, obtaining high-quality physiological data that truly reflect different emotional states requires a well-equipped lab environment. Moreover, eliciting authentic emotional feelings is difficult because it largely depends on the stimulation environment, so more work should focus on effective data generation. The stimulation material and environment are set manually and identically for all participants, regardless of individual differences in feeling, and there is no standard environmental setup or experimental procedure for gathering such data. Numerous studies have tried to find features of physiological signals that truly reflect emotional states, but scientists have not yet found the most effective features of any physiological signal for detecting emotions with high accuracy. Another issue is the small number of subjects in experiments, which limits the size of the data; usually data are gathered from 5 to 20 subjects, which is barely enough. Lastly, several open questions remain in the pre-processing and analysis strategies, including the choice of classifiers.
9.3.4 Challenges of Multimodal Emotion Recognition
Multimodal emotion recognition yields better performance than the individual modalities used separately. However, this approach also has limitations and challenges. One challenging problem is modeling the interactions between different modalities that alter the recognized emotion, a problem known as inter-modality dynamics. For instance, the sentence "This movie is sick" may carry a positive or a negative connotation on its own; if the speaker smiles while saying it, it will be read as positive, whereas the same sentence spoken with a frown will be regarded as negative. This is a simple example of bimodal interaction, and things become more complicated when many modalities are involved. Although several efforts have been made to solve this issue, it remains an area to explore. The second challenge in this area is accurately exploring the
changes and complexities within a specific emotion modality in different environments, a phenomenon known as intra-modality dynamics. For example, the ambiguous characteristics of spoken language, where proper rules and structure are often ignored, introduce complexity; other modalities such as facial and physiological signal-based emotion recognition have the same kind of problem. Previous work in multimodal emotion analysis used either feature-level fusion or decision-level fusion: feature-level fusion is usually performed early, during pre-processing, while decision-level fusion is performed late, during classification (the sketch below illustrates the contrast). These fusion methods do not resolve inter- and intra-modality dynamics, so these problems need considerable attention. Besides these challenges, there are social and ethical issues: an emotion recognition system should not violate the user's privacy, and individual data should be protected so that cyber-crime cannot occur. These are some of the common challenges any emotion recognition system may face, and more research is required to solve them. Human emotions are very complex and have many dimensions; various combinations of emotions give rise to new emotional states, but existing models can recognize only a few simple emotions. More studies are required to detect these complex emotions, and we hope multimodal emotion recognition models will help resolve these problems, leading to more advanced and sophisticated emotion recognition systems.
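To make the distinction between the two fusion strategies concrete, the sketch below contrasts feature-level (early) fusion, which concatenates per-modality feature vectors before classification, with decision-level (late) fusion, which combines per-modality class probabilities after classification. The array sizes and probability values are arbitrary illustrations, not taken from any cited system.

```python
import numpy as np

def early_fusion(feature_sets):
    """Feature-level (early) fusion: concatenate per-modality feature vectors
    so that a single classifier sees all modalities at once."""
    return np.concatenate(feature_sets, axis=-1)

def late_fusion(probabilities, weights=None):
    """Decision-level (late) fusion: average per-modality class probabilities."""
    probs = np.stack(probabilities)                  # (n_modalities, n_classes)
    return np.average(probs, axis=0, weights=weights)

# Toy example with two modalities and three emotion classes
audio_feat, face_feat = np.random.rand(64), np.random.rand(128)
fused_features = early_fusion([audio_feat, face_feat])            # -> 192-dim vector
audio_p, face_p = np.array([0.7, 0.2, 0.1]), np.array([0.4, 0.5, 0.1])
print(fused_features.shape, late_fusion([audio_p, face_p]))
```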
9.4 Conclusion
In this chapter, various emotion recognition modalities have been surveyed along with their datasets and methods. These modalities include emotion recognition through the face, speech, and various physiological signals. The results of the reviewed studies make it evident that automatic emotion analysis using multimodal approaches produces very good results. Such methodologies may lead to the rapid development of easier-to-use systems that help in daily life and work. Automatic multimodal emotion analysis requires sophisticated algorithms and models, mainly based on classical machine learning and deep learning methods. The state-of-the-art methods presented here are successful, but there are still gaps in current knowledge and challenges in multimodal emotion measurement and analysis. An emotion recognition system should be portable, non-invasive, and cheap, and novel systems should interact with users in a friendly manner. We have discussed some of the most significant issues of the state-of-the-art methods, and further effort from scientists and researchers is certainly required toward more mature automatic emotion recognition systems. We hope new research will solve many of the existing challenges and problems, leading to the successful application of emotion recognition systems in many areas.
References 1. Ekman, P., Oster, H.: Facial expressions of emotion. Ann. Rev. Psychol. 30(1), 527–554 (1979) 2. Ekman, P., Friesen, W.V., Ellsworth, P.: Emotion in the Human Face: Guidelines for Research and an Integration of Findings, 1st edn. Elsevier (1972) 3. Posner, J., Russell, J.A., Peterson, B.S.: The circumplex model of affect: an integrative approach to affective neuroscience, cognitive development, and psychopathology. Dev Psychopathol 17, 715–734 (2005) 4. Wundt, W.: Principles of physiological psychology. In: Readings in the History of Psychology, pp. 248–250. Appleton-Century-Crofts, Connecticut, USA (1948) 5. De Nadai, S., D’Incà, M., Parodi, F., Benza, M., Trotta, A., Zero, E., Zero, L., Sacile, R.: Enhancing safety of transport by road by on-line monitoring of driver emotions. In: 11th System of Systems Engineering Conference (SoSE), vol. 2016, pp. 1–4. Kongsberg (2016) 6. Guo, R., Li, S., He, L., Gao, W., Qi, H., Owens, G.: Pervasive and unobtrusive emotion sensing for human mental health. In: Proceedings of the 7th International Conference on Pervasive Computing Technologies for Healthcare, pp. 436–439, Italy, Venice (2013) 7. Verschuere, B., Crombez, G., Koster, E., Uzieblo, K.: Psychopathy and physiological detection of concealed information: a review. Psychologica Belgica 46, 99–116 (2006) 8. Marechal, C., et al.: Survey on AI-based multimodal methods for emotion detection. In: HighPerformance Modelling and Simulation for Big Data Applications, pp. 307-324. Springer (2019) 9. Sebe, N., Cohen, I., Gevers, T., Huang, T.S.: Multimodal approaches for emotion recognition: a survey. In: Proceedings of SPIE—The International Society for Optical Engineering (2004) 10. Mukeshimana, M., Ban, X., Karani, N., Liu, R.: Multimodal emotion recognition for humancomputer interaction: a survey. Int. J. Sci. Eng. Res. 8(4), 1289–1301 (2017) 11. Xu, T., Zhou, Y., Wang, Z., Peng, Y.: Learning emotions EEG-based recognition and brain activity: a survey study on BCI for intelligent tutoring system. In: The 9th International Conference on Ambient Systems. Networks and Technologies (ANT 2018) and the 8th International Conference on Sustainable Energy Information Technology (SEIT-2018), pp. 376–382, Porto, Portugal (2018) 12. Corneanu, C.A., Simón, M.O., Cohn, J.F., Guerrero, S.E.: Survey on RGB, 3D, thermal, and multimodal approaches for facial expression recognition: history, trends, and affect-related applications. IEEE Trans. Pattern Anal. Mach. Intell. 38(8), 1548–1568 (2016) 13. Samadiani, N., Huang, G., Cai, B., Luo, W., Chi, H., Xiang, Y., He, J.: A review on automatic facial expression recognition systems assisted by multimodal sensor data. Sensors 19(8), 1863–1890 (2019) 14. Shu, L., Xie, J., Yang, M., Li, Z., Liao, D., Xu, X., Yang, X.: A review of emotion recognition using physiological signals. Sensors 18(7), 2074–2115 (2018) 15. Ko, B.C.: A brief review of facial emotion recognition based on visual information. Sensors 18(7), 2074–2115 (2018) 16. Sailunaz, K., Dhaliwal, M., Rokne, J., Alhajj, R.: Emotion detection from text and speech: a survey. Soc. Netw. Anal. Mining 8(28) (2018) 17. Oh, Y., See, J., Anh, C.L., Phan, R.C., Baskaran, M.V.: A survey of automatic facial microexpression analysis: databases, methods, and challenges. Front. Psychol. 9, 1128–1149 (2018) 18. Suk, M., Prabhakaran, B.: Real-time mobile facial expression recognition system—a case study. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 132-137. 
Columbus, OH, USA (2014) 19. Pohjalainen, J., Ringeval, F., Zhang, Z., Schuller, B.: Spectral and cepstral audio noise reduction techniques in speech emotion recognition. In: Proceedings of the 24th ACM International Conference on Multimedia, pp. 670–674 (2016) 20. Koelstra, S., et al.: DEAP: a database for emotion analysis using physiological signals. IEEE Trans. Affect. Comput. 3(1), 18–31 (2012)
9 Contactless Human Emotion Analysis Across Different Modalities
263
21. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012) 22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014). arXiv preprint arXiv:1409.1556 23. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9 (2015) 24. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778 (2016) 25. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Trans. Pattern Anal. Mach. Intell. 23(6), 681–685 (2001) 26. Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2879-2886 (2012) 27. Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Robust discriminative response map fitting with constrained local models. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3444–3451 (2013) 28. Xiong, X., De la Torre, F.: Supervised descent method and its applications to face alignment. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 532–539 (2013) 29. Ren, S., Cao, X., Wei, Y., Sun, J.: Face alignment at 3000 fps via regressing local binary features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1685–1692 (2014) 30. Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Incremental face alignment in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1859–1866 (2014) 31. Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: European Conference on Computer Vision, pp. 94–108 (2014) 32. Zhang, K., Zhang, Z., Li, Z., Qiao, Y.: Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Process. Lett. 23(10), 1499–1503 (2016) 33. Kuo, C.-M., Lai, S.-H., Sarkis, M.: A compact deep learning model for robust facial expression recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2121–2129 (2018) 34. Pitaloka, D.A., Wulandari, A., Basaruddin, T., Liliana, D.Y.: Enhancing CNN with preprocessing stage in automatic emotion recognition. Procedia Computer Science 116, 523–529 (2017) 35. Zhang, Q., Chen, X., Zhan, Q., Yang, T., Xia, S.: Respiration-based emotion recognition with deep learning. Comput. Indus. 92–93, 84–90 (2017) 36. Park, M.W., Kim, C.J., Hwang, M., Lee, E.C.: Individual emotion classification between happiness and sadness by analyzing photoplethysmography and skin temperature. In: Proceedings of the 2013 4th World Congress on Software Engineering, pp. 190–194 (2013) 37. Ouellet, S.: Real-time emotion recognition for gaming using deep convolutional network features (2014). arXiv preprint arXiv:1408.3750 38. Li, J., Lam, E.Y.: Facial expression recognition using deep neural networks. In: IEEE International Conference on Imaging Systems and Techniques (IST), pp. 1–6 (2015) 39. Liu, P., Han, S., Meng, Z., Tong, Y.: Facial expression recognition via a boosted deep belief network. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1805–1812 (2014) 40. Liu, M., Li, S., Shan, S., Chen, X.: Au-aware deep networks for facial expression recognition. In: 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–6 (2013)
264
N. Nahid et al.
41. Liu, M., Li, S., Shan, S.: Au-inspired deep networks for facial expression feature learning. Neurocomputing 159, 126–136 (2015) 42. Khorrami, P., Paine, T., Huang, T.: Do deep neural networks learn facial action units when doing expression recognition? (2015). arXiv preprint arXiv:1510.02969v3 43. Ding, H., Zhou, S.K., Chellappa, R.: Facenet2expnet: regularizing a deep face recognition net for expression recognition. In: 12th IEEE International Conference on Automatic Face & Gesture Recognition, pp. 118–126 (2017) 44. Zeng, N., Zhang, H., Song, B., Liu, W., Li, Y., Dobaie, A.M.: Facial expression recognition via learning deep sparse autoencoders. Neurocomputing 273, 643–649 (2018) 45. Cai, J., Meng, Z., Khan, A.S., Li, Z., O’Reilly, J., Tong, Y.: Island loss for learning discriminative features in facial expression recognition. In: 13th IEEE International Conference on Automatic Face & Gesture Recognition, pp. 302–309 (2018) 46. Meng, Z., Liu, P., Cai, J., Han, S., Tong, Y.: Identity-aware convolutional neural network for facial expression recognition. In: 12th IEEE International Conference on Automatic Face & Gesture Recognition, pp. 558–565 (2017) 47. Liu, X., Kumar, B., You, J., Jia, P.: Adaptive deep metric learning for identity-aware facial expression recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 522–531 (2017) 48. Yang, H., Ciftci, U., Yin, L.: Facial expression recognition by de-expression residue learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2168–2177 (2018) 49. Zhang, Z., Luo, P., Chen, C.L., Tang, X.: From facial expression recognition to interpersonal relation prediction. Int. J. Comput. Vis. 126(5), 1–20 (2018) 50. Zhao, G., Pietikinen, M.: Boosted multi-resolution spatiotemporal descriptors for facial expression recognition. Pattern Recogn. Lett. 30, 1117–1127 51. Song, M., Tao, D., Liu, Z., Li, X., Zhou, M.: Image ratio features for facial expression recognition application. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 40, 779–788 (2010) 52. Zhang, L., Tjondronegoro, D.: Facial expression recognition using facial movement features. IEEE Trans. Affect. Comput. 2(4), 219–229 (2011) 53. Poursaberi, A., Noubari, H.A., Gavrilova, M., Yanushkevich, S.N.: Gauss-Laguerre wavelet textural feature fusion with geometrical information for facial expression identification. EURASIP J. Image Video Process. 1–13 (2012) 54. Ji, Y., Idrissi, K.: Automatic facial expression recognition based on spatiotemporal descriptors. Pattern Recogn. Lett. 33, 1373–1380 (2012) 55. Ucar, A., Demir, Y., Guzelis, C.: A new facial expression recognition based on curvelet transform and online sequential extreme learning machine initialized with spherical clustering. Neural Comput. Appl. 27, 131–142 (2014) 56. Zhang, L., Tjondronegoro, D., Chandran, V.: Random Gabor based templates for facial expression recognition in images with facial occlusion. Neurocomputing 145, 451–464 (2014) 57. Mahersia, H., Hamrouni, K.: Using multiple steerable filters and Bayesian regularization for facial expression recognition. Eng. Appl. Artif. Intell. 38, 190–202 (2015) 58. Happy, S.L., Member, S., Routray, A.: Automatic facial expression recognition using features of salient facial patches. IEEE Trans. Affect. Comput. 6, 1–12 (2015) 59. Biswas, S.: An efficient expression recognition method using contourlet transform. 
In: Proceedings of the 2nd International Conference on Perception and Machine Intelligence, pp. 167–174 (2015) 60. Siddiqi, M.H., Ali, R., Khan, A.M., Park, Y., Lee, S.: Human facial expression recognition using stepwise linear discriminant analysis and hidden conditional random fields. IEEE Trans. Image Process. 24(4), 1386–1398 (2015) 61. Cossetin, M.J., Nievola , J.C., Koerich, A.L.: Facial expression recognition using a pairwise feature selection and classification approach. In: 2016 International Joint Conference on Neural Networks (IJCNN), Vancouver, BC, pp. 5149–5155 (2016) 62. Salmam, F.Z., Madani, A., Kissi, M.: Facial expression recognition using decision trees. In: 2016 13th International Conference on Computer Graphics, Imaging and Visualization (CGiV), Beni Mellal, pp. 125–130 (2016)
9 Contactless Human Emotion Analysis Across Different Modalities
265
63. Kumar, S., Bhuyan, M.K., Chakraborty, B.K.: Extraction of informative regions of a face for facial expression recognition. IET Comput. Vis. 10(6), 567–576 (2016) 64. Mollahosseini, A., Chan, D., Mahoor, M.H.: Going deeper in facial expression recognition using deep neural networks. In: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, pp. 1–10 (2016) 65. Li, S., Deng, W., Du, J.: Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, pp. 2584–2593 (2017) 66. Reed, S., Sohn, K., Zhang, Y., Lee, H.: Learning to disentangle factors of variation with manifold interaction. In: International Conference on Machine Learning, pp. 1431–1439 (2014) 67. Devries, T., Biswaranjan, K., Taylor, G.W.: Multi-task learning of facial landmarks and expression. In: 2014 Canadian Conference on Computer and Robot Vision, Montreal, QC, pp. 98–103 (2014) 68. Tang, Y.: Deep learning using linear support vector machines (2013). arXiv preprint arXiv:1306.0239 69. Zhang, Z., Luo, P., Loy, C.-C., Tang, X.: Learning social relation traits from face images. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3631–3639 (2015) 70. Guo, Y., Tao, D., Yu, J., Xiong, H., Li, Y., Tao, D.: Deep neural networks with relativity learning for facial expression recognition. In: 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Seattle, WA, pp. 1–6 (2016) 71. Kim, B.-K., Dong, S.-Y., Roh, J., Kim, G., Lee, S.-Y.: Fusing aligned and non-aligned face information for automatic affect recognition in the wild: a deep learning approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 48–57 (2016) 72. Pramerdorfer, C., Kampel, M.: Facial expression recognition using convolutional neural networks: state-of-the-art (2016). arXiv preprint arXiv:1612.02903 73. Hamester, D., Barros, P., Wermter, S. (2015) Face expression recognition with a 2-channel convolutional neural network. In: 2015 International Joint Conference on Neural Networks (IJCNN), Killarney, pp. 1–8 (2015) 74. Noh, S., Park, H., Jin, Y., Park, J.: Feature-adaptive motion energy analysis for facial expression recognition. In: International Symposium on Visual Computing, pp. 452–463 (2007) 75. Bashyal, S., Venayagamoorthy, G.K.: Recognition of facial expressions using Gabor wavelets and learning vector quantization. Eng. Appl. Artif. Intell. 21, 1056–1064 (2008) 76. Wang, H., Hu, Y., Anderson, M., Rollins, P., Makedon, F.: Emotion detection via discriminative Kernel method. In: Proceedings of the 3rd International Conference on Pervasive Technologies Related to Assistive Environments (2010) 77. Owusu, E., Zhan, Y., Mao, Q.R.: A neural-Ada boost based facial expression recognition system. Expert Syst. Appl. 41, 3383–3390 (2014) 78. Dahmane, M., Meunier, J.: Prototype-based modeling for facial expression analysis. IEEE Trans. Multimedia 16(6), 1574–1584 (2014) 79. Hegde, G.P., Seetha, M., Hegde, N.: Kernel locality preserving symmetrical weighted fisher discriminant analysis based subspace approach for expression recognition. Eng. Sci. Technol. Int. J. 19, 1321–1333 (2016) 80. Levi, G., Hassner, T.: Emotion recognition in the wild via convolutional neural networks and mapped binary patterns. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 503–510 (2015) 81. 
Ng, H.-W., Nguyen, V.D., Vonikakis, V., Winkler, S.: Deep learning for emotion recognition on small datasets using transfer learning. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 443–449 (2015) 82. Kim, B.-K., Lee, H., Roh, J., Lee, S.-Y.: Hierarchical committee of deep cnns with exponentially-weighted decision fusion for static facial expression recognition. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 427–434 (2015)
266
N. Nahid et al.
83. Yu, Z., Zhang, C.: Image based static facial expression recognition with multiple deep network learning. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 435–442 (2015) 84. Mandal, T., Majumdar, A., Wu, Q.J.: Face recognition by curvelet based feature extraction. In: Proceedings of the International Conference Image Analysis and Recognition, Montreal, pp. 806–817 (2007) 85. Mohammed, A.A., Minhas, R., Wu, Q.J., Sid-Ahmed, M.A.: Human face recognition based on multidimensional PCA and extreme learning machine. Pattern Recogn. 44, 2588–2597 (2011) 86. Lee, S.H., Plataniotis, K.N., Ro, Y.M.: Intra-class variation reduction using training expression images for sparse representation based facial expression recognition. IEEE Trans. Affect. Comput. 5(3), 340–351 (2014) 87. Zheng, W.: Multi-view facial expression recognition based on group sparse reduced-rank regression. IEEE Trans. Affect. Comput. 5(1), 71–85 (2014) 88. Benitez-Quiroz, C.F., Srinivasan, R., Martinez, A.M.: EmotioNet: an accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 5562– 5570 (2016) 89. Hasani, B., Mahoor, M.H.: Facial expression recognition using enhanced deep 3D convolutional neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, pp. 2278–2288 (2017) 90. Zhao, K., Chu, W., Zhang, H.: Deep region and multi-label learning for facial action unit detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 3391–3399 (2016) 91. Kanade, T., Cohn, J.F., Tian, Y.: Comprehensive database for facial expression analysis. In: Proceedings Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, pp. 46–53 (2000) 92. Lucey, P., Cohn, J.F., Kanade, T., Saragih, J., Ambadar, Z., Matthews, I.: The extended CohnKanade dataset (CK+): a complete dataset for action unit and emotion-specified expression. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition— Workshops, San Francisco, CA, pp. 94–101 (2010) 93. Pantic, M., Valstar, M., Rademaker, R., Maat, L.: Web-based database for facial expression analysis. In: 2005 IEEE International Conference on Multimedia and Expo, Amsterdam, pp. 5–10 (2005) 94. Valstar, M., Pantic, M.: Induced disgust, happiness and surprise: an addition to the mmi facial expression database. In: Proceedings of 3rd International Workshop on EMOTION (satellite of LREC): Corpora for Research on Emotion and Affect, p. 65 (2010) 95. Susskind, J.M., Anderson, A.K., Hinton, G.E.: The toronto face database. Department of Computer Science, University of Toronto, Toronto, ON, Canada. Technical Report, vol. 3 (2010) 96. Goodfellow, I.J., Erhan, D., Carrier, P.L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.-H., et al.: Challenges in representation learning: a report on three machine learning contests. In: International Conference on Neural Information Processing, pp. 117–124 (2013) 97. Lyons, M., Akamatsu, S., Kamachi, M., Gyoba, J.: Coding facial expressions with gabor wavelets. In: Third IEEE International Conference on Automatic Face and Gesture Recognition, pp. 200–205 (1998) 98. Dhall, A., Murthy, O.R., Goecke, R., Joshi, J., Gedeon, T.: Video and image based emotion recognition challenges in the wild: EmotioW 2015. 
In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pp. 423–426 (2015) 99. Yin, L., Wei, X., Sun, Y., Wang, J., Rosato, M.J.: A 3d facial expression database for facial behavior research. In: 7th International Conference on Automatic Face and Gesture Recognition (FGR06), Southampton, pp. 211–216 (2006)
9 Contactless Human Emotion Analysis Across Different Modalities
267
100. Mavadati, S.M., Mahoor, M.H., Bartlett, K., Trinh, P., Cohn, J.F.: DISFA: a spontaneous facial action intensity database. IEEE Trans. Affect. Comput. 4(2), 151–160 (2013) 101. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W.F., Weiss, B.: A database of German emotional speech. In: INTERSPEECH, pp. 1517–1527 (2005) 102. Tao, J., Liu, F., Zhang, M., Jia, H.: Design of speech corpus for mandarin text to speech. In: The Blizzard Challenge 2008 Workshop (2008) 103. Engberg, I.S., Hansen, A.V., Andersen, O., Dalsgaard, P.: Design, recording and verification of a danish emotional speech database. In: Fifth European Conference on Speech Communication and Technology (1997) 104. Wang, K.X., Zhang, Q.L., Liao, S.Y.: A database of elderly emotional speech. Proc. Int. Symp. Signal Process. Biomed. Eng Inf. 549–553 (2014) 105. Lee, S., Yildirim, S., Kazemzadeh, A., Narayanan, S.: An articulatory study of emotional speech production. In: Ninth European Conference on Speech Communication and Technology (2005) 106. Emotional prosody speech and transcripts. http://olac.ldc.upenn.edu/item/oai:www.ldc. upenn.edu:LDC2002S28. Accessed15 May 2019 107. Batliner, A., Steidl, S., Noeth, E.: Releasing a thoroughly annotated and processed spontaneous emotional database: the FAU Aibo Emotion Corpus. In: Proceedings of a Satellite Workshop of LREC, p. 28 (2008) 108. Albornoz, E.M., Milone, D.H., Rufiner, H.L.: Spoken emotion recognition using hierarchical classifiers. Comput. Speech Lang. 25(3), 556–570 (2011) 109. Bitouk, D., Verma, R., Nenkova, A.: Class-level spectral features for emotion recognition. Speech Commun. 52, 613–625 (2010) 110. Borchert, M., Dusterhoft, A.: Emotions in speech-experiments with prosody and quality features in speech for use in categorical and dimensional emotion recognition environments. In: Proceedings of 2005 IEEE International Conference on Natural Language Processing and Knowledge Engineering, pp. 147–151 (2005) 111. Mao, Q., Dong, M., Huang, Z., Zhan, Y.: Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimedia 16, 2203–2213 (2014) 112. Schuller, B., Muller, R., Lang, M., Rigoll, G.: Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles. In: Ninth European Conference on Speech Communication and Technology (2005) 113. Shen, P., Changjun, Z., Chen, X.: Automatic speech emotion recognition using support vector machine. Int. Conf. Electron. Mech. Eng. Inf. Technol. (EMEIT) 2, 621–625 (2011) 114. Deng, J., Zhang, Z., Marchi, E., Schuller, B.: Sparse autoencoder-based feature transfer learning for speech emotion recognition. In: 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pp. 511–516 (2013) 115. Wu, S., Falk, T.H., Chan, W.-Y.: Automatic speech emotion recognition using modulation spectral features. Speech Commun. 53, 768–785 (2011) 116. Wang, K., An, N., Li, B.N., Zhang, Y., Li, L.: Speech emotion recognition using fourier parameters. IEEE Trans. Affect. Comput. 6, 69–75 (2015) 117. Yang, B., Lugger, M.: Emotion recognition from speech signals using new harmony features. Signal Process. 90(5), 1415–1423 (2010) 118. Ververidis, D., Kotropoulos, C.: Emotional speech classification using gaussian mixture models and the sequential floating forward selection algorithm. In: 2005 IEEE International Conference on Multimedia and Expo (ICME), vol. 7, pp. 1500–1503 (2005) 119. 
Zhang, Z., Weninger, F., Wollmer, M., Schuller, B.: Unsupervised learning in cross-corpus acoustic emotion recognition. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 523–528 (2011) 120. Busso, C., Lee, S., Narayanan, S.: Analysis of emotionally salient aspects of fundamental frequency for emotion detection. IEEE Trans. Audio Speech Lang. Process. 17(4), 582–596 (2009) 121. Grimm, M., Kroschel, K., Provost, E.M., Narayanan, S.: Primitives based evaluation and estimation of emotions in speech. Speech Commun. 49, 787–800 (2007)
268
N. Nahid et al.
122. Deng, J., Zhang, Z., Eyben, F., Schuller, B.: Autoencoder-based unsupervised domain adaptation for speech emotion recognition. IEEE Signal Process. Lett. 21(9), 1068–1072 (2014) 123. Kwon, O., Chan, K., Hao, J., Lee, T.-W. : Emotion recognition by speech signals. In: Eighth European Conference on Speech Communication and Technology (2003) 124. Lee, C., Mower, E., Busso, C., Lee, S., Narayanan, S.: Emotion recognition using a hierarchical binary decision tree approach. Speech Commun. 53, 1162–1171 (2011) 125. Iemocap database. https://sail.usc.edu/iemocap/. Accessed 15 May 2019 126. Han, K., Yu, D., Tashev, I.: Speech emotion recognition using deep neural network and extreme learning machine. In: Fifteenth Annual Conference of the International Speech Communication Association (2014) 127. Mirsamadi, S., Barsoum, E., Zhang, C.: Automatic speech emotion recognition using recurrent neural networks with local attention. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2227–2231 (2017) 128. Surrey audio-visual expressed emotion database. https://sail.usc.edu/iemocap/. Accessed 15 Feb 2020 129. Martin, O., Kotsia, I., Macq, B., Pitas, I.: The eNTERFACE’05 audio-visual emotion database. In: 22nd International Conference on Data Engineering Workshops (ICDEW’06), pp. 8–16 (2006) 130. Ringeval, F., Sonderegger, A., Sauer, J., Lalanne, D.: Introducing the recola multimodal corpus of remote collaborative and affective interactions. In: 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), pp. 1–8 (2013) 131. Trigeorgis, G., Ringeval, F., Brueckner, R., Marchi, E., Nicolaou, M.A., Schuller, B., Zafeiriou, S.: Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5200–5204 (2016) 132. Grimm, M., Kroschel, K., Narayanan, S.: The vera am Mittag German audiovisual emotional speech database. In: IEEE International Conference on Multimedia and Expo, pp. 865–868 (2008) 133. Schuller, B.: Recognizing affect from linguistic information in 3d continuous space. IEEE Trans. Affect. Comput. 2(4), 192–205 (2011) 134. Schuller, B., Muller, R., Eyben, F., Gast, J., Hornler, B., Wollmer, M., Rigoll, G., Hothker, A., Konosu, H.: Being bored? Recognising natural interest by extensive audiovisual integration for real-life application. Image Vis. Comput. 27(12), 1760–1774 (2009) 135. McKeown, G., Valstar, M., Cowie, R., Pantic, M., Schroder, M.: The SEMAINE database: annotated multimodal records of emotionally colored conversations between a person and a limited agent. IEEE Trans. Affect. Comput. 3(1), 5–17 (2011) 136. Kaya, H., Fedotov, D., Yesilkanat, A., Verkholyak, O., Zhang, Y., Karpov, A.: LSTM based cross-corpus and cross-task acoustic emotion recognition. In: Interspeech, pp. 521–525 (2018) 137. Subramanian, R., Wache, J., Khomami Abadi, M., Vieriu, R., Winkler, S., Sebe, N.: ASCERTAIN: emotion and personality recognition using commercial sensors. IEEE Trans. Affect. Comput. 1, (2016). https://doi.org/10.1109/TAFFC.2016.2625250 138. Soleymani, M., Lichtenauer, J., Pun, T., Pantic, M.: A multimodal database for affect recognition and implicit tagging. IEEE Trans. Affect. Comput. 3, 42–55 (2012) 139. Chen, J., Hu, B., Xu, L., Moore, P., Su, Y.: Feature-level fusion of multimodal physiological signals for emotion recognition. 
In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Washington, DC, pp. 395–399 (2015) 140. Tong, Z., Chen, X., He, Z., Tong, K., Fang, Z., Wang, X.: Emotion recognition based on photoplethysmogram and electroencephalogram. In: IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), Tokyo vol. 2018, pp. 402–407 (2018) 141. Wollmer, M., Eyben, F., Reiter, S., Schuller, B., Cox, C., Douglas-Cowie, E., Cowie, R.: Abandoning emotion classes-towards continuous emotion recognition with modelling of long-range dependencies. In: Proceedings of 9th Interspeech 2008 incorp. 12th Australian International Conference on Speech Science and Technology, SST 2008, Brisbane, Australia, pp. 597–600 (2008)
9 Contactless Human Emotion Analysis Across Different Modalities
269
142. Caridakis, G., Malatesta, L., Kessous, L., Amir, N., Raouzaiou, A., Karpouzis, K.: Modeling naturalistic affective states via facial and vocal expressions recognition. In: Proceedings of 8th International Conference Multimodal Interfaces, pp. 146–154 (2004) 143. Subramanian, R., Wache, J., Abadi, M.K., Vieriu, R.L., Winkler, S., Sebe, N.: ASCERTAIN: emotion and personality recognition using commercial sensors. IEEE Trans. Affect. Comput. 9, 147–160 (2018) 144. Pan, Y., Shen, P., Shen, L.: Speech emotion recognition using support vector machine. Int. J. Smart Home 6(2), 101–108 (2012) 145. Xiao, Z., Dellandrea, E., Dou, W., Chen, L.: Multi-stage classification of emotional speech motivated by a dimensional emotion model. Multimedia Tools Appl. 46(1), 119 (2010) 146. Kim, J., Englebienne, G., Truong, K.P., Evers, V.: Towards speech emotion recognition in the wild using aggregated corpora and deep multi-task learning (2017). arXiv preprint arXiv:1708.03920 147. Chen, M., He, X., Yang, J., Zhang, H.: 3-d convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process. Lett. 25(10), 1440– 1444 (2018) 148. Latif, S., Rana, R., Younis, S., Qadir, J., Epps, J.: Transfer learning for improving speech emotion classification accuracy (2018). arXiv preprint arXiv:1801.06353 149. Sahu, S., Gupta, R., Sivaraman, G., Espy-Wilson, C.: Smoothing model predictions using adversarial training procedures for speech based emotion recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4934–4938 (2018) 150. Liu, J., Su, Y., Liu, Y.: Multi-modal emotion recognition with temporal-band attention based on LSTM-RNN. In: Zeng, B., Huang, Q., El Saddik, A., Li, H., Jiang, S., Fan, X. (eds.) Advances in Multimedia Information Processing-PCM 2017, vol. 10735. Springer, Cham and Switzerland (2018) 151. Koelstra, S., Patras, I.: Fusion of facial expressions and EEG for implicit affective tagging. Image Vis. Comput. 31, 164–174 (2013) 152. Petta, P., Pelachaud, C., Cowie, R.: Emotion-Oriented Systems the Humaine Handbook. Springer, Berlin (2011) 153. Huang, C., Liew, S.S., Lin, G.R., Poulsen, A., Ang, M.J.Y., Chia, B.C.S., Chew, S.Y., Kwek, Z.P., Wee, J.L.K., Ong, E.H., et al.: Discovery of irreversible inhibitors targeting histone methyltransferase, SMYD3. ACS Med. Chem. Lett. 10, 978–984 (2019) 154. Benezeth, Y., Li, P., Macwan, R., Nakamura, K., Yang, F., Benezeth, Y., Li, P., Macwan, R., Nakamura, K., Gomez, R., et al.: Remote heart rate variability for emotional state monitoring. In: Proceedings of the 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Las Vegas, NV, USA, pp. 153–156, 4–7 March 2018
Chapter 10
Activity Recognition for Assisting People with Dementia Muhammad Fikry, Defry Hamdhana, Paula Lago, and Sozo Inoue
Abstract Technology can help and support people with dementia to ensure their safety during daily activities. In this paper, we summarize information about activity recognition for people living with dementia. The paper aims to understand the uses and types of applications, the types of sensors/systems, the methods, and the data used within the scope of human activity recognition to monitor, detect symptoms of, or help people with dementia. To this end, 447 abstracts were collected from the Scopus database, which yielded 127 relevant papers, of which 102 were considered in detail based on four assessment categories (application, system/sensors, methods, and data). The review shows a trend in which smart environment technology is most widely used for monitoring people with dementia, with machine learning techniques serving as the activity recognition method when testing or implementing the systems. We conclude that combining sensor devices with a smartphone in one system is suitable for implementation, because the smartphone can serve as an identity that distinguishes the monitored person from others. During monitoring, prevention can be achieved at the same time by adding a warning alarm on the smartphone when a person with dementia performs an abnormal activity, and the results need further analysis to derive the best activity patterns for people with dementia. Finally, the type of application initially designed for monitoring can be developed into an assistant for people with dementia.
M. Fikry (B) · D. Hamdhana · P. Lago · S. Inoue Kyushu Institute of Technology, 2-4 Hibikino Wakamatsu, Kitakyushu, Fukuoka 808-0196, Japan e-mail: [email protected] D. Hamdhana e-mail: [email protected] P. Lago e-mail: [email protected] S. Inoue e-mail: [email protected] © Springer Nature Switzerland AG 2021 M. A. R. Ahad et al. (eds.), Contactless Human Activity Analysis, Intelligent Systems Reference Library 200, https://doi.org/10.1007/978-3-030-68590-4_10
10.1 Introduction

Advances in technology have led to the development of many new devices and have brought the power of computing into daily life. This also transforms how the community interacts with computer science: increasingly sophisticated systems are being designed so that people do not need to be computer specialists to benefit from computing resources. Systems in this area are called ambient intelligence (AMI); at their core, they aim to make computational applications available to society without interfering with people's routines and by minimizing direct interaction [13].

Meanwhile, Alzheimer's disease and related dementias (ADRD) affect around 50 million people worldwide. This number is growing as populations age globally and is expected to reach 152 million by 2050 [120]. Although many dementia diagnostic tools are available, 62% of ADRD cases worldwide remain undiagnosed, and 91% of cases are diagnosed very late [95]. Missed or delayed diagnosis increases the socioeconomic burden on the health care system through large expenditures on unnecessary investigations and symptom-driven treatments, and through the lack of family and caregiver counseling [58]. Ubiquitous computing and intelligent data analysis can provide innovative methods and tools for quickly detecting early symptoms of cognitive impairment and for monitoring its evolution [99].

In this paper, we review activity recognition-based solutions that aim to address the challenges of dementia. We perform a systematic review to identify the main solutions proposed for dementia and classify them by application, system, sensors, and methods used (Sect. 10.2). After a summary of dementia in Sect. 10.3, included for completeness, we analyze the main technological solutions in Sect. 10.4. Our review finds that monitoring is the most common approach to supporting people with dementia in their daily lives, whereas research on other applications, such as early symptom detection, identification, and prediction, is lacking. In terms of technology, only a few datasets exist for this problem, and the reviewed studies typically collect small datasets, which makes further development difficult. Finally, we conclude that improved monitoring is required to prevent people with dementia from performing activities outside the recommended limits. Research in this challenging domain is an important future direction.
10.2 Review Method

The purpose of this review is to understand the uses and types of applications, the kinds of sensors/systems, the methods, and the data used within the scope of human activity recognition to monitor, detect symptoms of, or help people with dementia.
Table 10.1 Keywords of search for activity recognition areas
Keywords: activity recognition AND
• Information systems
• Cognitive impairment
• Physically active
• Talking
• Assessment methods
• Prediction methods
• Multilayer perceptron
• LSVM
• ICC
• Equal error rate
• Linear regression
• Radio frequency
• Support vector regression
• Radial basis function
• K-nearest neighbors
• Mean absolute error
• Naïve Bayes
• Hidden semi-Markov model
• Conditional random fields
• Long short-term memory
• Graph convolutional network
• Prevention
We follow a systematic review process [98] to analyze the papers. All available papers are evaluated against predetermined criteria, and the results are then classified by relevance. The population of the systematic review consists of research papers related to human activities and dementia. We conducted the search in the Scopus database (https://www.scopus.com), looking for papers related to the areas of "human activity recognition" and "dementia". We designed a list of keywords for each area (Tables 10.1 and 10.2) by consulting dementia experts and activity recognition experts; uninformative keywords were removed. Table 10.1 summarizes the keywords used for activity recognition, from which the reader can see several methods and algorithms commonly used in this field. Similarly, Table 10.2 collects the keywords used in the search for dementia before the keywords of the two areas are combined to narrow the search; it indicates the scope within which dementia is commonly discussed (a sketch of how such paired queries can be assembled follows Table 10.2). In the activity recognition area, the most frequent keyword combination is activity recognition AND memory with 281 articles, followed by activity recognition AND assessment with 240 articles and activity recognition AND language with 160 articles. In the dementia area, the most frequent combination is dementia AND prediction with 193 articles, followed by dementia AND drug with 156 articles and dementia AND support vector machine with 160 articles. The selected papers are then classified into four main subjects: application types, sensor/system types, methods, and datasets. The types of applications are divided into eight categories, namely, algorithm comparison, analysis, diagnosis, guidance, identification, monitoring, prediction, and prevention (Fig. 10.1). From this classification, we gain a general view of the development of dementia research.
Table 10.2 Keywords of search for dementia areas
Keywords: dementia AND
• Behavioral and psychological symptoms of dementia
• Mini-mental state examination
• Montreal cognitive assessment
• Assessment
• Eating
• Language
• Memory
• Prediction
• Prevention
• Acupuncture
• Alcohol
• APOE
• Apolipoprotein E
• Multilayer perceptron
• Behavior recognition
• Music therapy
• Diet
• Smoking
• Drug
• Herbal
• Lewy bodies
• Light therapy
• Massage
• Mathematical
• Mentally
• Aromatherapy
• Age
• Genetic
• smoteBOOST
• Talking
• Temporal
• Urine
• Vascular
• wRACOG
• Alzheimer
• Frontotemporal
• Lifestyle
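To make the search procedure concrete, the following Python sketch shows one way such paired Boolean queries could be assembled from the keyword lists. It is an illustrative assumption only: the keywords shown are a small subset of Tables 10.1 and 10.2, and the TITLE-ABS-KEY query syntax is used as an example rather than a reproduction of the exact query strings issued in this review.

# Sample keywords drawn from Tables 10.1 and 10.2 (a small subset only).
activity_keywords = ["cognitive impairment", "assessment methods", "prediction methods"]
dementia_keywords = ["prediction", "memory", "drug"]

def area_queries(base_term, keywords):
    # One Boolean fragment per keyword, e.g. "activity recognition" AND "memory".
    return [f'"{base_term}" AND "{kw}"' for kw in keywords]

def combined_queries(keywords_a, keywords_b):
    # Pair the two areas to narrow the search, as described in the text.
    return [
        f"TITLE-ABS-KEY({qa} AND {qb})"
        for qa in area_queries("activity recognition", keywords_a)
        for qb in area_queries("dementia", keywords_b)
    ]

for query in combined_queries(activity_keywords, dementia_keywords)[:3]:
    print(query)

Each resulting query string would then be issued against the database's advanced search, after which duplicates are removed and abstracts are screened against the inclusion criteria.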
Fig. 10.1 Classification based on application type
Among these types of applications, prevention is the ultimate goal; however, before reaching that level, the earlier stages must be addressed. Figure 10.2 visualizes the types of sensors/systems used in the reviewed research, divided into ten categories: audio, design, electroencephalogram (EEG), framework, MRI, sensors, intelligent environment, speech, video, and wearable. Patients with dementia are easily disturbed by anything unusual in their daily lives, so selecting an appropriate sensor system helps them go through their daily routine without feeling disturbed. Figure 10.3 visualizes the methods/algorithms used for processing the data, including support vector machine (SVM), Naive Bayes, K-nearest neighbors, hidden Markov model (HMM), Bayesian network, decision tree (C4.5), and random forest (RF); selecting the right algorithm helps achieve high predictive accuracy.
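As a generic illustration of how such classifiers are typically applied (a minimal sketch, not taken from any reviewed system), the following Python code compares several of the listed algorithms on synthetic windowed-feature data using scikit-learn. The feature dimensions and activity labels are invented placeholders.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for windowed sensor features: 300 windows x 12 features,
# 4 placeholder activity classes. Real studies would extract these features
# from accelerometer, ambient-sensor, or video streams.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 12))
y = rng.integers(0, 4, size=300)

# HMMs and Bayesian networks need dedicated libraries and are omitted here;
# scikit-learn's DecisionTreeClassifier implements CART rather than C4.5.
models = {
    "SVM": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "Decision tree": DecisionTreeClassifier(),
    "Random forest": RandomForestClassifier(n_estimators=100),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy {scores.mean():.2f}")

With real sensor-derived features in place of the random placeholders, the same cross-validation loop gives a first comparison of candidate methods before a final choice is made.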
Fig. 10.2 Classification by system sensor support
Fig. 10.3 Method/Algorithm support
Fig. 10.4 Dataset of dementia
Finally, the last subject is the dataset used to evaluate activity recognition technology for patients with dementia (Fig. 10.4): these datasets include real data, simulation data, Liara, activities of daily living (ADL) datasets, sensor data, smartphone data, and the Alzheimer's Disease Neuroimaging Initiative (ADNI).
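Most of these datasets store streams of timestamped sensor events, and a common preprocessing step before applying the classifiers above is to segment the stream into fixed-length windows. The sketch below is a hypothetical example: the sensor names, timestamps, and 60-second window length are invented for illustration and are not taken from any of the reviewed datasets.

from collections import Counter

# Toy stream of timestamped ambient-sensor events: (time in seconds, sensor id).
events = [
    (3, "kitchen_motion"), (15, "fridge_door"), (42, "kitchen_motion"),
    (70, "bathroom_motion"), (95, "bathroom_motion"), (130, "bed_pressure"),
]

def window_counts(events, window_len=60):
    """Count sensor activations per fixed-length window of window_len seconds."""
    windows = {}
    for t, sensor in events:
        idx = int(t // window_len)
        windows.setdefault(idx, Counter())[sensor] += 1
    return windows

# Each window's counts can be turned into a feature vector for the classifiers above.
for idx, counts in sorted(window_counts(events).items()):
    print(f"window {idx}: {dict(counts)}")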
10.3 Overview of Dementia

10.3.1 Types of Dementia

Dementia is a chronic brain disorder comprising four main diseases that affect cognition and cause mental degeneration; every subgroup exhibits similar brain deficiencies and mutations [67]. Dementia causes a decline in memory recall and cognitive abilities that affects the patient's lifestyle, their ability to socialize, and their daily activities.
Dementia is not the same as senility. Senility is a decline in the ability to remember and think that usually occurs with age; this change can affect memory, but it is not significant, and its effects and causes differ from person to person depending on their medical history. Alzheimer's disease and vascular dementia are the most common types of dementia. Alzheimer's disease is the most common neurodegenerative cause of dementia, comprising 60–80% of cases; vascular dementia is the second most common form (20%) [34]. Alzheimer's disease is related to genetic changes and protein changes in the brain, whereas vascular dementia arises from disturbances in the blood vessels of the brain. Lewy body dementia (LBD) is a form of dementia caused by abnormal deposits of alpha-synuclein protein (Lewy bodies) inside neurons; it accounts for 5–15% of all dementias [5]. Frontotemporal dementia (FTD) is another type of dementia that affects the frontal and temporal lobes of the brain, thereby affecting behavior, personality, and language function. A brief comparison of these subtypes is provided in Table 10.3.
Table 10.3 Distinguishing features of the subtypes of dementia [80]. Clinical presentation(a) by dementia subtype:

Alzheimer's disease
• Insidious onset and slow progressive decline
• Short-term memory impairment in early stage; deficit on 3-word or 5-word recall; executive dysfunction in later stages

Vascular dementia
• Sudden or gradual onset
• Usually correlated with cerebrovascular disease (stroke or lacunar infarcts) and atherosclerotic comorbidities (diabetes, hypertension, or coronary heart disease)
• Mild memory impairment in early stage
• Possible gait difficulties and falls (depending on the extent of stroke)

Lewy body dementia
• Fluctuating cognition associated with parkinsonism
• Poor executive function and visual hallucinations in early stage; deficits on tests designed to examine visual perception (pentagons, cube, trails, and clock face)

Frontotemporal dementia
• More prominent personality changes (disinhibition) and behavioral disturbances (apathy, aggression, and agitation) with less memory impairment in early stage

(a) Clinical presentation summaries from Muangpaisan W. Clinical differences among 4 common dementia syndromes. Geriatr Aging 2007;10(425):9
Unfortunately, there is currently no medicine that cures dementia [124]. However, some therapies are available to manage dementia symptoms and behaviors:
• Cognitive stimulation therapy aims to improve memory, problem-solving skills, and related abilities through group activities or sports.
• Occupational therapy aims to teach sufferers how to perform their daily activities safely, given their condition, and how to control their emotions as symptoms develop.
• Memory therapy helps patients with dementia recall memories of their life: childhood, school years, work, hobbies, and so on.
• Cognitive rehabilitation aims to train the non-functioning parts of the brain by using the parts that are still healthy.
10.3.2 Causes of Dementia

Dementia is caused by damage to nerve cells and to the connections between them in parts of the brain. Moreover, age is the strongest risk factor for dementia [113]. In 2000, prevalence data from 11 European population-based studies were pooled to obtain stable estimates of dementia prevalence in the elderly (65 years and older) [63]. A small proportion of people with dementia come from families with autosomal dominant mutations. Mutations in several genes have been shown to cause AD, but the genetic form of AD accounts for