Human centric visual analysis with deep learning 9789811323867, 9789811323874



English Pages 160 Year 2020



Liang Lin · Dongyu Zhang · Ping Luo · Wangmeng Zuo

Human Centric Visual Analysis with Deep Learning


Liang Lin School of Data and Computer Science Sun Yat-sen University Guangzhou, Guangdong, China

Dongyu Zhang School of Data and Computer Science Sun Yat-sen University Guangzhou, Guangdong, China

Ping Luo School of Information Engineering The Chinese University of Hong Kong Hong Kong, Hong Kong

Wangmeng Zuo School of Computer Science Harbin Institute of Technology Harbin, China

ISBN 978-981-13-2386-7
ISBN 978-981-13-2387-4 (eBook)
https://doi.org/10.1007/978-981-13-2387-4

© Springer Nature Singapore Pte Ltd. 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Foreword

When Liang asked me to write the foreword to his new book, I was very happy and proud to see the success that he has achieved in recent years. I have known Liang since 2005, when he visited the Department of Statistics of UCLA as a Ph.D. student. Very soon, I was deeply impressed by his enthusiasm and potential in academic research during regular group meetings and his presentations. Since 2010, Liang has been building his own laboratory at Sun Yat-sen University, which is the best university in southern China. I visited him and his research team in the summer of 2010 and spent a wonderful week with them. Over these years, I have witnessed the fantastic success of him and his group, who set an extremely high standard. His work on deep structured learning for visual understanding has built his reputation as a well-established professor in computer vision and machine learning. Specifically, Liang and his team have focused on improving feature representation learning with several interpretable and context-sensitive models and applied them to many computer vision tasks, which is also the focus of this book. On the other hand, he has a particular interest in developing new models, algorithms, and systems for intelligent human-centric analysis while continuing to focus on a series of classical research tasks such as face identification, pedestrian detection in surveillance, and human segmentation. The performance of human-centric analysis has been significantly improved by recently emerging techniques such as very deep neural networks and new advances in learning and optimization. The research team led by Liang is one of the main contributors in this direction and has received increasing attention from both academia and industry. In sum, Liang and his colleagues did an excellent job with the book, which is the most up-to-date resource you can find and a great introduction to human-centric visual analysis with emerging deep structured learning.
If you need more motivation than that, here is the foreword: In this book, you will find a wide range of research topics in human-centric visual analysis including both classical (e.g., face detection and alignment) and newly rising topics (e.g., fashion clothing parsing), and a series of state-of-the-art solutions addressing these problems. For example, a newly emerging task, human parsing, namely, decomposing a human image into semantic fashion/body regions,


is deeply and comprehensively introduced in this book, and you will find not only the solutions to the real challenges of this problem but also new insights from which more general models or theories for related problems can be derived. To the best of our knowledge, to date, a published systematic tutorial or book targeting this subject is still lacking, and this book will fill that gap. I believe this book will serve the research community in the following aspects: (1) It provides an overview of the current research in human-centric visual analysis and highlights the progress and difficulties. (2) It includes a tutorial in advanced techniques of deep learning, e.g., several types of neural network architectures, optimization methods, and techniques. (3) It systematically discusses the main human-centric analysis tasks on different levels, ranging from face/human detection and segmentation to parsing and other higher level understanding. (4) It provides effective methods and detailed experimental analysis for every task as well as sufficient references and extensive discussions. Furthermore, although the substantial content of this book focuses on human-centric visual analysis, it is also enlightening regarding the development of detection, parsing, recognition, and high-level understanding methods for other AI applications such as robotic perception. Additionally, some new advances in deep learning are mentioned. For example, Liang introduces the Kalman normalization method, which was invented by Liang and his students, for improving and accelerating the training of DNNs, particularly in the context of microbatches. I believe this book will be very helpful and important to academic professors/students as well as industrial engineers working in the field of vision surveillance, biometrics, and human–computer interaction, where human-centric visual analysis is indispensable in analyzing human identity, pose, attributes, and behaviors. 
Briefly, this book will not only equip you with the skills to solve the application problems but will also give you a front-row seat to the development of artificial intelligence. Enjoy! Alan Yuille Bloomberg Distinguished Professor of Cognitive Science and Computer Science Johns Hopkins University, Baltimore, Maryland, USA

Preface

Human-centric visual analysis is regarded as one of the most fundamental problems in computer vision, which augments human images in a variety of application fields. Developing solutions for comprehensive human-centric visual applications could have crucial impacts in many industrial application domains such as virtual reality, human–computer interaction, and advanced robotic perception. For example, clothing virtual try-on simulation systems that seamlessly fit various clothes to the human body shape have attracted much commercial interest. In addition, human motion synthesis and prediction can bridge virtual and real worlds, facilitating more intelligent robotic–human interactions by enabling causal inferences for human activities. Research on human-centric visual analysis is quite challenging. Nevertheless, through the continuous efforts of academic and industrial researchers, continuous progress has been achieved in this field in recent decades. Recently, deep learning methods have been widely applied to computer vision. The success of deep learning methods can be partly attributed to the emergence of big data, newly proposed network models, and optimization methods. With the development of deep learning, considerable progress has also been achieved in different subtasks of human-centric visual analysis. For example, in facial recognition, the accuracy of the deep model-based method has exceeded the accuracy of humans. Other accurate face detection methods are also based on deep learning models. This progress has spawned many interesting and practical applications, such as face ID in smartphones, which can identify individual users and detect fraudulent authentication based on faces. In this book, we will provide an in-depth summary of recent progress in human-centric visual analysis based on deep learning methods. The book is organized into five parts. In the first part, Chap. 
1 first provides the background of deep learning methods including a short review of the development of artificial neural networks and the backpropagation method to give the reader a better understanding of certain deep learning concepts. We also introduce a new technique for the training of deep neural networks. Subsequently, in Chap. 2, we provide an overview of the tasks and the current progress of human-centric visual analysis.


In the second part, we introduce tasks related to how to localize a person in an image. Specifically, we focus on face detection and pedestrian detection. In Chap. 3, we introduce the facial landmark localization method based on a cascaded fully convolutional network. The proposed method first generates low-resolution response maps to identify approximate landmark locations and then produces fine-grained response maps over local regions for more accurate landmark localization. We then introduce the attention-aware facial hallucination method, which generates a high-resolution facial image from a low-resolution image. This method recurrently discovers facial parts and enhances them by fully exploiting the global interdependency of facial images. In Chap. 4, we introduce a deep learning model for pedestrian detection based on region proposal networks and boosted forests. In the third part, several representative human parsing methods are described. In Chap. 5, we first introduce a new benchmark for the human parsing task, followed by a self-supervised structure-sensitive learning method for human parsing. In Chaps. 6–7, instance-level human parsing and video instance-level human parsing methods are introduced. In the fourth part, person verification and face verification are introduced. In Chap. 8, we describe a cross-modal deep model for person verification. The model accepts different input modalities and produces prediction. In Chap. 9, we introduce a deep learning model for face recognition by exploiting unlabeled data based on active learning. The last part describes a high-level task and discusses the progress of human activity recognition. The book is based on our years of research on human-centric visual analysis. Since 2010, with grant support from the National Natural Science Foundation of China (NSFC), we have developed our research plan. Since then, an increasing number of studies have been conducted in this area. 
We would like to express our gratitude to our colleagues and Ph.D. students, i.e., Prof. Xiaodan Liang, Prof. Guanbin Li, Dr. Pengxu Wei, Dr. Keze Wang, Dr. Tianshui Chen, Dr. Qingxing Cao, Dr. Guangrun Wang, Dr. Lingbo Liu, and Dr. Ziliang Chen, for their contributions to the research achievements on this topic. It has been our great honor to work with them on this inspiring topic in recent years. Guangzhou, China

Liang Lin

Contents

Part I Motivation and Overview

1 The Foundation and Advances of Deep Learning
  1.1 Neural Networks
    1.1.1 Perceptron
    1.1.2 Multilayer Perceptron
    1.1.3 Formulation of Neural Network
  1.2 New Techniques in Deep Learning
    1.2.1 Batch Normalization
    1.2.2 Batch Kalman Normalization
  References

2 Human-Centric Visual Analysis: Tasks and Progress
  2.1 Face Detection
  2.2 Facial Landmark Localization
    2.2.1 Conventional Approaches
    2.2.2 Deep-Learning-Based Models
  2.3 Pedestrian Detection
    2.3.1 Benchmarks for Pedestrian Detection
    2.3.2 Pedestrian Detection Methods
  2.4 Human Segmentation and Clothes Parsing
  References

Part II Localizing Persons in Images

3 Face Localization and Enhancement
  3.1 Facial Landmark Machines
  3.2 The Cascaded BB-FCN Architecture
    3.2.1 Backbone Network
    3.2.2 Branch Network
    3.2.3 Ground Truth Heat Map Generation
  3.3 Experimental Results
    3.3.1 Datasets
    3.3.2 Evaluation Metric
    3.3.3 Performance Evaluation for Unconstrained Settings
    3.3.4 Comparison with the State of the Art
  3.4 Attention-Aware Face Hallucination
    3.4.1 The Framework of Attention-Aware Face Hallucination
    3.4.2 Recurrent Policy Network
    3.4.3 Local Enhancement Network
    3.4.4 Deep Reinforcement Learning
    3.4.5 Experiments
  References

4 Pedestrian Detection with RPN and Boosted Forest
  4.1 Introduction
  4.2 Approach
    4.2.1 Region Proposal Network for Pedestrian Detection
    4.2.2 Feature Extraction
    4.2.3 Boosted Forest
  4.3 Experiments and Analysis
  References

Part III Parsing Person in Detail

5 Self-supervised Structure-Sensitive Learning for Human Parsing
  5.1 Introduction
  5.2 Look into Person Benchmark
  5.3 Self-supervised Structure-Sensitive Learning
    5.3.1 Self-supervised Structure-Sensitive Loss
    5.3.2 Experimental Result
  References

6 Instance-Level Human Parsing
  6.1 Introduction
  6.2 Related Work
  6.3 Crowd Instance-Level Human Parsing Dataset
    6.3.1 Image Annotation
    6.3.2 Dataset Statistics
  6.4 Part Grouping Network
    6.4.1 PGN Architecture
    6.4.2 Instance Partition Process
  6.5 Experiments
    6.5.1 Experimental Settings
    6.5.2 PASCAL-Person-Part Dataset
    6.5.3 CIHP Dataset
    6.5.4 Qualitative Results
  References

7 Video Instance-Level Human Parsing
  7.1 Introduction
  7.2 Video Instance-Level Parsing Dataset
    7.2.1 Data Amount and Quality
    7.2.2 Dataset Statistics
  7.3 Adaptive Temporal Encoding Network
    7.3.1 Flow-Guided Feature Propagation
    7.3.2 Parsing R-CNN
    7.3.3 Training and Inference
  References

Part IV Identifying and Verifying Persons

8 Person Verification
  8.1 Introduction
  8.2 Generalized Similarity Measures
    8.2.1 Model Formulation
    8.2.2 Connection with Existing Models
  8.3 Joint Similarity and Feature Learning
    8.3.1 Deep Architecture
    8.3.2 Model Training
  8.4 Experiments
  References

9 Face Verification
  9.1 Introduction
  9.2 Related Work
  9.3 Framework Overview
  9.4 Formulation and Optimization
  References

Part V Higher Level Tasks

10 Human Activity Understanding
  10.1 Introduction
  10.2 Deep Structured Model
    10.2.1 Spatiotemporal CNNs
    10.2.2 Latent Temporal Structure
    10.2.3 Deep Model with Relaxed Radius-Margin Bound
  10.3 Implementation
    10.3.1 Latent Temporal Structure
    10.3.2 Architecture of Deep Neural Networks
  10.4 Learning Algorithm
    10.4.1 Joint Component Learning
    10.4.2 Model Pretraining
    10.4.3 Inference
  10.5 Experiments
    10.5.1 Datasets and Setting
    10.5.2 Empirical Analysis
  References

Part I

Motivation and Overview

Human-centric visual analysis is important in the field of computer vision. Typical applications include face recognition, person reidentification, and pose estimation. Developing solutions to comprehensive human-centric visual applications has benefited many industrial application domains such as smart surveillance, virtual reality, and human–computer interaction. Most problems of human-centric visual analysis are quite challenging. For example, person reidentification involves identifying the same person in images/videos captured by different cameras based on features such as clothing and body motions. However, due to the large variations in human appearance, significant background clutter, irrelevant motions, and scale and illumination changes, it is difficult to accurately reidentify the same person among thousands of images. For face recognition in the wild, illumination variation and occlusion can significantly reduce recognition accuracy. Similar problems exist in other human-centric visual analysis tasks. On the other hand, the past decade has witnessed the rapid development of feature representation learning, especially deep neural networks, which greatly enhanced the already rapidly developing field of computer vision. The emerging deep models that are trained on large-scale databases have effectively improved system performance in practical applications. The most famous example is AlphaGo, which defeated a professional human Go player in 2015 and a world champion in 2016. In this book, we introduce the recent development of deep neural networks in several human-centric visual analysis problems. The book is divided into five parts. In the first part, we briefly review the foundation of the deep neural network and introduce some recently developed techniques. We also provide an overview of human-centric visual analysis problems in this part.
Then, from Part II to Part V, we introduce our work on typical human-centric visual analysis problems including face detection, face recognition and verification, pedestrian detection, pedestrian recognition, human parsing, and action recognition.

Chapter 1

The Foundation and Advances of Deep Learning

Abstract The past decade has witnessed the rapid development of feature representation learning, especially deep learning. Deep learning methods have achieved great success in many applications, including computer vision and natural language processing. In this chapter, we present a short review of the foundation of deep learning, i.e., the artificial neural network, and introduce some new techniques in deep learning.

1.1 Neural Networks

Neural networks, the foundation of deep learning models, are biologically inspired systems that are intended to simulate the way in which the human brain processes information. The human brain consists of a large number of neurons that are highly connected by synapses. The arrangement of neurons and the strengths of the individual synapses, determined by a complex chemical process, establish the function of the neural network of the human brain. Neural networks are excellent tools for finding patterns that are far too complex or numerous for a human programmer to extract and teach the machine to recognize. The beginning of neural networks can be traced to the 1940s, when the single perceptron neuron was proposed, and only over the past several decades have neural networks become a major part of artificial intelligence. This is due to the development of backpropagation, which allows multilayer perceptron neural networks to adjust the weights of neurons in situations where the outcome does not match what the creator is hoping for. In the following, we briefly review the background of neural networks, including the perceptron, multilayer perceptron, and the backpropagation algorithm.

© Springer Nature Singapore Pte Ltd. 2020 L. Lin et al., Human Centric Visual Analysis with Deep Learning, https://doi.org/10.1007/978-981-13-2387-4_1

1.1.1 Perceptron

The perceptron occupies a special place in the historical development of neural networks. Because different inputs are not equally important, the perceptron assigns a weight $w_j$ to each input $x_j$ to account for the difference. It sums the weighted inputs and produces a single binary output with its activation function, $f(x)$, which is defined as

$$f(x) = \begin{cases} 1, & \text{if } \sum_j w_j x_j + b > 0 \\ 0, & \text{otherwise,} \end{cases} \tag{1.1}$$

where $w_j$ is the weight of the $j$th input and $b$ is the bias, which shifts the decision boundary away from the origin. The perceptron with one output can only be used for binary classification problems. As with most other techniques for training linear classifiers, the perceptron naturally generalizes to multiclass classification. Here, the input $x$ and the output $y$ are drawn from arbitrary sets. A feature representation function $f(x, y)$ maps each possible input/output pair to a finite-dimensional real-valued feature vector. The feature vector is multiplied by a weight vector $w$, but the resulting score is now used to choose among many possible outputs:

$$\hat{y} = \arg\max_{y} f(x, y) \cdot w. \tag{1.2}$$

Perceptron neurons are a type of linear classifier. If the dataset is linearly separable, then the perceptron network is guaranteed to converge. Furthermore, there is an upper bound on the number of times that the perceptron will adjust its weights during training. Suppose that the input vectors from the two classes can be separated by a hyperplane with a margin $\gamma$, and let $R$ denote the maximum norm of an input vector. Then the number of weight updates is bounded by $O(R^2/\gamma^2)$.
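The decision rule in Eq. (1.1) and the convergence claim can be illustrated with a minimal sketch. The update rule used here is the classic perceptron learning rule, which the text above does not spell out, and the toy dataset (the logical AND function), the function names, and `max_epochs` are assumptions made for illustration only.

```python
import numpy as np

def predict(w, b, X):
    """Eq. (1.1): output 1 when w . x + b > 0, otherwise 0."""
    return (X @ w + b > 0).astype(int)

def train_perceptron(X, y, max_epochs=100):
    """Classic perceptron rule (assumed here); returns w, b, and the update count."""
    w = np.zeros(X.shape[1])
    b = 0.0
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            err = yi - int(xi @ w + b > 0)  # -1, 0, or +1
            if err != 0:
                w += err * xi               # move the boundary toward the mistake
                b += err
                updates += 1
                mistakes += 1
        if mistakes == 0:                   # a full pass with no errors: converged
            break
    return w, b, updates

# Hypothetical linearly separable toy data: logical AND.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])

w, b, updates = train_perceptron(X, y)
print("weights:", w, "bias:", b, "updates:", updates)
print("predictions:", predict(w, b, X))
```

Because the AND data are linearly separable, the loop stops after a finite number of updates, in line with the bound discussed above.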

1.1.2 Multilayer Perceptron

The multilayer perceptron (MLP) is a class of feedforward artificial neural networks consisting of at least three layers of nodes. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. Its multiple layers and nonlinear activation distinguish MLP from a linear perceptron. MLP has three basic features: (1) Each neuron in the network includes a differentiable nonlinear activation function. (2) The network contains one or more hidden layers in addition to the input and output nodes. (3) The network exhibits a high degree of connectivity. A typical MLP architecture is shown in Fig. 1.1. The network contains one input layer, two hidden layers, and an output layer. The network is fully connected, which means that a neuron in any layer of the network is connected to all the neurons in the previous layer. The first hidden layer is fed from the input layer, and its outputs are in turn applied to the next hidden layer, and this process is repeated for the remainder of the MLP neural network. The sigmoid function is commonly used as the differentiable nonlinear activation function in MLP.

Fig. 1.1 Illustration of a typical neural network (input layer, first hidden layer, second hidden layer, output layer)

The activation function of sigmoid neurons is defined as

$$f(x) = \frac{1}{1 + \exp\{-(w \cdot x + b)\}}, \tag{1.3}$$

where $x$ is the input vector and $w$ is the weight vector. With the sigmoid function, the output of the neuron is no longer just the binary value 1 or 0. In general, the sigmoid function is real-valued, monotonic, smooth, and differentiable, with a nonnegative, bell-shaped first derivative. The smoothness of the sigmoid function means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta\text{output}$ in the output, which is well approximated by

$$\Delta\text{output} \approx \sum_j \frac{\partial\,\text{output}}{\partial w_j}\,\Delta w_j + \frac{\partial\,\text{output}}{\partial b}\,\Delta b, \tag{1.4}$$

where the sum is over all weights $w_j$, and $\partial\,\text{output}/\partial w_j$ and $\partial\,\text{output}/\partial b$ denote the partial derivatives of the output with respect to $w_j$ and $b$, respectively. $\Delta\text{output}$ is a linear function of $\Delta w_j$ and $\Delta b$. This linearity makes it easy to choose small changes in the weights and biases that achieve a desired small change in the output, and thus considerably easier to determine how changing the weights and bias will change the output.

Solving the “XOR” problem with an MLP. Linearly separable problems can be solved with a single-layer perceptron. However, if the dataset is not linearly separable, a single perceptron neuron will not converge, even to an approximate solution. For example, Fig. 1.2 shows the typical “XOR” function, which is not linearly separable and cannot be solved by a single-layer perceptron. In this case, we need to use an MLP to solve the problem.

Fig. 1.2 Left: Illustration of the “XOR” problem. Right: A solution of the “XOR” problem with perceptrons
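A two-layer network of threshold units is enough to compute XOR. The following is a minimal sketch; the specific weights and thresholds are one common hand-picked choice and are an assumption here, not necessarily the values drawn in Fig. 1.2.

```python
def step(z):
    """Threshold activation: 1 if the weighted input is nonnegative, else 0."""
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    # Hidden layer (hypothetical weights): one OR-like and one AND-like unit.
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output unit: OR minus a strong penalty for AND gives XOR.
    return step(h_or - 2 * h_and - 0.5)

table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```

The hidden units carve the input plane with two parallel lines, and the output unit selects the band between them, which no single line can do.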

1.1.3 Formulation of Neural Network

To conveniently describe the neural network, we use the following notation. Let $n_l$ denote the total number of layers in the neural network, and let $L_l$ denote the $l$th layer. Thus, $L_1$ and $L_{n_l}$ are the input layer and the output layer, respectively. We use $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)}, \ldots)$ to denote the parameters of the neural network, where $W_{ij}^{(l)}$ denotes the parameter of the connection between unit $j$ in layer $l$ and unit $i$ in layer $l + 1$. Additionally, $b_i^{(l)}$ is the bias associated with unit $i$ in layer $l + 1$. Thus, for a three-layer network with three input units, three hidden units, and one output unit, $W^{(1)} \in \mathbb{R}^{3 \times 3}$ and $W^{(2)} \in \mathbb{R}^{1 \times 3}$. We use $a_i^{(l)}$ to denote the activation of unit $i$ in layer $l$. Given a fixed setting of the parameters $(W, b)$, the neural network defines a hypothesis $h_{W,b}(x)$. Specifically, the computation that this neural network represents is given by

$$\begin{aligned}
a_1^{(2)} &= f\big(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)}\big), \\
a_2^{(2)} &= f\big(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)}\big), \\
a_3^{(2)} &= f\big(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)}\big), \\
h_{W,b}(x) = a_1^{(3)} &= f\big(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}\big).
\end{aligned} \tag{1.5}$$


Let $z_i^{(l)}$ denote the total weighted sum of inputs to unit $i$ in layer $l$, including the bias term (e.g., $z_i^{(2)} = \sum_{j=1}^{n} W_{ij}^{(1)} x_j + b_i^{(1)}$), such that $a_i^{(l)} = f(z_i^{(l)})$. If we extend the activation function $f(\cdot)$ to apply to vectors in an elementwise fashion as $f([z_1, z_2, z_3]) = [f(z_1), f(z_2), f(z_3)]$, then we can write the above equations more compactly as

$$\begin{aligned}
z^{(2)} &= W^{(1)} x + b^{(1)}, \\
a^{(2)} &= f(z^{(2)}), \\
z^{(3)} &= W^{(2)} a^{(2)} + b^{(2)}, \\
h_{W,b}(x) &= a^{(3)} = f(z^{(3)}).
\end{aligned} \tag{1.6}$$

We call this step forward propagation. More generally, recalling that we use $a^{(1)} = x$ to denote the values from the input layer, given layer $l$'s activations $a^{(l)}$, we can compute layer $(l + 1)$'s activations $a^{(l+1)}$ as

$$\begin{aligned}
z^{(l+1)} &= W^{(l)} a^{(l)} + b^{(l)}, \\
a^{(l+1)} &= f\big(z^{(l+1)}\big).
\end{aligned} \tag{1.7}$$
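Forward propagation as described by Eq. (1.7) can be sketched in a few lines of plain Python. The function names and the parameter values of the 3-3-1 network below are hypothetical and chosen only for illustration; the sigmoid is assumed as the activation $f$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases, f=sigmoid):
    """Apply z = W a + b, a = f(z) layer by layer (Eq. 1.7).

    weights[l] is a list of rows; row i holds the weights into unit i
    of the next layer. biases[l] is the matching list of biases.
    """
    a = list(x)  # a^(1) = x
    for W, b in zip(weights, biases):
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a)) + b_i
             for row, b_i in zip(W, b)]
        a = [f(z_i) for z_i in z]  # elementwise activation
    return a

# Hypothetical parameters for a network with 3 inputs, 3 hidden units, 1 output.
W1 = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2], [0.5, 0.5, -0.5]]
b1 = [0.0, 0.1, -0.1]
W2 = [[1.0, -1.0, 0.5]]
b2 = [0.2]

h = forward([1.0, 0.0, -1.0], [W1, W2], [b1, b2])
```

Storing each $W^{(l)}$ as a list of rows mirrors the index convention $W_{ij}^{(l)}$: row $i$ collects the weights feeding unit $i$ of layer $l + 1$.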

1.2 New Techniques in Deep Learning

Compared with the traditional MLP, modern neural networks are generally deeper, and they are more difficult to optimize by backpropagation. Thus, many new techniques have been proposed to ease network training, such as batch normalization (BN) and batch Kalman normalization (BKN) [1].

1.2.1 Batch Normalization

BN is a technique for improving the performance and stability of neural networks. It was introduced by Ioffe and Szegedy in 2015 [2]. Rather than just normalizing the inputs to the network, BN normalizes the inputs to layers within the network. The benefits of BN are as follows:

• Networks are trained faster: Although each training iteration is slower because of the extra normalization calculations during the forward pass and the additional parameters to train during backpropagation, training should converge in far fewer iterations; thus, it should be faster overall.
• Higher learning rates: Gradient descent generally requires small learning rates for the network to converge. As networks become deeper, gradients become smaller during backpropagation and thus require even more iterations. Using BN allows much higher learning rates, thereby increasing the speed at which networks train.
• Easier to initialize: Weight initialization can be difficult, particularly when creating deeper networks. BN helps reduce the sensitivity to the initial weights.

Rather than whitening the features in layer inputs and outputs jointly, BN normalizes each scalar feature independently so that it has a mean of zero and a variance of 1. For a layer with $d$-dimensional input $x = \{x^{(1)}, \ldots, x^{(d)}\}$, BN normalizes each dimension as

$$\hat{x}^{(k)} = \frac{x^{(k)} - E[x^{(k)}]}{\sqrt{\mathrm{Var}[x^{(k)}]}}, \tag{1.8}$$

where the expectation and variance are computed over the training dataset. Then, for each activation $x^{(k)}$, a pair of parameters $\gamma^{(k)}, \beta^{(k)}$ is introduced to scale and shift the normalized value as

$$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}. \tag{1.9}$$

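Algorithm 1.1: Training and Inference with Batch Normalization

Input: Values of $x$ over a minibatch: $B = \{x_{1 \ldots m}\}$; parameters to be learned: $\gamma$, $\beta$
Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$

$$u_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i$$

$$\sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - u_B)^2$$

$$\hat{x}_i \leftarrow \frac{x_i - u_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

$$y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$$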

Consider a minibatch $B$ of size $m$. Since the normalization is applied to each activation independently, let us focus on a particular activation $x^{(k)}$ and omit $k$ for clarity. We have $m$ values of this activation in the minibatch:

$$B = \{x_{1 \ldots m}\}. \tag{1.10}$$

Let the normalized values be $\hat{x}_{1 \ldots m}$, and let their linear transformations be $y_{1 \ldots m}$. We refer to the transform

$$\mathrm{BN}_{\gamma,\beta} : x_{1 \ldots m} \rightarrow y_{1 \ldots m} \tag{1.11}$$

as the BN transform. We present the BN transform in Algorithm 1.1. In this algorithm, $\epsilon$ is a constant added to the minibatch variance for numerical stability.
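The training-time BN transform for a single activation can be sketched as follows. This is a minimal illustration assuming scalar $\gamma$ and $\beta$ and a plain Python list as the minibatch; the function name `batch_norm` and the default `eps` value are choices made here, not prescribed by the text.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """BN transform of one activation over a minibatch (training mode)."""
    m = len(xs)
    mean = sum(xs) / m                               # minibatch mean
    var = sum((x - mean) ** 2 for x in xs) / m       # biased minibatch variance
    x_hat = [(x - mean) / math.sqrt(var + eps) for x in xs]  # normalize
    return [gamma * xh + beta for xh in x_hat]       # scale and shift

ys = batch_norm([1.0, 2.0, 3.0, 4.0], gamma=2.0, beta=0.5)
```

After the transform, the outputs have mean $\beta$ and variance close to $\gamma^2$; the small gap comes from $\epsilon$ in the denominator.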


1.2.2 Batch Kalman Normalization

Although the significance of BN has been demonstrated in many previous works, its drawback cannot be neglected: its effectiveness diminishes when the minibatch used in training is small. Consider a DNN that consists of a number of layers from bottom to top. In traditional BN, the normalization step seeks to eliminate the change in the distributions of the internal layers by reducing their internal covariate shift. Prior to normalizing the distribution of a layer, BN first estimates its statistics, including the means and variances. However, unlike the input data of the bottom layer, the statistics of the internal layers cannot be pre-estimated on the training set because the representations of the internal layers keep changing after the network parameters have been updated in each training step. Hence, BN handles this issue with the following schemes. (i) During model training, it approximates the population statistics by using the batch sample statistics in a minibatch. (ii) It maintains moving averages of the statistics over the training iterations and employs them during inference.

However, BN is limited by the memory capacity of computing platforms (e.g., GPUs), especially when the network size and image size are large. In this case, the minibatch is too small to approximate the statistics well, so the estimates are biased and noisy. Additionally, the errors are amplified as the network becomes deeper, degrading the quality of the trained model. Negative effects also exist in inference, where the normalization is applied to each testing sample. Furthermore, in the BN mechanism, the distribution of a certain layer can vary along with the training iterations, which limits the stability of the convergence of the model. Recently, an extension of BN, called batch renormalization (BRN) [2], has been proposed to improve the performance of BN when the minibatch size is small.

Batch Kalman normalization (BKN) advances the existing solutions by achieving a more accurate estimation of the statistics (means and variances) of the internal representations in DNNs. BN and BRN estimate the statistics by only measuring the minibatches within a certain layer, i.e., they consider each layer in the network as an isolated subsystem. In contrast, BKN shows that the estimated statistics have strong correlations among the sequential layers. Moreover, the estimations can be more accurate when the preceding layers in the network are considered jointly, as illustrated in Fig. 1.3b. By analogy, the proposed estimation method shares merits with the Kalman filtering process [3]. BKN performs two steps in an iterative manner. In the first step, BKN estimates the statistics of the current layer conditioned on the estimations of the previous layer. In the second step, these estimations are combined with the observed batch sample means and variances calculated within a minibatch.

These two steps are efficient in BKN. Updating the current estimation by previous states brings negligible extra computational cost compared to traditional BN. For example, in recent advanced deep architectures such as residual networks, the feature representations have at most 2048 dimensions (channels), and the extra cost is a matrix-vector product that transforms a state vector (representing the means and variances) with at most 2048 dimensions into a new state vector, which is then combined with the current observations (Fig. 1.4).

Fig. 1.3 (a) Distribution estimation in conventional batch normalization (BN), where the minibatch statistics, $\mu^k$ and $\Sigma^k$, are estimated based on the currently observed minibatch at the $k$th layer. For clarity of notation, $\mu^k$ and $\Sigma^k$ indicate the mean and the covariance matrix, respectively. Note that only the diagonal entries are used in normalization. $X$ and $\hat{X}$ represent the internal representation before and after normalization. (b) Batch Kalman normalization (BKN) provides a more accurate distribution estimation of the $k$th layer by aggregating the statistics of the preceding $(k-1)$th layer

Fig. 1.4 Illustration of the proposed batch Kalman normalization (BKN). At the $(k-1)$th layer of a DNN, BKN first estimates its statistics (means and covariances), $\hat{\mu}^{k-1|k-1}$ and $\hat{\Sigma}^{k-1|k-1}$. Additionally, the estimations in the $k$th layer are based on the estimations of the $(k-1)$th layer, where these estimations are updated by combining with the observed statistics of the $k$th layer. This process treats the entire DNN as a whole system, in contrast to existing works that estimate the statistics of each hidden layer independently

1.2.2.1 Batch Kalman Normalization Method

Let $x^k$ be the feature vector of a hidden neuron in the $k$th hidden layer of a DNN, such as a pixel in a hidden convolutional layer of a CNN. BN normalizes the values of $x^k$ by using a minibatch of $m$ samples, $B = \{x_1^k, x_2^k, \ldots, x_m^k\}$. The mean and covariance of $x^k$ are approximated by

$$S^k \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i^k - \bar{x}^k)(x_i^k - \bar{x}^k)^T \tag{1.12}$$

and

$$\bar{x}^k \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i^k. \tag{1.13}$$

We have $\hat{x}^k \leftarrow \frac{x_i^k - \bar{x}^k}{\sqrt{\mathrm{diag}(S^k)}}$, where $\mathrm{diag}(\cdot)$ denotes the diagonal entries of a matrix, i.e., the variances of $x^k$. Then, the normalized representation is scaled and shifted to preserve the modeling capacity of the network, $y^k \leftarrow \gamma \hat{x}^k + \beta$, where $\gamma$ and $\beta$ are parameters that are optimized during training. However, a minibatch with a moderately large size is required to estimate the statistics in BN. It is compelling to explore better estimations of the distribution in a DNN to accelerate training.

Assume that the true values of the hidden neurons in the $k$th layer can be represented by the variable $x^k$, which is approximated by using the values in the previous layer $x^{k-1}$. We have

$$x^k = A^k x^{k-1} + u^k, \tag{1.14}$$

where $A^k$ is a state transition matrix that transforms the states (features) in the previous layer to the current layer. Additionally, $u^k$ is a bias that follows a Gaussian distribution with zero mean and unit variance. Note that $A^k$ could be a linear transition between layers. This is reasonable because our purpose is not to accurately compute the hidden features in a certain layer given those in the previous layer but rather to draw a connection between layers to estimate the statistics. As the above true values of $x^k$ exist but are not directly accessible, they can be measured by the observation $z^k$ with a bias term $v^k$:

$$z^k = x^k + v^k, \tag{1.15}$$

where $z^k$ indicates the observed values of the features in a minibatch. In other words, to estimate the statistics of $x^k$, previous studies only consider the observed value of $z^k$ in a minibatch. BKN also takes the features in the previous layer into account. To this end, we compute the expectation on both sides of Eq. (1.14), i.e., $E[x^k] = E[A^k x^{k-1} + u^k]$, and have

$$\hat{\mu}^{k|k-1} = A^k \hat{\mu}^{k-1|k-1}, \tag{1.16}$$

where $\hat{\mu}^{k-1|k-1}$ denotes the estimation of the mean in the $(k-1)$th layer, and $\hat{\mu}^{k|k-1}$ is the estimation of the mean in the $k$th layer conditioned on the previous layer. We call $\hat{\mu}^{k|k-1}$ an intermediate estimation of layer $k$ because it is then combined with the observed values to achieve the final estimation. As shown in Eq. (1.17), the estimation in the current layer $\hat{\mu}^{k|k}$ is computed by combining the intermediate estimation with a bias term, which represents the error between the observed values $z^k$ and $\hat{\mu}^{k|k-1}$. Here, $z^k$ indicates the observed mean values, and we have $z^k = \bar{x}^k$. Additionally, $q^k$ is a gain value indicating how much we rely on this bias.

$$\hat{\mu}^{k|k} = \hat{\mu}^{k|k-1} + q^k (z^k - \hat{\mu}^{k|k-1}). \tag{1.17}$$
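As a short worked step, assuming $p^k = 1 - q^k$ and $z^k = \bar{x}^k$ as used elsewhere in this section, Eq. (1.17) can be rearranged into a convex combination of the prediction from the previous layer and the observed minibatch mean:

```latex
\begin{aligned}
\hat{\mu}^{k|k} &= \hat{\mu}^{k|k-1} + q^k\,(\bar{x}^k - \hat{\mu}^{k|k-1}) \\
                &= (1 - q^k)\,\hat{\mu}^{k|k-1} + q^k\,\bar{x}^k \\
                &= p^k\,\hat{\mu}^{k|k-1} + q^k\,\bar{x}^k .
\end{aligned}
```

With $q^k = 1$ the estimate reduces to the minibatch mean used by BN, and with $q^k = 0$ it ignores the current minibatch entirely.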


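Algorithm 1.2: Training and Inference with Batch Kalman Normalization

Input: Values of feature maps $\{x_{1 \ldots m}\}$ in the $k$th layer; $\hat{\mu}^{k-1|k-1}$, $\hat{\Sigma}^{k-1|k-1}$ in the $(k-1)$th layer; parameters $\gamma$ and $\beta$; moving mean $\mu$ and moving variance $\Sigma$; moving momentum $\alpha$; Kalman gain $q^k$ and transition matrix $A^k$.
Output: $\{y_i^k = \mathrm{BKN}(x_i^k)\}$; updated $\mu$, $\Sigma$; statistics $\hat{\mu}^{k|k}$ and $\hat{\Sigma}^{k|k}$ in the current layer.

Train:

$$\bar{x}^k \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i^k, \qquad S^k \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i^k - \bar{x}^k)(x_i^k - \bar{x}^k)^T$$

$$p^k \leftarrow 1 - q^k, \qquad \hat{\mu}^{k|k-1} \leftarrow A^k \hat{\mu}^{k-1|k-1}, \qquad \hat{\mu}^{k|k} \leftarrow p^k \hat{\mu}^{k|k-1} + q^k \bar{x}^k$$

$$\hat{\Sigma}^{k|k-1} \leftarrow A^k \hat{\Sigma}^{k-1|k-1} (A^k)^T + R$$

$$\hat{\Sigma}^{k|k} \leftarrow p^k \hat{\Sigma}^{k|k-1} + q^k S^k + p^k q^k (\bar{x}^k - \hat{\mu}^{k|k-1})(\bar{x}^k - \hat{\mu}^{k|k-1})^T$$

$$y_i^k \leftarrow \frac{x_i^k - \hat{\mu}^{k|k}}{\sqrt{\mathrm{diag}(\hat{\Sigma}^{k|k})}}\, \gamma^k + \beta^k$$

moving average: $\mu := \mu + \alpha(\mu - \hat{\mu}^{k|k})$, $\Sigma := \Sigma + \alpha(\Sigma - \hat{\Sigma}^{k|k})$

Inference:

$$y_{\text{inference}} \leftarrow \frac{x - \mu}{\sqrt{\mathrm{diag}(\Sigma)}}\, \gamma + \beta$$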

Similarly, the estimations of the covariances can be obtained by calculating $\hat{\Sigma}^{k|k-1} = \mathrm{Cov}(x^k - \hat{\mu}^{k|k-1})$ and $\hat{\Sigma}^{k|k} = \mathrm{Cov}(x^k - \hat{\mu}^{k|k})$, where $\mathrm{Cov}(\cdot)$ denotes the covariance matrix. By introducing $p^k = 1 - q^k$ and $z^k = \bar{x}^k$ and combining the above definitions with Eqs. (1.16) and (1.17), we have the following update rules to estimate the statistics, as shown in Eq. (1.18). Its proof is given in the Appendix.

$$\begin{cases}
\hat{\mu}^{k|k-1} = A^k \hat{\mu}^{k-1|k-1}, \\
\hat{\mu}^{k|k} = p^k \hat{\mu}^{k|k-1} + q^k \bar{x}^k, \\
\hat{\Sigma}^{k|k-1} = A^k \hat{\Sigma}^{k-1|k-1} (A^k)^T + R, \\
\hat{\Sigma}^{k|k} = p^k \hat{\Sigma}^{k|k-1} + q^k S^k + p^k q^k (\bar{x}^k - \hat{\mu}^{k|k-1})(\bar{x}^k - \hat{\mu}^{k|k-1})^T,
\end{cases} \tag{1.18}$$

where $\hat{\Sigma}^{k|k-1}$ and $\hat{\Sigma}^{k|k}$ denote the intermediate and the final estimations of the covariance matrices in the $k$th layer, respectively. $R$ is the covariance matrix of the bias $u^k$ in Eq. (1.14). Note that it is identical for all the layers. $S^k$ is the observed covariance matrix of the minibatch in the $k$th layer. In Eq. (1.18), the transition matrix $A^k$, the covariance matrix $R$, and the gain value $q^k$ are parameters that are optimized during training. In BKN, we employ $\hat{\mu}^{k|k}$ and $\hat{\Sigma}^{k|k}$ to normalize the hidden representation. Please refer to [1] for the details of batch Kalman normalization.
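The update rules of Eq. (1.18) can be illustrated with a scalar sketch. This is a simplification under stated assumptions, not the authors' implementation: a single feature is used, so $A^k$, $R$, and $q^k$ become the plain numbers `a`, `r`, and `q`, which are fixed by hand here instead of being learned; the function name `bkn_update` is hypothetical.

```python
def bkn_update(mu_prev, var_prev, batch, a=1.0, r=0.0, q=0.5):
    """One scalar step of the update rules in Eq. (1.18).

    mu_prev, var_prev: estimates from layer k-1.
    batch: observed values of the feature in layer k.
    a, r, q: scalar stand-ins for A^k, R, and q^k (assumed fixed here).
    """
    m = len(batch)
    x_bar = sum(batch) / m                        # observed minibatch mean
    s = sum((x - x_bar) ** 2 for x in batch) / m  # observed minibatch variance
    p = 1.0 - q
    mu_pred = a * mu_prev                         # intermediate mean
    mu = p * mu_pred + q * x_bar                  # final mean
    var_pred = a * var_prev * a + r               # intermediate variance
    var = p * var_pred + q * s + p * q * (x_bar - mu_pred) ** 2
    return mu, var

mu, var = bkn_update(mu_prev=1.0, var_prev=1.0, batch=[4.0, 6.0])
```

The last term of the variance update is the spread between the two means, as in the variance of a two-component mixture with weights $p^k$ and $q^k$. Setting `q=1.0` recovers the plain minibatch statistics used by BN.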

References

1. G. Wang, J. Peng, P. Luo, X. Wang, L. Lin, Batch Kalman normalization: towards training deep neural networks with micro-batches. arXiv preprint arXiv:1802.03133 (2018)
2. S. Ioffe, C. Szegedy, Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
3. R.E. Kalman et al., A new approach to linear filtering and prediction problems. J. Basic Eng. 82(1), 35–45 (1960)

Chapter 2

Human-Centric Visual Analysis: Tasks and Progress

Abstract The research of human-centric visual analysis has achieved considerable progress in recent years. In this chapter, we briefly review the tasks of human-centric visual analysis, including face detection, facial landmark localization, pedestrian detection, human segmentation, clothes parsing, etc.

© Springer Nature Singapore Pte Ltd. 2020
L. Lin et al., Human Centric Visual Analysis with Deep Learning, https://doi.org/10.1007/978-981-13-2387-4_2

2.1 Face Detection

As a key step toward many subsequent face-related applications, face detection has been extensively studied in the computer vision literature. Early efforts in face detection date back to the beginning of the 1970s, when simple heuristic and anthropometric techniques [1] were used. Prior to 2000, despite progress [2, 3], the practical performance of face detection was far from satisfactory. One genuine breakthrough was the Viola-Jones framework [4], which applied rectangular Haar-like features in a cascaded AdaBoost classifier to achieve real-time face detection. However, this framework has several critical drawbacks. First, its feature set is relatively large. Typically, in a 24 × 24 detection window, the number of Haar-like features is about 160 thousand [5]. Second, this framework cannot effectively handle non-frontal faces in the wild.

Many works have been proposed to address these issues of the Viola-Jones framework and achieve further improvements. First, more complicated features (such as HOG [6], SIFT [7], and SURF [8]) were used. For example, Liao et al. [9] proposed a new image feature called normalized pixel difference (NPD), which is computed as the ratio of the difference to the sum of two pixel values. Second, to detect faces with various poses, some works combined multiple detectors, each of which was trained for a specific view. As a representative work, Zhu et al. [10] applied multiple deformable part models to capture faces with different views and expressions.

Recent years have witnessed advances in face detection using deep learning methods, which significantly outperform traditional computer vision methods. For example, Li et al. [11] proposed a cascade architecture built on CNNs, which can quickly reject the background regions in the fast low-resolution stage and effectively calibrate the bounding boxes of face proposals in the high-resolution stages. Following a similar procedure, Zhang et al. [12] leveraged a cascaded multitask architecture to enhance the face detection performance by exploiting the inherent correlation between detection and alignment. However, these single-scale detectors had to perform multiscale testing on image pyramids, which is time consuming. To reduce the number of image pyramid levels, Hao et al. [13] and Liu et al. [14] proposed efficient CNNs that predict the scale distribution histogram of the faces and guide the zoom-in and zoom-out of the images or features. Recently, many works have adapted the generic object detector Faster R-CNN [15] to perform face detection. For example, Wan et al. [16] bootstrapped Faster R-CNN with hard negative mining and achieved a significant improvement on the representative face detection benchmark FDDB [17]. Despite this progress, these methods generally failed to detect tiny faces in unconstrained conditions. To address this issue, Bai et al. [18] first generated a clear high-resolution face from a blurry small one by adopting a generative adversarial network and then performed the face detection.
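The figure of roughly 160 thousand Haar-like features in a 24 × 24 window can be reproduced by direct counting, and the NPD feature is simple enough to write down. The sketch below makes two assumptions that are not stated in the text: the count covers only the five basic upright templates (two two-rectangle, two three-rectangle, and one four-rectangle shape) at every integer scale and position, and `npd` returns 0 when both pixels are 0 to avoid division by zero.

```python
def count_haar_features(window=24,
                        base_shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2))):
    """Count upright Haar-like features over all scales and positions.

    Each base shape (bw, bh) is scaled to widths bw, 2*bw, ... and
    heights bh, 2*bh, ... that still fit inside the window.
    """
    total = 0
    for bw, bh in base_shapes:
        for w in range(bw, window + 1, bw):
            for h in range(bh, window + 1, bh):
                # Number of top-left positions where a w x h feature fits.
                total += (window - w + 1) * (window - h + 1)
    return total

def npd(x, y):
    """Normalized pixel difference: difference-to-sum ratio of two pixels."""
    if x == 0 and y == 0:
        return 0.0  # assumed convention for the undefined 0/0 case
    return (x - y) / (x + y)

n_features = count_haar_features()
```

For nonnegative pixel values the NPD value stays within $[-1, 1]$ and is unchanged when both pixels are multiplied by the same positive factor.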

2.2 Facial Landmark Localization

2.2.1 Conventional Approaches

Facial landmark localization has long been studied in computer vision, and a large number of approaches have been proposed for this purpose. The conventional approaches for this task can be divided into two categories: template fitting methods and regression-based methods.

Template fitting methods build face templates to fit the input face appearance. A representative work is the active appearance model (AAM) [19], which attempts to estimate model parameters by minimizing the residuals between the holistic appearance and an appearance model. Rather than using holistic representations, a constrained local model (CLM) [20] learns an independent local detector for each facial keypoint and a shape model for capturing valid facial deformations. Improved versions of CLM primarily differ from each other in terms of local detectors. For instance, Belhumeur et al. [21] detected facial landmarks by employing SIFT features and SVM classifiers, and Liang et al. [22] applied AdaBoost to Haar wavelet features. These methods are generally superior to the holistic methods due to the robustness of patch detectors against illumination variations and occlusions.

Regression-based facial landmark localization methods can be further divided into direct mapping techniques and cascaded regression models. The former directly map local or global facial appearances to landmark locations. For example, Dantone et al. [23] estimated the absolute coordinates of facial landmarks directly from an ensemble of conditional regression trees trained on facial appearances. Valstar et al. [24] applied boosted regression to map the appearances of local image patches to the positions of corresponding facial landmarks. Cascaded regression models [25–31] formulate shape estimation as a regression problem and make predictions in a cascaded manner. These models typically start from an initial face shape and iteratively refine the shape according to learned regressors, which map local appearance features to incremental shape adjustments until convergence is achieved. Cao et al. [25] trained a cascaded nonlinear regression model to infer an entire facial shape from an input image using pairwise pixel-difference features. Burgos-Artizzu et al. [32] proposed a novel cascaded regression model for estimating both landmark positions and their occlusions using robust shape-indexed features. Another seminal method is the supervised descent method (SDM) [27], which uses SIFT features extracted around the current shape and minimizes a nonlinear least-squares objective using the learned descent directions. All these methods assume that an initial shape is given in some form, e.g., a mean shape [27, 28]. However, this assumption is too strict and may lead to poor performance on faces with large pose variations.

2.2.2 Deep-Learning-Based Models

Despite their acknowledged successes, all the aforementioned conventional approaches rely on complicated feature engineering and parameter tuning, which limits their performance in cluttered and diverse settings. Recently, CNNs and other deep learning models have been successfully applied to various visual computing tasks, including facial landmark estimation. Zhou et al. [33] proposed a four-level cascaded regression model based on CNNs, which sequentially predicted landmark coordinates. Zhang et al. [34] employed a deep architecture to jointly optimize facial landmark positions with other related tasks, such as pose estimation [35] and facial expression recognition [36]. Zhang et al. [37] proposed a new coarse-to-fine DAE pipeline to progressively refine facial landmark locations. In 2016, they further presented de-corrupt autoencoders to automatically recover the genuine appearance of the occluded facial parts, followed by predicting the occluded facial landmarks [38]. Lai et al. [39] proposed an end-to-end CNN architecture to learn highly discriminative shape-indexed features and then refined the shape using the learned deep features via sequential regressions. Merget et al. [40] integrated the global context in a fully convolutional network based on dilated convolutions for generating robust features for landmark localization. Bulat et al. [41] utilized a facial super-resolution technique to locate the facial landmarks in low-resolution images. Tang et al. [42] proposed quantized densely connected U-Nets to largely improve the information flow, which helps to enhance the accuracy of landmark localization. RNN-based models [43–45] formulate facial landmark detection as a sequential refinement process in an end-to-end manner. Recently, 3D face models [46–50] have also been utilized to accurately locate the landmarks by modeling the structure of facial landmarks. Moreover, many researchers have attempted to adapt unsupervised [51–53] or semisupervised [54] approaches to improve the precision of facial landmark detectors.


2.3 Pedestrian Detection

Pedestrian detection is a subtask of general object detection where pedestrians, rather than all involved objects, are detected in a given image. Since this task is significant for security monitoring, safe self-driving, and other application scenarios, it has been extensively studied over the past years. Due to the diversity of pedestrian gestures, the variety of backgrounds, and other reasons, pedestrian detection can be very challenging. In the following, we list several factors that can affect pedestrian detection.

Diversity of appearance. For instance, rather than standing as still figures, pedestrians can appear with different clothing, gestures, angles of view, and illumination.

Scale variation. Because of their varying distance to the camera, pedestrians appear at different scales in the image. Large-scale pedestrians are relatively easy to detect, while pedestrians at small scales are challenging.

Occlusion. In practical scenarios, pedestrians can be occluded by each other or by buildings, parked cars, trees, or other types of objects on the street.

Backgrounds. Algorithms are confronted with hard negative samples, which are objects that look like pedestrians and can easily be misclassified.

Time and space complexity. Due to the large number of candidate bounding boxes, the methods can be space consuming. Additionally, cascaded approaches are used in some methods, which can be time consuming. However, practical usage scenarios need real-time detection and low memory consumption.

2.3.1 Benchmarks for Pedestrian Detection

INRIA [55] was released in 2005, containing 1805 images of humans cropped from a varied set of personal photos. ETH [56] was collected during strolls through busy shopping streets. Daimler [57] contains pedestrians that are fully visible in an upright position. TUD [58] was developed for many tasks, including pedestrian detection. Positive samples of the training set were collected in a busy pedestrian zone with a handheld camera, including not only upright standing pedestrians but also side standing ones. Negative samples of the training set were collected in an inner city district and also from vehicle driving videos. The test set was collected in the inner city of Brussels from a driving car. All pedestrians are annotated. KITTI [59] was collected by four high-resolution video cameras, and up to 15 cars and 30 pedestrians are visible per image. Caltech [60] is the largest pedestrian dataset to date, comprising 10 h of vehicle driving video in an urban scenario. This dataset includes pedestrians at different scales and positions, and various degrees of occlusion are also included.


2.3.2 Pedestrian Detection Methods

The existing methods can be divided into two categories: handcrafted features followed by a classical classifier, and deep learning methods.

2.3.2.1 Two-Stage Architectures of Pedestrian Detection

Early approaches typically consist of two separate stages: feature extraction and binary classification. Candidate bounding boxes are generated by sliding-window methods. The classic HOG detector [55] uses histograms of oriented gradients as features and a linear support vector machine as the classifier. Following this framework, various feature descriptors and classifiers were proposed. Typical classifiers include nonlinear SVMs and AdaBoost. HIKSVM [61] uses a histogram intersection kernel SVM, which is a nonlinear SVM. RandForest [62] uses a random forest ensemble, rather than an SVM, as the classifier.

Among the feature descriptors, ICF [63] generalized several basic features to multiple channel features computed by linear filters, nonlinear transformations, pointwise transformations, integral histograms, and gradient histograms. Integral images are used to obtain the final features. Features are learned by the boosting algorithm, with decision trees employed as the weak classifiers. SCF [64] inherited the main idea of ICF but revised it: rather than using regular cells as the classic HOG method does, SCF attempts to learn an irregular pattern of cells. The feature pool consists of squares in detection windows. ACF [65] attempted to accelerate pyramid feature learning through the aggregation of channel features. Additionally, it learns by AdaBoost [66], whose base classifiers are deep trees. LDCF [67] proposed a local decorrelation transformation. SpatialPooling [68] was built on ACF [65]. Spatial pooling is used to compute the covariance descriptor and local binary pattern descriptor, enhancing the robustness to noise and transformation. Features are learned by a structural SVM. Reference [69] explored several types of filters, and a checkerboard filter achieved the best performance.

Deformable part models (DPMs) have been widely used for solving the occlusion issue. Reference [70] first proposed deformable part filters, which are placed near the bottom level of the HOG feature pyramid. A multiresolution model was proposed by [71] as a DPM. Reference [72] used DPM for multi-pedestrian detection and showed that DPM can be flexibly combined with other descriptors such as HOG. Reference [73] designed a multitask form of DPM that captures the similarities and differences of samples. DBN-Isol [74] proposed a discriminative deep model for learning the correlations of deformable parts. In [75], a parts model was embedded into a designed deep model.
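Channel-feature detectors such as ICF rely on integral images so that the sum of a channel over any rectangle costs four lookups, regardless of the rectangle size. The following is a minimal sketch of that idea for a single channel stored as a list of rows; the function names and the tiny 3 × 3 example are illustrative assumptions, not code from any of the cited detectors.

```python
def integral_image(img):
    """Return a (H+1) x (W+1) table where ii[r][c] is the sum of all
    pixels in rows < r and columns < c of img."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of the rectangle with the given top-left corner and size,
    computed from four entries of the integral image."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
s = box_sum(ii, top=1, left=1, height=2, width=2)  # covers 5, 6, 8, 9
```

Padding the table with a leading row and column of zeros removes the boundary special cases that would otherwise appear for rectangles touching the top or left edge.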

2.3.2.2

Deep Convolutional Architectures of Pedestrian Detection

Sermanet et al. [76] first used a deep convolutional architecture. Reference [76] designed a multiscale convolutional network composed of two stages of convolu-

20

2 Human-Centric Visual Analysis: Tasks and Progress

tional layers for feature extraction, which is followed by a classifier. The model is first trained with unsupervised learning layer by layer and then using supervised learning with a classifier for label prediction. Unlike previous approaches, this convolutional network performs end-to-end training, whose features are all learned from the input data. Moreover, bootstrapping is used for relieving the imbalance between positive and negative samples. JointDeep [77] designed a deep convolutional network. Each of the convolutional layers in the proposed deep network is responsible for a specific task, while the whole network is able to learn feature extraction, deformation issues, occlusion issues, and classification jointly. MultiSDP [78] proposed a multistage contextual deep model simulating the cascaded classifiers. Rather than training sequentially, the cascaded classifiers in the deep model can be trained jointly using backpropagation. SDN [79] proposed a switchable restricted Boltzmann for better detection in cluttered background and variably presented pedestrians. Driven by the success of (“slow”) R-CNN [80] for general object detection, a recent series of methods have adopted a two-stage pipeline for pedestrian detection. These methods first use proposal methods to predict candidate detection bounding boxes, generally a large amount. These candidate boxes are then fed into a CNN for feature learning and class prediction. In the task of pedestrian detection, the proposal methods used are generally standalone pedestrian detectors consisting of handcrafted features and boosted classifiers. Reference [81] used SquaresChnFtrs [64] as proposal methods, which are fed into a CNN for classification. In this paper, two CNNs with different scales were tried, which are CifarNet [82] and AlexNet [83]. The methods were evaluated on the Caltech [60] and KITTI [59] datasets. 
The performance was on par with the state of the art at that time but did not yet surpass some of the handcrafted methods, owing to the design of the CNN and the lack of part or occlusion modeling. TA-CNN [84] employed the ACF detector [65], together with semantic information, to generate proposals. The CNN was adapted from AlexNet [83]. This method attempted to improve performance by reducing the confusion between positive samples and hard negative ones. It was evaluated on the Caltech [60] and ETH [56] datasets, where it surpassed state-of-the-art methods. DeepParts [85] applied the LDCF [67] detector to generate proposals and learned a set of complementary parts with neural networks, improving detection under occlusion. The authors first constructed a part pool covering all positions and ratios of body parts and automatically chose appropriate parts for part detection. Subsequently, the model learned a detector for each body part without using part annotations. These part detectors are independent CNN classifiers, one per body part. Furthermore, the proposal shifting problem was handled. Finally, full-body scores were inferred to complete pedestrian detection. SAF R-CNN [86] implemented an intuitive revision of this R-CNN two-stage approach. It used the ACF detector [65] for proposal generation. The proposals were fed into a CNN that soon splits into two subnetwork branches, driven by a scale-aware weighting layer. Each subnetwork follows the popular Fast R-CNN [15] framework. This approach improved the detection of small-size pedestrians.
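The two-branch design of SAF R-CNN can be pictured with a small numerical sketch. The code below is only an illustration under stated assumptions, not the authors' implementation: the function names, the sigmoid gate on proposal height, and the parameters `h_ref` and `beta` are hypothetical. The text above states only that a scale-aware weighting layer drives two subnetworks.

```python
import math
from typing import List, Sequence


def large_branch_weight(height: float, h_ref: float = 50.0, beta: float = 10.0) -> float:
    """Weight given to the large-size branch for a proposal of this height.

    Assumption: a sigmoid gate centred at h_ref (pixels). Tall proposals get
    a weight near 1, short proposals a weight near 0.
    """
    return 1.0 / (1.0 + math.exp(-(height - h_ref) / beta))


def fuse_scores(
    scores_large: Sequence[float],
    scores_small: Sequence[float],
    heights: Sequence[float],
    h_ref: float = 50.0,
    beta: float = 10.0,
) -> List[float]:
    """Blend per-proposal scores of the two branches by proposal height."""
    fused = []
    for s_large, s_small, h in zip(scores_large, scores_small, heights):
        w_large = large_branch_weight(h, h_ref, beta)
        # The two weights sum to one, so the result stays in the score range.
        fused.append(w_large * s_large + (1.0 - w_large) * s_small)
    return fused


if __name__ == "__main__":
    # Three proposals: short, medium, and tall (heights in pixels).
    print(fuse_scores([0.9, 0.9, 0.9], [0.1, 0.1, 0.1], [20.0, 50.0, 120.0]))
```

With this kind of gate, a short proposal is scored almost entirely by the small-size branch, which is one plausible way a weighting layer could help small pedestrians.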


Unlike the above R-CNN-based methods, the CompACT method [87] extracted both handcrafted features and deep convolutional features and learned boosted classifiers on top of them. A complexity-aware cascade boosting algorithm was used so that features of various complexities could be integrated into a single model. The CCF detector [88] is a boosted classifier on pyramids of deep convolutional features that uses no region proposals. Rather than using a deep convolutional network as both feature learner and predictor, as the methods mentioned above do, this method used the network only as a first-step image feature extractor.
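The idea of combining features of different complexities in one boosted model can be illustrated with a toy cascade that evaluates cheap stages first and rejects a candidate window as soon as its accumulated score falls below a threshold. This is a minimal sketch under assumptions, not the CompACT algorithm itself: the `Stage` type, the stage functions, the costs, and the rejection thresholds are hypothetical placeholders, and no learning is performed here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Stage:
    score_fn: Callable[[Sequence[float]], float]  # weak learner on some feature
    cost: float                                   # hypothetical evaluation cost
    reject_below: float                           # early-rejection threshold


def run_cascade(window: Sequence[float], stages: List[Stage]) -> Tuple[bool, float, float]:
    """Return (accepted, accumulated score, total cost spent)."""
    total_score = 0.0
    total_cost = 0.0
    for stage in stages:
        total_score += stage.score_fn(window)
        total_cost += stage.cost
        if total_score < stage.reject_below:
            # Stop early: later (more expensive) stages are never evaluated.
            return False, total_score, total_cost
    return True, total_score, total_cost


def build_toy_cascade() -> List[Stage]:
    """Two stages: a cheap test on one value, then a costlier test on the mean."""
    cheap = Stage(lambda w: 1.0 if w[0] > 0.5 else -1.0, cost=1.0, reject_below=0.0)
    costly = Stage(lambda w: 1.0 if sum(w) / len(w) > 0.5 else -1.0, cost=10.0, reject_below=1.0)
    return [cheap, costly]


if __name__ == "__main__":
    stages = build_toy_cascade()
    for window in ([0.1, 0.9, 0.9], [0.9, 0.9, 0.9], [0.9, 0.0, 0.0]):
        print(window, run_cascade(window, stages))
```

Ordering the stages from cheap to expensive means most negative windows are discarded after spending only the cost of the first stage, while the costly feature is computed only for the few windows that survive.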

2.4 Human Segmentation and Clothes Parsing

The goal of human parsing is to partition the human body into different semantic parts, such as hair, head, torso, arms, and legs. This partition provides rich descriptions for human-centric analysis and has thus become increasingly important for many computer vision applications, including content-based image/video retrieval, person re-identification, video surveillance, action recognition, and clothes fashion recognition. However, human parsing is very challenging in real-life scenarios because of the variability in human appearance and shape caused by the large number of human poses, clothes types, and occlusion/self-occlusion patterns.

Part segment proposal generation. Previous works generally adopt low-level segment-based proposals, while some approaches take higher-level cues. For example, Bo and Fowlkes exploited roughly learned part location priors and part mean shape information, and they derived a number of part segments from the gPb-UCM method using a constrained region merging method. Dong et al. employed Parselets to obtain mid-level part semantic information for the proposals. However, low-level, mid-level, and rough-location proposals may all result in many false positives, misleading the later process.

References

1. T. Sakai, M. Nagao, T. Kanade, Computer Analysis and Classification of Photographs of Human Faces (Kyoto University, 1972)
2. K.-K. Sung, T. Poggio, Example-based learning for view-based human face detection. TPAMI 20(1), 39–51 (1998)
3. H. Rowley, S. Baluja, T. Kanade, Rotation invariant neural network-based face detection, in CVPR, p. 38 (1998)
4. P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in CVPR, vol. 1. IEEE, pp. I–511 (2001)
5. P. Viola, M.J. Jones, Robust real-time face detection. IJCV 57(2), 137–154 (2004)
6. Q. Zhu, M.-C. Yeh, K.-T. Cheng, S. Avidan, Fast human detection using a cascade of histograms of oriented gradients, in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2. IEEE, pp. 1491–1498 (2006)
7. P.C. Ng, S. Henikoff, Sift: Predicting amino acid changes that affect protein function. Nucleic acids research 31(13), 3812–3814 (2003)
8. Z. Li, S. Chang, F. Liang, T.S. Huang, L. Cao, J.R. Smith, Learning locally-adaptive decision functions for person verification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3610–3617 (2013)
9. S. Liao, A.K. Jain, S.Z. Li, A fast and accurate unconstrained face detector. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(2), 211–223 (2016)
10. X. Zhu, D. Ramanan, Face detection, pose estimation, and landmark localization in the wild, in CVPR. IEEE, pp. 2879–2886 (2012)
11. H. Li, Z. Lin, X. Shen, J. Brandt, G. Hua, A convolutional neural network cascade for face detection, in CVPR, pp. 5325–5334 (2015)
12. K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters 23(10), 1499–1503 (2016)
13. Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, X. Hu, Scale-aware face detection, in CVPR, vol. 3 (2017)
14. Y. Liu, H. Li, J. Yan, F. Wei, X. Wang, X. Tang, Recurrent scale approximation for object detection in cnn, in ICCV, vol. 5 (2017)
15. R. Girshick, Fast r-cnn, in Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448 (2015)
16. S. Wan, Z. Chen, T. Zhang, B. Zhang, K.-k. Wong, Bootstrapping face detection with hard negative examples, arXiv preprint arXiv:1608.02236 (2016)
17. V. Jain, E. Learned-Miller, Fddb: a benchmark for face detection in unconstrained settings, Technical Report UM-CS-2010-009, University of Massachusetts, Amherst (2010)
18. Y. Bai, Y. Zhang, M. Ding, B. Ghanem, Finding tiny faces in the wild with generative adversarial network, in CVPR (2018)
19. T.F. Cootes, G.J. Edwards, C.J. Taylor, Active appearance models. PAMI 6, 681–685 (2001)
20. J.M. Saragih, S. Lucey, J.F. Cohn, Deformable model fitting by regularized landmark mean-shift. IJCV 91(2), 200–215 (2011)
21. P.N. Belhumeur, D.W. Jacobs, D.J. Kriegman, N. Kumar, Localizing parts of faces using a consensus of exemplars. PAMI 35(12), 2930–2940 (2013)
22. L. Liang, R. Xiao, F. Wen, J. Sun, Face alignment via component-based discriminative search, in ECCV (Springer, 2008), pp. 72–85
23. M. Dantone, J. Gall, G. Fanelli, L. Van Gool, Real-time facial feature detection using conditional regression forests, in CVPR (IEEE, 2012), pp. 2578–2585
24. M. Valstar, B. Martinez, X. Binefa, M. Pantic, Facial point detection using boosted regression and graph models, in CVPR (IEEE, 2010), pp. 2729–2736
25. X. Cao, Y. Wei, F. Wen, J. Sun, Face alignment by explicit shape regression. IJCV 107(2), 177–190 (2014)
26. V. Kazemi, J. Sullivan, One millisecond face alignment with an ensemble of regression trees, in CVPR, pp. 1867–1874 (2014)
27. X. Xiong, F. Torre, Supervised descent method and its applications to face alignment, in CVPR, pp. 532–539 (2013)
28. S. Ren, X. Cao, Y. Wei, J. Sun, Face alignment at 3000 fps via regressing local binary features, in CVPR, pp. 1685–1692 (2014)
29. S. Zhu, C. Li, C.-C. Loy, X. Tang, Unconstrained face alignment via cascaded compositional learning, in CVPR, pp. 3409–3417 (2016)
30. O. Tuzel, T.K. Marks, S. Tambe, Robust face alignment using a mixture of invariant experts, in ECCV (Springer, 2016), pp. 825–841
31. X. Fan, R. Liu, Z. Luo, Y. Li, Y. Feng, Explicit shape regression with characteristic number for facial landmark localization, TMM (2017)
32. X. Burgos-Artizzu, P. Perona, P. Dollár, Robust face landmark estimation under occlusion, in ICCV, pp. 1513–1520 (2013)
33. E. Zhou, H. Fan, Z. Cao, Y. Jiang, Q. Yin, Extensive facial landmark localization with coarse-to-fine convolutional network cascade, in ICCV Workshops, pp. 386–391 (2013)
34. Z. Zhang, P. Luo, C.C. Loy, X. Tang, Facial landmark detection by deep multi-task learning, in ECCV (Springer, 2014), pp. 94–108
35. H. Liu, D. Kong, S. Wang, B. Yin, Sparse pose regression via componentwise clustering feature point representation. TMM 18(7), 1233–1244 (2016)
36. T. Zhang, W. Zheng, Z. Cui, Y. Zong, J. Yan, K. Yan, A deep neural network-driven feature learning method for multi-view facial expression recognition. TMM 18(12), 2528–2536 (2016)
37. J. Zhang, S. Shan, M. Kan, X. Chen, Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment, in ECCV (Springer, 2014), pp. 1–16
38. J. Zhang, M. Kan, S. Shan, X. Chen, Occlusion-free face alignment: deep regression networks coupled with de-corrupt autoencoders, in CVPR, pp. 3428–3437 (2016)
39. H. Lai, S. Xiao, Z. Cui, Y. Pan, C. Xu, S. Yan, Deep cascaded regression for face alignment, arXiv preprint arXiv:1510.09083 (2015)
40. D. Merget, M. Rock, G. Rigoll, Robust facial landmark detection via a fully-convolutional local-global context network, in CVPR, pp. 781–790 (2018)
41. A. Bulat, G. Tzimiropoulos, Super-fan: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with gans, in CVPR (2018)
42. Z. Tang, X. Peng, S. Geng, L. Wu, S. Zhang, D. Metaxas, Quantized densely connected u-nets for efficient landmark localization, in ECCV (2018)
43. X. Peng, R.S. Feris, X. Wang, D.N. Metaxas, A recurrent encoder-decoder network for sequential face alignment, in ECCV (Springer, 2016), pp. 38–56
44. S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, A. Kassim, Robust facial landmark detection via recurrent attentive-refinement networks, in ECCV (Springer, 2016), pp. 57–72
45. G. Trigeorgis, P. Snape, M.A. Nicolaou, E. Antonakos, S. Zafeiriou, Mnemonic descent method: a recurrent process applied for end-to-end face alignment, in CVPR, pp. 4177–4187 (2016)
46. X. Zhu, Z. Lei, X. Liu, H. Shi, S.Z. Li, Face alignment across large poses: a 3d solution, in CVPR, pp. 146–155 (2016)
47. A. Jourabloo, X. Liu, Large-pose face alignment via cnn-based dense 3d model fitting, in CVPR, pp. 4188–4196 (2016)
48. F. Liu, D. Zeng, Q. Zhao, X. Liu, Joint face alignment and 3d face reconstruction, in ECCV (Springer, 2016), pp. 545–560
49. A. Bulat, G. Tzimiropoulos, How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks), in CVPR, vol. 1, no. 2, p. 4 (2017)
50. Y. Feng, F. Wu, X. Shao, Y. Wang, X. Zhou, Joint 3d face reconstruction and dense alignment with position map regression network, in ECCV (2018)
51. X. Dong, S.-I. Yu, X. Weng, S.-E. Wei, Y. Yang, Y. Sheikh, Supervision-by-registration: an unsupervised approach to improve the precision of facial landmark detectors, in CVPR, pp. 360–368 (2018)
52. Y. Zhang, Y. Guo, Y. Jin, Y. Luo, Z. He, H. Lee, Unsupervised discovery of object landmarks as structural representations, in CVPR (2018)
53. X. Dong, Y. Yan, W. Ouyang, Y. Yang, Style aggregated network for facial landmark detection, in CVPR, vol. 2, p. 6 (2018)
54. S. Honari, P. Molchanov, S. Tyree, P. Vincent, C. Pal, J. Kautz, Improving landmark localization with semi-supervised learning, in CVPR (2018)
55. N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2005)
56. A. Ess, B. Leibe, L. Van Gool, Depth and appearance for mobile scene analysis, in IEEE International Conference on Computer Vision (ICCV) (2007)
57. M. Enzweiler, D.M. Gavrila, Monocular pedestrian detection: Survey and experiments. IEEE Trans. Pattern Anal. Mach. Intell. 12, 2179–2195 (2008)
58. C. Wojek, S. Walk, B. Schiele, Multi-cue onboard pedestrian detection (2009)
59. A. Geiger, P. Lenz, R. Urtasun, Are we ready for autonomous driving? the kitti vision benchmark suite, in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (IEEE, 2012), pp. 3354–3361
60. P. Dollár, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: an evaluation of the state of the art (2012)
61. S. Maji, A.C. Berg, J. Malik, Classification using intersection kernel support vector machines is efficient, in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8. IEEE (2008)
62. J. Marin, D. Vázquez, A.M. López, J. Amores, B. Leibe, Random forests of local experts for pedestrian detection, in Proceedings of the IEEE International Conference on Computer Vision, pp. 2592–2599 (2013)
63. P. Dollár, Z. Tu, P. Perona, S. Belongie, Integral channel features, in British Machine Vision Conference (BMVC) (2009)
64. R. Benenson, M. Mathias, T. Tuytelaars, L. Van Gool, Seeking the strongest rigid detector, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3666–3673 (2013)
65. P. Dollár, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for object detection (2014)
66. J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, in The Annals of Statistics (2000)
67. W. Nam, P. Dollár, J.H. Han, Local decorrelation for improved pedestrian detection, in Advances in Neural Information Processing Systems, pp. 424–432 (2014)
68. S. Paisitkriangkrai, C. Shen, A. Van Den Hengel, Strengthening the effectiveness of pedestrian detection with spatially pooled features, in European Conference on Computer Vision (Springer, 2014), pp. 546–561
69. S. Zhang, R. Benenson, B. Schiele, et al., Filtered channel features for pedestrian detection, in CVPR, vol. 1, p. 4 (2015)
70. P. Felzenszwalb, D. McAllester, D. Ramanan, A discriminatively trained, multiscale, deformable part model, in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (IEEE, 2008), pp. 1–8
71. D. Park, D. Ramanan, C. Fowlkes, Multiresolution models for object detection, in European Conference on Computer Vision (Springer, 2010), pp. 241–254
72. W. Ouyang, X. Wang, Single-pedestrian detection aided by multi-pedestrian detection, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3198–3205 (2013)
73. J. Yan, X. Zhang, Z. Lei, S. Liao, S.Z. Li, Robust multi-resolution pedestrian detection in traffic scenes, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3033–3040 (2013)
74. X. Wang, W. Ouyang, A discriminative deep model for pedestrian detection with occlusion handling, in 2012 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2012), pp. 3258–3265
75. W. Ouyang, X. Zeng, X. Wang, Modeling mutual visibility relationship in pedestrian detection, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3222–3229 (2013)
76. P. Sermanet, K. Kavukcuoglu, S. Chintala, Y. LeCun, Pedestrian detection with unsupervised multi-stage feature learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3626–3633 (2013)
77. W. Ouyang, X. Wang, Joint deep learning for pedestrian detection, in Proceedings of the IEEE International Conference on Computer Vision, pp. 2056–2063 (2013)
78. X. Zeng, W. Ouyang, X. Wang, Multi-stage contextual deep learning for pedestrian detection, in Proceedings of the IEEE International Conference on Computer Vision, pp. 121–128 (2013)
79. P. Luo, Y. Tian, X. Wang, X. Tang, Switchable deep network for pedestrian detection, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 899–906 (2014)
80. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587 (2014)
81. J. Hosang, M. Omran, R. Benenson, B. Schiele, Taking a deeper look at pedestrians, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4073–4082 (2015)
82. A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny images (Technical report, Citeseer, 2009)
83. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
84. X. Wang, Y. Tian, P. Luo, X. Tang, Pedestrian detection aided by deep learning semantic tasks, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
85. X. Wang, Y. Tian, P. Luo, X. Tang, Deep learning strong parts for pedestrian detection, in IEEE International Conference on Computer Vision (ICCV) (2015)
86. J. Li, X. Liang, S. Shen, T. Xu, J. Feng, S. Yan, Scale-aware fast r-cnn for pedestrian detection. IEEE Transactions on Multimedia 20(4), 985–996 (2018)
87. M. Saberian, Z. Cai, N. Vasconcelos, Learning complexity-aware cascades for deep pedestrian detection, in IEEE International Conference on Computer Vision (ICCV) (2015)
88. B. Yang, J. Yan, Z. Lei, S.Z. Li, Convolutional channel features, in ICCV, pp. 82–90 (2015)

Part II

Localizing Persons in Images

Finding people in images/videos is one of the fundamental problems of computer vision and has been widely studied over recent decades. It is an important step toward many subsequent applications such as face recognition, human pose estimation, and smart surveillance. In this part, we introduce two specific studies of finding people in images, i.e., facial landmark localization and pedestrian detection.

With recent advances in deep learning techniques and large-scale annotated image datasets, deep convolutional neural network models have achieved significant progress in salient object detection [1], crowd analysis [2, 3], and facial landmark localization [4]. Facial landmark localization is typically formulated as a regression problem. Among the existing methods that follow this approach, cascaded deep convolutional neural networks [5, 6] have emerged as one of the leading methods because of their superior accuracy. Nevertheless, the three-level cascaded CNN framework is complicated and unwieldy. It is arduous to jointly handle the classification (i.e., whether a landmark exists) and localization problems for unconstrained settings. Long et al. [7] recently proposed an FCN for pixel labeling, which takes an input image with an arbitrary size and produces a dense label map with the same resolution. This approach shows convincing results for semantic image segmentation and is also very efficient because convolutions are shared among overlapping image patches. Notably, classification and localization can be simultaneously achieved with a dense label map. The success of this work inspires us to adopt an FCN in our task, i.e., pixelwise facial landmark prediction. Nevertheless, a specialized architecture is required because our task requires more accurate prediction than generic image labeling.

Pedestrian detection is an essential task for an intelligent video surveillance system. It has also been an active research area in computer vision in recent years.
Many pedestrian detectors, such as [8, 9], have been proposed based on handcrafted features. With the great success achieved by deep models in many tasks of computer vision, hybrid methods that combine traditional, handcrafted features [8, 9] and deep convolutional features [10, 11] have become popular. For example, in [12], a stand-alone pedestrian detector (which uses squares channel features) is adopted as a highly selective proposer ( τ,

(9.10)

where $\eta_j^t$ is the average accuracy of the $j$th classifier in the current iteration and $\alpha$ is a parameter that controls the rate at which the pace increases. In our experiments, we empirically set $\{\lambda_0, \alpha\} = \{0.2, 0.08\}$. Note that the pace parameter $\lambda$ should stop increasing once all training samples have $v = 1$. Thus, we introduce an empirical threshold $\tau$ so that $\lambda$ is updated only in early iterations, i.e., $t \le \tau$; $\tau$ is set to 12 in our experiments. The entire algorithm is summarized in Algorithm 1. This solving strategy for the ASPL model accords closely with the pipeline of our framework.

Convergence Discussion: As illustrated in Algorithm 1, the ASPL algorithm alternately updates the variables, including the classifier parameters $w$, $b$ (by weighted SVM), the pseudo-labels $y$ (by the closed-form solution of Theorem 1), the weights $v$ (by SPL), and the low-confidence sample annotations $\phi$ (by AL). For the first three sets of variables, each update is a global optimum of a subproblem of the original model; thus, the decrease of the objective function can be guaranteed. However, as with other existing AL techniques, human effort is involved in the loop of the AL stage; thus, a monotonic decrease of the objective function cannot be guaranteed in this step. As learning proceeds, the model becomes increasingly mature, and the AL labor tends to lessen in the later learning stages. With gradually less involvement of the AL step, the objective function tends to decrease monotonically over the iterations, and thus our algorithm tends to converge.
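The pace-parameter schedule described before the convergence discussion can be sketched in a few lines. This is a hedged illustration rather than the exact rule of Eq. (9.10), which is truncated in this text: the additive form `lam + alpha * accuracy` is an assumption consistent with the description of $\alpha$ as the pace increase rate, and the function names are hypothetical. The defaults `lam0 = 0.2`, `alpha = 0.08`, and `tau = 12` are the values quoted above.

```python
from typing import List, Sequence


def update_pace(lam_prev: float, avg_accuracy: float, t: int,
                alpha: float = 0.08, tau: int = 12) -> float:
    """Pace parameter for iteration t (t >= 1).

    Assumed rule: grow the pace by alpha times the classifier's average
    accuracy while t <= tau, then keep it fixed.
    """
    if t <= tau:
        return lam_prev + alpha * avg_accuracy
    return lam_prev


def pace_schedule(accuracies: Sequence[float], lam0: float = 0.2,
                  alpha: float = 0.08, tau: int = 12) -> List[float]:
    """Pace values for iterations 0..len(accuracies) of one classifier."""
    lams = [lam0]
    for t, acc in enumerate(accuracies, start=1):
        lams.append(update_pace(lams[-1], acc, t, alpha, tau))
    return lams


if __name__ == "__main__":
    # Hypothetical per-iteration average accuracies of one classifier.
    print(pace_schedule([0.5, 0.6, 0.7, 0.8], tau=3))
```

Under this assumed rule, a more accurate classifier raises its pace faster, so more pseudo-labeled samples are admitted early, while the threshold `tau` freezes the pace in later iterations.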


Part V

Higher Level Tasks

In computer vision, increasing attention has been paid to understanding human activity, i.e., determining what people are doing in a given video, in application domains such as intelligent surveillance, robotics, and human–computer interaction. Recently developed 3D/depth sensors have opened new opportunities with enormous commercial value by providing richer information (e.g., extra depth data of scenes and objects) than traditional cameras. Building on this enriched information, human poses can be estimated more easily. However, modeling complicated human activities remains challenging.

Many works on human action/activity recognition focus mainly on designing robust and descriptive features [1, 2]. For example, Xia and Aggarwal [1] extract spatiotemporal interest points from depth videos (DSTIP) and develop a depth cuboid similarity feature (DCSF) to model human activities. Oreifej and Liu [2] propose capturing spatiotemporal changes in activities by using a histogram of oriented 4D surface normals (HON4D). Most of these methods, however, overlook detailed spatiotemporal structural information and are limited to periodic activities.

Several compositional part-based approaches studied for complex scenarios have achieved substantial progress [3, 4]; they represent an activity with deformable parts and contextual relations. For instance, Wang and Mori [3] recognized human activities in common videos by training hidden conditional random fields in a max-margin framework. For activity recognition in RGB-D data, Packer et al. [5] employed the latent structural SVM to train the model with part-based pose trajectories and object manipulations. An ensemble model of actionlets was studied in [4] to represent 3D human activities with a new feature called the local occupancy pattern (LOP).
To address more complicated activities with large temporal variations, some improved models discover the temporal structures of activities by localizing sequential actions. For example, Wang and Wu [6] propose solving the temporal alignment of actions by maximum margin temporal warping. Tang et al. [7] capture the latent temporal structures of 2D activities based on the variable-duration hidden Markov model. Koppula and Saxena [8] apply conditional random fields to model the subactivities and affordances of the objects for 3D activity recognition. In the depth video scenario, Packer et al. [5] address action recognition by modeling both pose trajectories and object manipulations with a latent structural SVM. Wang et al. [4] develop an actionlet ensemble model and a novel feature called the local occupancy pattern (LOP) to capture intraclass variance in 3D action. However, these methods address only action recognition over short time periods, in which temporal segmentation matters little.

Recently, AND/OR graph representations have been introduced as extensions of part-based models [9, 10] and achieve very competitive performance in handling large data variations. These models incorporate not only hierarchical decompositions but also explicit structural alternatives (e.g., different ways of composition). Zhu and Mumford [9] first explore AND/OR graph models for image parsing. Pei et al. [10] then introduce the models for video event understanding, but their approach requires elaborate annotations. Liang et al. [11] propose training the spatiotemporal AND/OR graph model using a nonconvex formulation, which is discriminatively trained on weakly annotated training data. However, the abovementioned models rely on handcrafted features, and their discriminative capacities are not optimized for 3D human activity recognition.

The past few years have seen a resurgence of research on the design of deep neural networks, and impressive progress has been made on learning image features from raw data [12, 13]. To address human action recognition from videos, Ji et al. [14] develop a novel deep architecture of convolutional neural networks to extract features from both spatial and temporal dimensions. Luo et al. [15] propose incorporating a new switchable restricted Boltzmann machine (SRBM) to explicitly model the complex mixture of visual appearances for pedestrian detection; they train their model using an EM-type iterative algorithm. Amer and Todorovic [16] apply sum-product networks (SPNs) to model human activities based on variable primitive actions.
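To make concrete what extracting features from both spatial and temporal dimensions means, the sketch below applies a single 3D filter to a stack of frames in plain Python. It is a generic illustration of 3D convolution (implemented, as is usual in deep learning, as cross-correlation without padding), not the architecture of Ji et al. [14]; the function name and the toy inputs are hypothetical.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as [time][row][column]


def conv3d_valid(video: Volume, kernel: Volume) -> Volume:
    """Slide a 3D kernel over a (time, height, width) volume, no padding."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out: Volume = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # Sum over a small block that spans several frames at once.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += video[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


if __name__ == "__main__":
    # Four 3x3 frames whose brightness grows by one per frame.
    video = [[[float(t)] * 3 for _ in range(3)] for t in range(4)]
    # A purely temporal difference filter spanning two frames.
    kernel = [[[-1.0]], [[1.0]]]
    print(conv3d_valid(video, kernel))
```

Because the kernel extends along the time axis, its response depends on how pixel values change between frames, which a 2D filter applied frame by frame cannot capture.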

References

1. L. Xia, J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera, in CVPR, pp. 2834–2841 (2013)
2. O. Oreifej, Z. Liu, Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences, in CVPR, pp. 716–723 (2013)
3. Y. Wang, G. Mori, Hidden part models for human action recognition: Probabilistic vs. max-margin. IEEE Trans. Pattern Anal. Mach. Intell. 33(7), 1310–1323 (2011)
4. J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in CVPR, pp. 1290–1297 (2012)
5. B. Packer, K. Saenko, D. Koller, A combined pose, object, and feature model for action understanding, in CVPR, pp. 1378–1385 (2012)
6. J. Wang, Y. Wu, Learning maximum margin temporal warping for action recognition, in ICCV, pp. 2688–2695 (2013)
7. K. Tang, L. Fei-Fei, D. Koller, Learning latent temporal structure for complex event detection, in CVPR, pp. 1250–1257 (2012)
8. H.S. Koppula, A. Saxena, Learning spatio-temporal structure from rgb-d videos for human activity detection and anticipation, in ICML, pp. 792–800 (2013)


9. S. Zhu, D. Mumford, A stochastic grammar of images. Found. Trends Comput. Graph. Vis. 2(4), 259–362 (2007)
10. M. Pei, Y. Jia, S. Zhu, Parsing video events with goal inference and intent prediction, in ICCV, pp. 487–494 (2011)
11. X. Liang, L. Lin, L. Cao, Learning latent spatio-temporal compositional model for human action recognition, in ACM Multimedia, pp. 263–272 (2013)
12. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
13. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst., pp. 1097–1105 (2012)
14. S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
15. P. Luo, Y. Tian, X. Wang, X. Tang, Switchable deep network for pedestrian detection, in CVPR (2014)
16. M.R. Amer, S. Todorovic, Sum-product networks for modeling activities with stochastic structure, in CVPR, pp. 1314–1321 (2012)

Chapter 10

Human Activity Understanding

Abstract Understanding human activity is very challenging even with recently developed 3D/depth sensors. To solve this problem, this work investigates a novel deep structured model that adaptively decomposes an activity into temporal parts using convolutional neural networks (CNNs). The proposed model advances two aspects of the traditional deep learning approaches. First, a latent temporal structure is introduced into the deep model, accounting for large temporal variations in diverse human activities. In particular, we utilize the latent variables to decompose the input activity into a number of temporally segmented subactivities and feed them into the parts (i.e., subnetworks) of the deep architecture. Second, we incorporate a radius-margin bound as a regularization term into our deep model, which effectively improves the generalization performance for classification (Reprinted by permission from Springer Nature Customer Service Centre GmbH: Springer International Journal of Computer Vision [1] © 2019).

10.1 Introduction

Most previous methods recognize 3D human activities by training discriminative/generative classifiers based on carefully designed features [2–5]. These approaches often require sufficient domain knowledge and heavy feature engineering because of the difficulty (a), which could limit their application. To improve the discriminative performance, certain compositional methods [6, 7] model complex activities by segmenting videos into temporal segments of fixed length. However, because of the difficulty of this task (b), they may have problems segmenting complex activities composed of actions of diverse temporal durations, e.g., the examples in Fig. 10.1.

Fig. 10.1 Two activities of the same category. We consider one activity as a sequence of actions that occur over time; the temporal composition of an action may differ for different subjects

In this work, we develop a deep structured human activity model to address the abovementioned challenges and demonstrate superior performance in comparison to other state-of-the-art approaches in the task of recognizing human activities from grayscale-depth videos captured by an RGB-D camera (i.e., Microsoft Kinect). Our model adaptively represents the input activity instance as a sequence of temporally separated subactivities, and each instance is associated with a cubic-like video segment of a flexible length.

Our model is inspired by the effectiveness of two widely successful techniques: deep learning [8–13] and latent structure models [14–18]. One example of the former is the convolutional neural network (CNN), which was recently applied to generate powerful features for video classification [13, 19]. On the other hand, latent structure models (such as the deformable part-based model [15]) have been demonstrated to be an effective class of models for managing large object variations for recognition and detection. One of the key components of these models is the reconfigurable flexibility of the model structure, which is often implemented by estimating latent variables during inference.

We adopt the deep CNN architecture [8, 13] to extract features layer by layer from the input video data, and the architecture is vertically decomposed into several subnetworks corresponding to the video segments, as Fig. 10.2 illustrates. In particular, our model searches for the optimal composition of each activity instance during recognition, which is the key to managing temporal variation in human activities. Moreover, we introduce a relaxed radius-margin bound into our deep model, which effectively improves the generalization performance for classification.

© Springer Nature Singapore Pte Ltd. 2020
L. Lin et al., Human Centric Visual Analysis with Deep Learning, https://doi.org/10.1007/978-981-13-2387-4_10

10.2 Deep Structured Model

In this section, we introduce the main components of our deep structured model, including the spatiotemporal CNNs, the latent structure of activity decomposition, and the radius-margin bound for classification.


10.2.1 Spatiotemporal CNNs

We propose an architecture of spatiotemporal convolutional neural networks (CNNs), as Fig. 10.2 illustrates. In the input layer, the activity video is decomposed into M video segments, with each segment associated with one separated subactivity. Accordingly, the proposed architecture consists of M subnetworks to extract features from the corresponding decomposed video segments. Our spatiotemporal CNNs involve both 3D and 2D convolutional layers. The 3D convolutional layer extracts spatiotemporal features to jointly capture appearance and motion information and is followed by a max-pooling operator to improve the robustness against local deformations and noise.

As shown in Fig. 10.2, each subnetwork (highlighted by the dashed box) consists of two stacked 3D convolutional layers and one 2D convolutional layer. For the input to each subnetwork, the number of frames is very small (e.g., 9). After two layers of 3D convolution followed by max-pooling, the temporal dimension of each set of feature maps is too small to perform a 3D convolution. Thus, we stack a 2D convolutional layer upon the two 3D convolutional layers. The outputs from the different subnetworks are merged and fed into one fully connected layer that generates the final feature vector of the input video.

10.2.2 Latent Temporal Structure

In contrast to traditional deep learning methods with fixed architectures, we incorporate the latent structure into the deep model to flexibly adapt to the input video during inference and learning. Assume that the input video is temporally divided into $M$ segments corresponding to the subactivities. We index each video segment by its starting anchor frame $s_j$ and its temporal length (i.e., the number of frames) $t_j$, where $t_j$ is no less than $m$, i.e., $t_j \ge m$. To address the large temporal variation in human activities, we make $s_j$ and $t_j$ variables. Thus, for all video segments, we denote the indexes of the starting anchor frames as $(s_1, \dots, s_M)$ and their temporal lengths as $(t_1, \dots, t_M)$; these are regarded as the latent variables in our model, $h = (s_1, \dots, s_M, t_1, \dots, t_M)$. These latent variables specifying the segmentation will be adaptively estimated for different input videos.

Fig. 10.2 The architecture of spatiotemporal convolutional neural networks. The neural networks are stacked convolutional layers, max-pooling operators, and a fully connected layer, where the raw segmented videos are treated as the input. A subnetwork is referred to as a vertically decomposed subpart with several stacked layers that extracts features for one segmented video section (i.e., one subactivity). Moreover, by using the latent variables, our architecture is capable of explicitly handling diverse temporal compositions of complex activities

Fig. 10.3 Illustration of incorporating the latent structure into the deep model. Different subnetworks are denoted by different colors

We associate the CNNs with the video segmentation by feeding each segmented part into a subnetwork, as Fig. 10.2 illustrates. Next, according to the video segmentation (i.e., the decomposition into subactivities), we feed the CNNs with sampled video frames. Specifically, each subnetwork takes $m$ video frames as the input; if a segment contains more than $m$ frames according to the latent variables, i.e., $t_j > m$, then uniform sampling is performed to extract $m$ key frames. Figure 10.3 shows an intuitive example of our structured deep model in which the input video is segmented into three sections corresponding to the three subnetworks in our deep architecture. Thus, the configuration of the CNNs is dynamically adjusted along with the search for the appropriate latent variables of the input videos. Given the parameters of the CNNs $\omega$ and the input video $x_i$ with its latent variables $h_i$, the generated feature of $x_i$ can be represented as $\phi(x_i; \omega, h_i)$.
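The uniform sampling of key frames described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `sample_key_frames` is hypothetical, and we assume that the first and last frames of a segment are always kept, which the text does not specify.

```python
def sample_key_frames(start, length, m):
    """Return m frame indices spread evenly over a segment.

    start  -- index of the first frame of the segment (s_j)
    length -- number of frames in the segment (t_j)
    m      -- number of frames expected by a subnetwork
    """
    if length < m:
        raise ValueError("a segment must contain at least m frames")
    if m == 1:
        return [start]
    # Integer arithmetic keeps the indices strictly increasing whenever
    # length >= m, and always includes the first and last frame.
    return [start + (k * (length - 1)) // (m - 1) for k in range(m)]


# A segment with exactly m frames is passed through unchanged,
# while a longer segment is thinned out to m key frames.
short_segment = sample_key_frames(start=10, length=9, m=9)
long_segment = sample_key_frames(start=0, length=17, m=9)
```

With $t_j = m$ every frame is used; with $t_j > m$ the subnetwork still receives exactly $m$ frames, so the network input size stays fixed while the segment length varies.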


10.2.3 Deep Model with Relaxed Radius-Margin Bound

A large amount of training data is crucial for the success of many deep learning models. Given sufficient training data, the effectiveness of applying the softmax classifier to CNNs has been validated for image classification [20]. However, for 3D human activity recognition, the available training data are usually scarcer than desired. For example, the CAD-120 dataset [21] consists of only 120 RGB-D sequences of 10 categories. In this scenario, although parameter pretraining and dropout are available, the model training often suffers from overfitting. Hence, we consider introducing a more effective classifier in addition to the regularizer to improve the generalization performance of the deep model.

In supervised learning, the support vector machine (SVM), also known as the max-margin classifier, is theoretically sound and generally can achieve promising performance compared with the alternative linear classifiers. In deep learning research, the combination of SVM and CNNs has been exploited [22] and has obtained excellent results in object detection [23]. Motivated by these approaches, we impose a max-margin classifier $(w, b)$ upon the feature generated by the spatiotemporal CNNs for human activity recognition.

As a max-margin classifier, standard SVM adopts $\|w\|^2$, the reciprocal of the squared margin $\gamma^2$, as the regularizer. However, the generalization error bound of SVM depends on the radius-margin ratio $R^2/\gamma^2$, where $R$ is the radius of the minimum enclosing ball (MEB) of the training data [24]. When the feature space is fixed, the radius $R$ is constant and can, therefore, be ignored. However, in our approach, the radius $R$ is determined by the MEB of the training data in the feature space generated by the CNNs. In this scenario, there is a risk that the margin can be increased by simply expanding the MEB of the training data in the feature space. For example, simply multiplying the feature vector by a constant can enlarge the margin between the positive and negative samples, but obviously, this approach will not enable better classification. To overcome this problem, we incorporate the radius-margin bound into the feature learning, as Fig. 10.4 illustrates. In particular, we impose a max-margin classifier with radius information upon the feature generated by the fully connected layer of the spatiotemporal CNNs. The optimization tends to maximize the margin while shrinking the MEB of the training data in the feature space, and we thus obtain a tighter error bound.

Suppose there is a set of $N$ training samples $(X, Y) = \{(x_1, y_1), \dots, (x_N, y_N)\}$, where $x_i$ is the video, $y_i \in \{1, \dots, C\}$ represents the category label, and $C$ is the number of activity categories. We extract the feature for each $x_i$ by the spatiotemporal CNNs, $\phi(x_i; \omega, h_i)$, where $h_i$ refers to the latent variables. By adopting the squared hinge loss and the radius-margin bound, we define the following loss function $L_0$ of our model:

$$L_0 = \frac{1}{2}\underbrace{\|w\|^2 R_\phi^2}_{\text{radius-margin ratio}} + \lambda \sum_{i=1}^{N} \max\Big(0,\ 1 - \big(w^T \phi(x_i; \omega, h_i) + b\big)\, y_i\Big)^2, \tag{10.1}$$

Fig. 10.4 Illustration of our deep model with the radius-margin bound. To improve the generalization performance for classification, we propose integrating the radius-margin bound as a regularizer with feature learning. Intuitively, as well as optimizing the max-margin parameters (w, b), we shrink the radius R of the minimum enclosing ball (MEB) of the training data that are distributed in the feature space generated by the CNNs. The resulting classifier with the regularizer shows better generalization performance than the traditional softmax output

where $\lambda$ is the trade-off parameter, $1/\|w\|$ denotes the margin of the separating hyperplane, $b$ denotes the bias, and $R_\phi$ denotes the radius of the MEB of the training data $\phi(X, \omega, H) = \{\phi(x_1; \omega, h_1), \dots, \phi(x_N; \omega, h_N)\}$ in the CNN feature space. Formally, the radius $R_\phi$ is defined as [24, 25],

$$R_\phi^2 = \min_{R,\,\phi_0} R^2, \quad \text{s.t. } \|\phi(x_i; \omega, h_i) - \phi_0\|^2 \le R^2,\ \forall i. \tag{10.2}$$

The radius $R_\phi$ is implicitly defined by both the training data and the model parameters, meaning (i) the model in Eq. (10.1) is highly nonconvex, (ii) the derivative of $R_\phi$ with respect to $\omega$ is hard to compute, and (iii) the problem is difficult to solve using the stochastic gradient descent (SGD) method. Motivated by the radius-margin-based SVM [26, 27], we investigate using the relaxed form to replace the original definition of $R_\phi$ in Eq. (10.2). In particular, we introduce the maximum pairwise distance $\tilde{R}_\phi$ over all the training samples in the feature space as

$$\tilde{R}_\phi^2 = \max_{i,j} \|\phi(x_i; \omega, h_i) - \phi(x_j; \omega, h_j)\|^2. \tag{10.3}$$
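To make the relaxed radius of Eq. (10.3) concrete, the following sketch computes it for a handful of toy feature vectors and revisits the scaling argument made earlier: multiplying every feature by a constant $c$ lets the classifier shrink $w$ by $1/c$ (enlarging the margin $1/\|w\|$) without changing any decision value, yet the product $\|w\|^2 \tilde{R}_\phi^2$ stays the same. The feature values, the weight vector, and the constant `c` are made up for illustration.

```python
from itertools import combinations


def sq_dist(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def relaxed_radius_sq(features):
    """Squared maximum pairwise distance, as in Eq. (10.3)."""
    return max(sq_dist(u, v) for u, v in combinations(features, 2))


def norm_sq(w):
    return sum(a * a for a in w)


# Toy features and classifier weights (illustrative values only).
features = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
w = [0.5, -0.25]

# Scale the features by c and the weights by 1/c: w^T phi is unchanged.
c = 10.0
scaled_features = [[c * a for a in f] for f in features]
scaled_w = [a / c for a in w]

ratio = norm_sq(w) * relaxed_radius_sq(features)
scaled_ratio = norm_sq(scaled_w) * relaxed_radius_sq(scaled_features)
```

Here `norm_sq(scaled_w)` is smaller than `norm_sq(w)`, so the margin alone looks better after scaling, while `ratio` and `scaled_ratio` agree up to rounding. This is why the radius has to enter the objective once the feature space itself is learned.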

Do and Kalousis [26] proved that $R_\phi$ could be well bounded by $\tilde{R}_\phi$ with Lemma 2:

Lemma 2
$$\tilde{R}_\phi \le R_\phi \le \frac{1 + \sqrt{3}}{2}\, \tilde{R}_\phi.$$

The abovementioned lemma guarantees that the true radius $R_\phi$ can be well approximated by $\tilde{R}_\phi$. With the proper parameter $\eta$, the optimal solution for minimizing the radius-margin ratio $\|w\|^2 R_\phi^2$ is the same as that for minimizing the radius-margin sum $\|w\|^2 + \eta R_\phi^2$ [26]. Thus, by approximating $R_\phi^2$ with $\tilde{R}_\phi^2$ and replacing the radius-margin ratio with the radius-margin sum, we suggest the following deep model with the relaxed radius-margin bound:

$$L_1 = \frac{1}{2}\|w\|^2 + \eta \max_{i,j} \|\phi(x_i; \omega, h_i) - \phi(x_j; \omega, h_j)\|^2 + \lambda \sum_{i=1}^{N} \max\Big(0,\ 1 - \big(w^T \phi(x_i; \omega, h_i) + b\big)\, y_i\Big)^2. \tag{10.4}$$

However, the first max operator in Eq. (10.4) is defined over all training sample pairs, and the minibatch-based SGD optimization method is, therefore, unsuitable. Moreover, the radius in Eq. (10.4) is determined by the maximum distances of the sample pairs in the CNN feature space, and it might be sensitive to outliers. To address these issues, we approximate the max operator with a softmax function, resulting in the following model:

$$L_2 = \frac{1}{2}\|w\|^2 + \eta \sum_{i,j} \kappa_{ij} \|\phi(x_i; \omega, h_i) - \phi(x_j; \omega, h_j)\|^2 + \lambda \sum_{i=1}^{N} \max\Big(0,\ 1 - \big(w^T \phi(x_i; \omega, h_i) + b\big)\, y_i\Big)^2 \tag{10.5}$$

with

$$\kappa_{ij} = \frac{\exp\big(\alpha \|\phi(x_i; \omega, h_i) - \phi(x_j; \omega, h_j)\|^2\big)}{\sum_{i,j} \exp\big(\alpha \|\phi(x_i; \omega, h_i) - \phi(x_j; \omega, h_j)\|^2\big)}, \tag{10.6}$$

where $\alpha \ge 0$ is the parameter used to control the degree of approximation of the hard max operator. As $\alpha \to \infty$, the approximation in Eq. (10.5) becomes the model in Eq. (10.4). Specifically, when $\alpha = 0$, $\kappa_{ij} = 1/N^2$, and the relaxed loss function can be reformulated as

$$L_3 = \frac{1}{2}\|w\|^2 + 2\eta \sum_{i} \|\phi(x_i; \omega, h_i) - \bar{\phi}_\omega\|^2 + \lambda \sum_{i=1}^{N} \max\Big(0,\ 1 - \big(w^T \phi(x_i; \omega, h_i) + b\big)\, y_i\Big)^2 \tag{10.7}$$

with

$$\bar{\phi}_\omega = \frac{1}{N} \sum_{i} \phi(x_i; \omega, h_i). \tag{10.8}$$
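The behaviour of the softmax weights in Eq. (10.6) at the two extremes of $\alpha$ can be checked numerically. The sketch below uses made-up two-dimensional features; the function names are ours. For a large $\alpha$ the weighted sum approaches the maximum pairwise distance of Eq. (10.3). For $\alpha = 0$ all weights equal $1/N^2$ and the weighted sum equals $(2/N)\sum_i \|\phi_i - \bar{\phi}\|^2$; Eq. (10.7) writes the coefficient as $2\eta$, which matches if the constant $1/N$ is absorbed into $\eta$ (our reading, since the text does not state this).

```python
import math


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def pairwise_sq_dists(features):
    return [[sq_dist(u, v) for v in features] for u in features]


def softmax_weights(dists, alpha):
    """kappa_ij of Eq. (10.6), normalized over all ordered pairs (i, j)."""
    top = max(max(row) for row in dists)
    # Subtracting the largest distance leaves the ratios unchanged
    # and keeps exp() from overflowing for large alpha.
    expd = [[math.exp(alpha * (d - top)) for d in row] for row in dists]
    total = sum(sum(row) for row in expd)
    return [[e / total for e in row] for row in expd]


def soft_radius_sq(features, alpha):
    """Softmax-weighted sum of squared pairwise distances."""
    dists = pairwise_sq_dists(features)
    kappa = softmax_weights(dists, alpha)
    return sum(k * d
               for k_row, d_row in zip(kappa, dists)
               for k, d in zip(k_row, d_row))


def mean_centered_sum(features):
    """Sum of squared distances to the mean feature, cf. Eq. (10.8)."""
    n = len(features)
    mean = [sum(col) / n for col in zip(*features)]
    return sum(sq_dist(f, mean) for f in features)


features = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
hard_max = max(max(row) for row in pairwise_sq_dists(features))
near_max = soft_radius_sq(features, alpha=50.0)   # close to hard_max
uniform = soft_radius_sq(features, alpha=0.0)     # kappa_ij = 1 / N^2
centered = 2.0 / len(features) * mean_centered_sum(features)
```

In this toy case `near_max` is numerically indistinguishable from `hard_max`, and `uniform` coincides with `centered`. Intermediate values of $\alpha$ interpolate between an outlier-sensitive radius and a variance-like spread.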

The optimization objectives in Eqs. (10.5) and (10.7) are two relaxed versions of the loss of our deep model with the strict radius-margin bound in Eq. (10.1). The derivatives of the relaxed losses with respect to $\omega$ are easy to compute, and the models can be readily solved via SGD, which will be discussed in detail in Sect. 10.4.

10.3 Implementation

In this section, we first explain the implementation that makes our model adaptive to an alterable temporal structure and then describe the detailed setting of our deep architecture.

10.3.1 Latent Temporal Structure

During our learning and inference procedures, we search for the appropriate latent variables that determine the temporal decomposition of the input video (i.e., the decomposition of activities). There are two parameters relating to the latent variables in our model: the number $M$ of video segments and the temporal length $m$ of each segment. Note that the subactivities decomposed by our model have no precise definition in a complex activity, i.e., actions can be ambiguous depending on the temporal scale being considered. To incorporate the latent temporal structure, we associate the latent variables with the neurons (i.e., convolutional responses) in the bottom layer of the spatiotemporal CNNs.

The choice of the number of segments $M$ is important for the performance of 3D human activity recognition. A model with a small $M$ could be less expressive in addressing temporal variations, while a large $M$ could lead to overfitting due to high complexity. Furthermore, when $M = 1$, the latent structure of the model is disabled, and our architecture degenerates to the conventional 3D-CNNs [13]. By referring to the setting of the number of parts for the deformable part-based model [15] in object detection, the value $M$ can be set by cross-validation on a small set. In all our experiments, we set $M = 4$.

Considering that the number of frames of the input videos is diverse, we develop a process to normalize the inputs by two-step sampling in the learning and inference procedure. First, we sample 30 anchor frames uniformly from the input video. Based on these anchor frames, we search for all possible nonoverlapping temporal segmentations, and the anchor frame segmentation corresponds to the segmentation of the input video. Then, from each video segment (indicating a subactivity), we uniformly sample $m$ frames to feed the neural networks, and in our experiments, we set $m = 9$. In addition, we reject the possible segmentations that cannot offer $m$ frames for any video segment. For an input video, the number of possible temporal structure variations (i.e., the number of enumerated anchor frame segmentations) is 115 in our experiments.
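The two-step procedure can be sketched as below. This is only an illustration under assumptions of our own: segments are taken to be contiguous and to cover all anchor frames, and `min_anchors` is a hypothetical lower bound on the number of anchors per segment that stands in for the rejection rule. The text does not give the exact constraints that yield 115 candidates, so the sketch does not try to reproduce that number.

```python
def uniform_anchor_frames(num_frames, num_anchors=30):
    """Step 1: indices of anchor frames spread evenly over the video."""
    if num_frames < num_anchors:
        raise ValueError("video is shorter than the number of anchors")
    return [(k * (num_frames - 1)) // (num_anchors - 1)
            for k in range(num_anchors)]


def enumerate_segmentations(num_anchors, num_segments, min_anchors):
    """Step 2: yield tuples of segment lengths, counted in anchor frames.

    Every tuple has num_segments entries, each at least min_anchors,
    and the entries sum to num_anchors.
    """
    def rec(remaining, parts_left):
        if parts_left == 1:
            if remaining >= min_anchors:
                yield (remaining,)
            return
        # Leave enough anchors for the segments that still follow.
        longest = remaining - min_anchors * (parts_left - 1)
        for length in range(min_anchors, longest + 1):
            for rest in rec(remaining - length, parts_left - 1):
                yield (length,) + rest

    yield from rec(num_anchors, num_segments)


anchors = uniform_anchor_frames(num_frames=300)
candidates = list(enumerate_segmentations(30, num_segments=4, min_anchors=6))
```

Each candidate tuple fixes the boundaries of the $M$ segments; mapping the boundaries back through `anchors` gives the start frame and length of every segment in the original video, from which $m$ frames are then sampled.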

10.3.2 Architecture of Deep Neural Networks

The proposed spatiotemporal CNN architecture is constructed by stacking two 3D convolutional layers, one 2D convolutional layer, and one fully connected layer, and the max-pooling operator is deployed after each 3D convolutional layer. Below, we introduce the definitions and implementations of these components of our model.

Fig. 10.5 Illustration of the 3D convolutions across both spatial and temporal domains. In this example, the temporal dimension of the 3D kernel is 3, and each feature map is thus obtained by performing 3D convolutions across 3 adjacent frames

3D Convolutional Layer. The 3D convolutional operation is adopted to perform convolutions spanning both spatial and temporal dimensions to characterize both appearance and motion features [13]. Suppose $p$ is the input video segment with width $w$, height $h$, and number of frames $m$, and $\omega$ is the 3D convolutional kernel with width $w'$, height $h'$, and temporal length $m'$. As shown in Fig. 10.5, a feature map $v$ can be obtained by performing 3D convolutions from the $s$th to the $(s + m' - 1)$th frames, where the response for the position $(x, y, s)$ in the feature map is defined as

$$v_{xys} = \tanh\Big(b + \sum_{i=0}^{w'-1} \sum_{j=0}^{h'-1} \sum_{k=0}^{m'-1} \omega_{ijk} \cdot p_{(x+i)(y+j)(s+k)}\Big), \tag{10.9}$$

where $p_{(x+i)(y+j)(s+k)}$ denotes the pixel value of the input video $p$ at position $(x + i, y + j)$ in the $(s + k)$th frame, $\omega_{ijk}$ denotes the value of the convolutional kernel $\omega$ at position $(i, j, k)$, $b$ stands for the bias, and tanh denotes the hyperbolic tangent function. Thus, given $p$ and $\omega$, $m - m' + 1$ feature maps can be obtained, each with a size of $(w - w' + 1, h - h' + 1)$.

Based on the 3D convolutional operation, a 3D convolutional layer is designed for spatiotemporal feature extraction by considering the following three issues:

• Number of convolutional kernels. The feature maps generated by one convolutional kernel are limited in capturing appearance and motion information. To generate more types of features, several kernels are employed in each convolutional layer. We define the number of 3D convolutional kernels in the first layer as $c_1$. After the first 3D convolutions, we obtain $c_1$ sets of $m - m' + 1$ feature maps. Then, we use $c_2$ 3D convolutional kernels on the $c_1$ sets of feature maps and obtain $c_1 \times c_2$ sets of feature maps after the second 3D convolutional layer.
• Decompositional convolutional networks. Our deep model consists of $M$ subnetworks, and the input video segment for each subnetwork involves $m$ frames (the later frames might be unavailable). In the proposed architecture, all of the subnetworks use the same structure, but each has its own convolutional kernels. For example, the kernels belonging to the first subnetwork are deployed only to perform convolutions on the first temporal video segment. Thus, each subnetwork generates specific feature representations for one subactivity.
• Application to gray-depth video. The RGB images are first converted to gray-level images, and the gray-depth video is then adopted as the input to the neural networks. The 3D convolutional kernels in the first layer are applied to both the gray channel and the depth channel in the video, and the convolutional results of these two channels are further aggregated to produce the feature maps. Note that the dimensions of the features remain the same as those from only one channel.

In our implementation, the input frame is scaled with height $h = 80$ and width $w = 60$. In the first 3D convolutional layer, the number of 3D convolutional kernels is $c_1 = 7$, and the size of the kernel is $w' \times h' \times m' = 9 \times 7 \times 3$. In the second layer, the number of 3D convolutional kernels is $c_2 = 5$, and the size of the kernel is $w' \times h' \times m' = 7 \times 7 \times 3$. Thus, we have 7 sets of feature maps after the first 3D convolutional layer and obtain $7 \times 5$ sets of feature maps after the second 3D convolutional layer.

Max-pooling Operator. After each 3D convolution, the max-pooling operation is introduced to enhance the deformation and shift invariance [20]. Given a feature map with a size of $a_1 \times a_2$, a $d_1 \times d_2$ max-pooling operator is performed by taking the maximum of every nonoverlapping $d_1 \times d_2$ subregion of the feature map, resulting in an $a_1/d_1 \times a_2/d_2$ pooled feature map. In our implementation, a $3 \times 3$ max-pooling operator was applied after every 3D convolutional layer. After two layers of 3D convolutions and max-pooling, for each subnetwork, we have $7 \times 5$ sets of $6 \times 4 \times 5$ feature maps.

2D Convolutional Layer. After two layers of 3D convolutions followed by max-pooling, a 2D convolution is employed to further extract higher level complex features. The 2D convolution can be viewed as a special case of 3D convolution with $m' = 1$, which is defined as

$$v_{xy} = \tanh\Big(b + \sum_{i=0}^{w'-1} \sum_{j=0}^{h'-1} \omega_{ij} \cdot p_{(x+i)(y+j)}\Big), \tag{10.10}$$

where $p_{(x+i)(y+j)}$ denotes the pixel value of the feature map $p$ at position $(x + i, y + j)$, $\omega_{ij}$ denotes the value of the convolutional kernel $\omega$ at position $(i, j)$, and $b$ denotes the bias. In the 2D convolutional layer, if the number of 2D convolutional kernels is $c_3$, then $c_1 \times c_2 \times c_3$ sets of new feature maps are obtained by performing 2D convolutions on $c_1 \times c_2$ sets of feature maps generated by the second 3D convolutional layer. In our implementation, the number of 2D convolutional kernels is set as $c_3 = 4$ with a kernel size of $6 \times 4$. Hence, for each subnetwork, we can obtain 700 feature maps with a size of $1 \times 1$.

Fully Connected Layer. There is only one fully connected layer with 64 neurons in our architecture. All these neurons connect to a vector of $700 \times 4 = 2800$ dimensions, which is generated by concatenating the feature maps from all the subnetworks. Because the training data are insufficient, and a large number of parameters (i.e., 179,200) exist in this fully connected layer, we adopt the commonly used dropout trick with a 0.6 rate to prevent overfitting. The margin-based classifier is defined based on the output of the fully connected layer, where we adopt the squared hinge loss to predict the activity categories as

$$\theta(z) = \arg\max_{i}\ (w_i^T z + b_i), \tag{10.11}$$

where $z$ is the 64-dimensional vector from the fully connected layer, and $\{w_i, b_i\}$ denotes the weight and bias connected to the $i$th activity category.
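The feature-map sizes quoted in this subsection can be traced with a few lines of arithmetic. The sketch below assumes "valid" convolutions (output size = input size minus kernel size plus one) and nonoverlapping spatial pooling, as described above. It also assumes that the 9-pixel side of the first kernel runs along the 80-pixel dimension and the 7-pixel side along the 60-pixel dimension; this pairing is our assumption, chosen because it reproduces the stated $6 \times 4 \times 5$ maps.

```python
def conv_out(size, kernel):
    """Output length of a 'valid' convolution along one dimension."""
    return size - kernel + 1


def trace_subnetwork(height=80, width=60, frames=9):
    """Return the (height, width, time) size after each stage."""
    # First 3D convolution (9 x 7 x 3), then 3 x 3 spatial max-pooling.
    h, w, t = conv_out(height, 9), conv_out(width, 7), conv_out(frames, 3)
    h, w = h // 3, w // 3
    after_first = (h, w, t)
    # Second 3D convolution (7 x 7 x 3), then 3 x 3 spatial max-pooling.
    h, w, t = conv_out(h, 7), conv_out(w, 7), conv_out(t, 3)
    h, w = h // 3, w // 3
    after_second = (h, w, t)
    # 2D convolution (6 x 4) applied to every remaining temporal slice.
    h, w = conv_out(h, 6), conv_out(w, 4)
    after_2d = (h, w, t)
    return after_first, after_second, after_2d


c1, c2, c3 = 7, 5, 4          # kernels per layer, as given in the text
num_subnetworks = 4           # M
fc_neurons = 64

stages = trace_subnetwork()
maps_per_subnetwork = c1 * c2 * c3 * stages[2][2]
fc_inputs = maps_per_subnetwork * num_subnetworks
fc_weights = fc_inputs * fc_neurons
```

Under these assumptions the trace gives $24 \times 18 \times 7$ after the first stage, $6 \times 4 \times 5$ after the second, and $1 \times 1 \times 5$ after the 2D convolution, so each subnetwork contributes $7 \times 5 \times 4 \times 5 = 700$ values and the fully connected layer has $2800 \times 64 = 179{,}200$ weights, in agreement with the figures in the text.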

10.4 Learning Algorithm

The proposed deep structured model involves three components to be optimized: (i) the latent variables $H$ that manipulate the activity decomposition, (ii) the margin-based classifier $\{w, b\}$, and (iii) the CNN parameters $\omega$. The latent variables are not continuous and need to be estimated adaptively for different input videos, making the standard backpropagation algorithm [8] unsuitable for our deep model. In this section, we present a joint component learning algorithm that iteratively optimizes the three components. Moreover, to overcome the problem of insufficient 3D data, we propose to borrow a large number of 2D videos to pretrain the CNN parameters in advance.

10.4.1 Joint Component Learning

Let $(X, Y) = \{(x_1, y_1), \dots, (x_N, y_N)\}$ denote the training set with $N$ examples, where $x_i$ is a video and $y_i \in \{1, \dots, C\}$ denotes its activity category. Denote $H = \{h_1, \dots, h_N\}$ as the set of latent variables for all training examples. The model parameters to be optimized can be divided into three groups, i.e., $H$, $\{w, b\}$, and $\omega$. Fortunately, given any two groups of parameters, the other group of parameters can be efficiently learned using either the stochastic gradient descent (SGD) algorithm (e.g., for $\{w, b\}$ and $\omega$) or enumeration (e.g., for $H$). Thus, we conduct the joint component learning algorithm by iteratively updating the three groups of parameters with three steps:

(i) given the model parameters $\{w, b\}$ and $\omega$, we estimate the latent variables $h_i$ for each video and update the corresponding feature $\phi(x_i; \omega, h_i)$ (Fig. 10.6a);
(ii) given the updated features $\phi(X; \omega, H)$, we adopt SGD to update the max-margin classifier $\{w, b\}$ (Fig. 10.6b); and
(iii) given the model parameters $\{w, b\}$ and $H$, we employ SGD to update the CNN parameters $\omega$, which will lead to both an increase in the margin and a decrease in the radius (Fig. 10.6c).

Fig. 10.6 Illustration of our joint component learning algorithm, which iteratively performs in three steps: (a) given the classification parameters {w, b} and the CNN parameters ω, we estimate the latent variables h_i for each video and generate the corresponding feature φ(x_i; ω, h_i); (b) given the updated features φ(X; ω, H) for all training examples, the classifier {w, b} is updated via SGD; and (c) given {w, b} and H, backpropagation updates the CNN parameters ω

It is worth mentioning that steps (ii) and (iii) can be performed in the same SGD procedure; i.e., their parameters are jointly updated. Below, we explain in detail the three steps for minimizing the losses in Eqs. (10.5) and (10.7), which are derived from our deep model.

(i) Given the model parameters $\omega$ and $\{w, b\}$, for each sample $(x_i, y_i)$, the most appropriate latent variables $h_i$ can be determined by exhaustive searching over all possible choices,

$$h_i^* = \arg\min_{h_i} \Big(1 - \big(w^T \phi(x_i; \omega, h_i) + b\big)\, y_i\Big). \tag{10.12}$$

GPU programming is employed to accelerate the search process. With the updated latent variables, we further obtain the feature set $\phi(X; \omega, H)$ of all the training data.

(ii) Given $\phi(X; \omega, H)$ and the CNN parameters $\omega$, batch stochastic gradient descent (SGD) is adopted to update the model parameters in Eqs. (10.5) and (10.7). In iteration $t$, a batch $B_t \subset (X, Y, H)$ of $k$ samples is chosen. We can obtain the gradients of the max-margin classifier with respect to parameters $\{w, b\}$,

$$\frac{\partial L}{\partial w} = w - 2\lambda \sum_{(x_i, y_i, h_i) \in B_t} y_i\, \phi(x_i; \omega, h_i) \max\Big(0,\ 1 - \big(w^T \phi(x_i; \omega, h_i) + b\big)\, y_i\Big), \tag{10.13}$$

$$\frac{\partial L}{\partial b} = -2\lambda \sum_{(x_i, y_i, h_i) \in B_t} y_i \max\Big(0,\ 1 - \big(w^T \phi(x_i; \omega, h_i) + b\big)\, y_i\Big), \tag{10.14}$$

where $L$ can be either the loss $L_2$ or the loss $L_3$.

(iii) Given the latent variables $H$ and the max-margin classifier $\{w, b\}$, based on the gradients with respect to $\omega$, the backpropagation algorithm can be adopted to learn the CNN parameters $\omega$. To minimize $L_2$ in Eq. (10.5), we first update the weights $\kappa_{ij}$ in Eq. (10.6) based on $\phi(X; \omega, H)$ and then introduce the variables $\kappa_i$ and $\phi_i$,

$$\kappa_i = \sum_{j} \kappa_{ij}, \tag{10.15}$$

$$\phi_i = \sum_{j} \kappa_{ij}\, \phi(x_j; \omega, h_j). \tag{10.16}$$

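With $\kappa_i$ and $\phi_i$, based on batch SGD, the derivative of the spatiotemporal CNNs is

$$\frac{\partial L_2}{\partial \omega} = 4\eta \sum_{(x_i, y_i, h_i) \in B_t} \big(\kappa_i\, \phi(x_i; \omega, h_i) - \phi_i\big)^T \frac{\partial \phi(x_i; \omega, h_i)}{\partial \omega} - 2\lambda \sum_{(x_i, y_i, h_i) \in B_t} y_i\, w^T \frac{\partial \phi(x_i; \omega, h_i)}{\partial \omega} \max\Big(0,\ 1 - \big(w^T \phi(x_i; \omega, h_i) + b\big)\, y_i\Big). \tag{10.17}$$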

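When $\alpha = 0$, we first update the mean $\bar{\phi}_\omega$ in Eq. (10.8) based on $\phi(X; \omega, H)$ and then compute the derivative of the relaxed loss in Eq. (10.7) as

$$\frac{\partial L_3}{\partial \omega} = 4\eta \sum_{(x_i, y_i, h_i) \in B_t} \big(\phi(x_i; \omega, h_i) - \bar{\phi}_\omega\big)^T \frac{\partial \phi(x_i; \omega, h_i)}{\partial \omega} - 2\lambda \sum_{(x_i, y_i, h_i) \in B_t} w^T y_i\, \frac{\partial \phi(x_i; \omega, h_i)}{\partial \omega} \max\Big(0,\ 1 - \big(w^T \phi(x_i; \omega, h_i) + b\big)\, y_i\Big). \tag{10.18}$$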

By performing the backpropagation algorithm, we can further decrease the relaxed loss and optimize the model parameters. Note that during backpropagation, batch SGD is adopted to update the parameters, and the update stops when it runs through all the training samples once. The optimization algorithm iterates between these three steps until convergence.
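The classifier gradients used in step (ii) can be checked against finite differences. The sketch below keeps only the terms of the loss that depend on $\{w, b\}$, i.e., $\frac{1}{2}\|w\|^2$ plus $\lambda$ times the summed squared hinge terms; differentiating the squared hinge gives a factor of $2\lambda$ in both gradients. Labels are assumed to be $\pm 1$, as the form of the hinge term suggests, and all numeric values are made up.

```python
def hinge_terms(w, b, feats, labels):
    """max(0, 1 - (w^T phi_i + b) * y_i) for every sample."""
    return [max(0.0, 1.0 - (sum(wk * fk for wk, fk in zip(w, f)) + b) * y)
            for f, y in zip(feats, labels)]


def classifier_loss(w, b, feats, labels, lam):
    """Part of the relaxed loss that depends on the classifier {w, b}."""
    reg = 0.5 * sum(wk * wk for wk in w)
    return reg + lam * sum(t * t for t in hinge_terms(w, b, feats, labels))


def classifier_grads(w, b, feats, labels, lam):
    """Analytic gradients of classifier_loss with respect to w and b."""
    terms = hinge_terms(w, b, feats, labels)
    grad_w = list(w)          # derivative of 0.5 * ||w||^2
    grad_b = 0.0
    for f, y, t in zip(feats, labels, terms):
        for k in range(len(w)):
            grad_w[k] -= 2.0 * lam * y * f[k] * t
        grad_b -= 2.0 * lam * y * t
    return grad_w, grad_b


# Toy batch: three 2-D features with labels in {-1, +1}.
feats = [[1.0, 2.0], [-1.0, 0.5], [0.3, -0.7]]
labels = [1, -1, 1]
w, b, lam = [0.2, -0.1], 0.05, 0.5

grad_w, grad_b = classifier_grads(w, b, feats, labels, lam)

# One gradient step with a hypothetical learning rate.
rate = 0.01
w_new = [wk - rate * gk for wk, gk in zip(w, grad_w)]
b_new = b - rate * grad_b
```

For a small enough learning rate, the step lowers `classifier_loss` on this batch. The update of $\omega$ follows the same pattern, with the chain rule through $\partial\phi/\partial\omega$ supplied by backpropagation.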

Algorithm 2: Learning Algorithm
Input: The labeled 2D and 3D activity datasets and learning rates $\alpha_{w,b}$, $\alpha_\omega$.
Output: Model parameters $\{\omega, w, b\}$.
Initialization: Pretrain the spatiotemporal CNNs using the 2D videos.
Learning on the 3D video dataset:
repeat
  1. Estimate the latent variables $H$ for all samples by fixing model parameters $\{\omega, w, b\}$.
  2. Optimize $\{w, b\}$ given the CNN model parameters $\omega$ and the input sample segments indicated by $H$:
     2.1 Calculate $\phi(X; \omega, H)$ by forwarding the neural network with $\omega$.
     2.2 Optimize $\{w, b\}$ via:
         $w := w - \alpha_{w,b} \cdot \frac{\partial L}{\partial w}$ by Eq. (10.13);
         $b := b - \alpha_{w,b} \cdot \frac{\partial L}{\partial b}$ by Eq. (10.14);
  3. Optimize $\omega$ given $\{w, b\}$ and $H$:
     3.1 Calculate $\kappa_{ij}$, $\kappa_i$, and $\phi_i$ for $L_2$, or calculate $\bar{\phi}_\omega$ for $L_3$.
     3.2 Optimize the parameters $\omega$ of the spatiotemporal CNNs:
         $\omega := \omega - \alpha_\omega \cdot \frac{\partial L}{\partial \omega}$ by Eq. (10.17) or (10.18).
until $L$ in Eq. (10.5) or (10.7) converges.


10.4.2 Model Pretraining

Parameter pretraining followed by fine-tuning is an effective method of improving performance in deep learning, especially when the training data are scarce. In the literature, there are two popular solutions, i.e., unsupervised pretraining on unlabeled data [28] and supervised pretraining for an auxiliary task [23]. The latter usually requires that the data format (e.g., image) for parameter pretraining be exactly the same as that (e.g., image) for fine-tuning.

In our approach, we suggest an alternative solution for 3D human activity recognition. Although collecting RGB-D videos of human activities is expensive, a large number of 2D activity videos can be easily obtained. Consequently, we first apply the supervised pretraining using a large number of 2D activity videos and then fine-tune the CNN parameters to train the 3D human activity models.

In the pretraining step, the CNN parameters are randomly initialized at the beginning. We segment each input 2D video equally into $M$ parts without estimating its latent variables. Because annotated 2D activity videos are abundant, we simply employ the softmax classifier with the CNNs and learn the parameters using the backpropagation method.

The 3D and 2D convolutional kernels obtained in pretraining are only for the gray channel. Thus, after pretraining, we duplicate the dimension of the 3D convolutional kernels in the first layer and initialize the parameters of the depth channel by the parameters of the gray channel, which allows us to borrow the features learned from the 2D videos while directly learning the higher level information from the specific 3D activity dataset. For the fully connected layer, we set its parameters as random values. We summarize the overall learning procedure in Algorithm 2.
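The initialization of the depth channel can be illustrated with plain nested lists standing in for kernel tensors. This is a sketch only: the function name `add_depth_channel` and the dictionary layout are hypothetical, and a real implementation would operate on the weight arrays of whichever framework is used.

```python
import copy


def add_depth_channel(gray_kernels):
    """Build first-layer kernels for gray-depth input.

    gray_kernels holds the pretrained first-layer 3D kernels for the
    gray channel. The depth channel starts from a copy of the same
    values, so both channels can diverge during fine-tuning.
    """
    return {
        "gray": copy.deepcopy(gray_kernels),
        "depth": copy.deepcopy(gray_kernels),
    }


# One tiny 2 x 2 x 1 kernel with made-up pretrained values.
pretrained = [[[0.1], [0.2]], [[0.3], [0.4]]]
first_layer = add_depth_channel(pretrained)
```

Using independent copies matters: the two channels share only their starting point, and subsequent updates to the depth kernels leave the gray kernels untouched.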

10.4.3 Inference

Given an input video $x_i$, the inference task aims to recognize the category of the activity, which can be formulated as the maximization of $F_y(x_i, \omega, h)$ with respect to the activity label $y$ and the latent variables $h$,

$$(y^*, h^*) = \arg\max_{(y,h)} \{ F_y(x_i, \omega, h) = w_y^T \phi(x_i; \omega, h) + b_y \}, \tag{10.19}$$

where $\{w_y, b_y\}$ denotes the parameters of the max-margin classifier for the activity category $y$. Note that the possible values for $y$ and $h$ are discrete. Thus, the problem above can be solved by searching across all the labels $y$ ($1 \le y \le C$) and calculating the maximum $F_y(x_i, \omega, h)$ by optimizing $h$. To find the maximum of $F_y(x_i, \omega, h)$, we enumerate all possible values of $h$ and calculate the corresponding $F_y(x_i, \omega, h)$ via forward propagation. Because the forward propagations decided by different $h$ are independent, we can parallelize the computation via GPU to accelerate the inference process.
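The exhaustive search in Eq. (10.19) can be sketched as follows. This is a minimal illustration, assuming the features $\phi(x_i; \omega, h)$ have already been computed by one forward pass per candidate $h$ and stacked into an array; the function name `infer` and the array layout are hypothetical.

```python
import numpy as np


def infer(features_by_h, W, b):
    """Return (y_star, h_index, score) maximizing w_y^T phi + b_y.

    features_by_h: (num_h, D) array; row k is phi(x; omega, h_k) for the
                   k-th enumerated latent configuration (assumed precomputed).
    W:             (C, D) array of classifier weights, one row per label y.
    b:             (C,) array of biases.
    """
    # scores[k, y] = F_y(x, omega, h_k); all candidates are scored at once,
    # which mirrors the independence of the forward passes.
    scores = features_by_h @ W.T + b
    k, y = np.unravel_index(np.argmax(scores), scores.shape)
    return int(y), int(k), float(scores[k, y])


# Tiny example with 2 latent configurations, 3 labels, 2-D features.
features = np.array([[1.0, 0.0],
                     [0.0, 2.0]])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.0, 0.5, -1.0])
result = infer(features, W, b)
print(result)
```

Here labels and latent configurations are indexed from 0; the second configuration with the second label attains the highest score.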

10.5 Experiments

To validate the advantages of our model, experiments are conducted on several challenging public datasets, i.e., the CAD-120 dataset [21], the SBU Kinect Interaction dataset [29], and a larger dataset newly created by us, namely, the Office Activity (OA) dataset. Moreover, we introduce a more comprehensive dataset in our experiments by combining five existing datasets of RGB-D human activity. In addition to demonstrating the superior performance of the proposed model compared to other state-of-the-art methods, we extensively evaluate the main components of our framework.

10.5.1 Datasets and Setting

The CAD-120 dataset comprises 120 RGB-D video sequences of humans performing long daily activities in 10 categories and has been widely used to test 3D human activity recognition methods. These activities, recorded via the Microsoft Kinect sensor, were performed by four different subjects, and each activity was repeated three times by the same actor. Each activity has a long sequence of subactivities, which vary significantly from subject to subject in terms of length, order, and the way the task is executed. The challenges of this dataset also lie in the large variance in object appearance, human pose, and viewpoint. Several sampled frames and depth maps from these 10 categories are exhibited in Fig. 10.7a.

The SBU dataset consists of 8 categories of two-person interaction activities, including a total of approximately 300 RGB-D video sequences, i.e., approximately 40 sequences for each interaction category. Although most interactions in this dataset are simple, it is still challenging to model two-person interactions by considering the following difficulties: (i) one person is acting, and the other person is reacting in most cases; (ii) the average frame length of these interactions is short (ranging from 20 to 40 s), and (iii) the depth maps have noise. Figure 10.7b shows several sampled frames and depth maps of these 8 categories.

The proposed OA dataset is more comprehensive and challenging than the existing datasets, and it covers regular daily activities that take place in an office. To the best of our knowledge, it is the largest activity dataset of RGB-D videos, consisting of 1180 sequences. The OA database is publicly accessible (http://vision.sysu.edu.cn/projects/3d-activity/). Three RGB-D sensors (i.e., Microsoft Kinect cameras) are utilized to capture data from different viewpoints, and more than 10 actors are involved. The activities are captured in two different offices to increase variability, and each actor performs the same activity twice. Activities performed by two subjects who interact are also included. Specifically, the dataset is divided into two subsets, each of which contains 10 categories of activities: OA1 (complex activities by a single subject) and OA2 (complex interactions by two subjects). Several sampled frames and depth maps from OA1 and OA2 are shown in Fig. 10.7c, d, respectively.

Fig. 10.7 Activity examples from the testing databases. Several sampled frames and depth maps are presented. a CAD-120, b SBU, c OA1, and d OA2 show two activities of the same category selected from the three databases

10.5.2 Empirical Analysis

Empirical analysis is used to assess the main components of the proposed deep structured model, including the latent structure, relaxed radius-margin bound, model pretraining, and depth/grayscale channel. Several variants of our method are suggested by enabling/disabling certain components. Specifically, we denote the conventional 3D convolutional neural network with the softmax classifier as Softmax + CNN, the 3D CNN with the SVM classifier as SVM + CNN, and the 3D CNN with the relaxed radius-margin bound classifier as R-SVM + CNN. Analogously, we refer to our deep model as LCNN and then define Softmax + LCNN, SVM + LCNN, and R-SVM + LCNN accordingly.

Latent Model Structure. In this experiment, we implement a simplified version of our model by removing the latent structure and compare it with our full model. The simplified model is actually a spatiotemporal CNN model with both 3D and 2D convolutional layers, and this model uniformly segments the input video into M subactivities. Without the latent variables to be estimated, the standard backpropagation algorithm is employed for model training. We execute this experiment on the CAD120 dataset. Figure 10.8 shows the test error rates with different iterations of the simplified model (i.e., R-SVM + CNN) and the full version (i.e., R-SVM + LCNN). Based on the results, we observe that our full model outperforms the simplified model in both error rate and training efficiency. Furthermore, the structured models with model pretraining, i.e., Softmax + LCNN, SVM + LCNN, R-SVM + LCNN, achieve 14.4%/11.1%/12.4% better performance than the traditional CNN models, i.e., Softmax + CNN, SVM + CNN, R-SVM + CNN, respectively. The results clearly demonstrate the significance of incorporating the latent temporal structure to address the large temporal variations in human activities.

Fig. 10.8 Test error rates with/without incorporating the latent structure into the deep model. The solid curve represents the deep model trained by the proposed joint component learning method, and the dashed curve represents the traditional training method (i.e., using standard backpropagation). (Plot: test error rate, 0 to 1, versus train iteration, 0 to 300; curves: Reconfigurable CNN and CNN.)

Pretraining. To justify the effectiveness of pretraining, we discard the parameters trained on the 2D videos and learn the model directly on the grayscale-depth data. Then, we compare the test error rate of the models with/without pretraining. To analyze the rate of convergence, we adopt the R-SVM + LCNN framework and use the same learning rate settings for the models with and without pretraining for a fair comparison. Using the CAD120 dataset, we plot the test error rates with increasing iteration numbers during training in Fig. 10.9. The model using pretraining converges in 170 iterations, while the model without pretraining requires 300 iterations, and the model with pretraining converges to a much lower test error rate (9%) than that without pretraining (25%). Furthermore, we also compare the performance with/without pretraining using SVM + LCNN and R-SVM + LCNN. We find that pretraining is

effective in reducing the test error rate. Actually, the test error rate with pretraining is approximately 15% less than that without pretraining (Fig. 10.9).

Fig. 10.9 Test error rates with/without pretraining. (Plot: test error rate, 0 to 1, versus train iteration, 0 to 300; curves: with pretraining and without pretraining.)

Fig. 10.10 Confusion matrices of our proposed deep structured model on the a CAD120, b SBU, c OA1, and d OA2 datasets. It is evident that these confusion matrices all have a strong diagonal with few errors

Relaxed Radius-margin Bound. As described above, the training data for grayscale-depth human activity recognition are scarce. Thus, for the last fully connected layer, we adopt the SVM classifier by incorporating the relaxed radius-margin bound, resulting in the R-SVM + LCNN model. To justify the role of the relaxed radius-margin bound, Table 10.1 compares the accuracy of Softmax + LCNN, SVM + LCNN, and R-SVM + LCNN on all datasets with the same experimental settings.


Table 10.1 Average accuracy of all categories on four datasets with different classifiers

Dataset      Softmax + LCNN (%)   SVM + LCNN (%)   R-SVM + LCNN (%)
CAD120       82.7                 89.4             90.1
SBU          92.4                 92.8             94.0
OA1          60.7                 68.5             69.3
OA2          47.0                 53.7             54.5
Merged_50    30.3                 36.4             37.3
Merged_4     87.1                 88.5             88.9

Table 10.2 Channel analysis of the three datasets. Average accuracy of all categories is reported

Dataset      Grayscale (%)   Depth (%)   Grayscale + depth (%)
OA1          60.4            65.2        69.3
OA2          46.3            51.1        54.5
Merged_50    27.8            33.4        37.3
Merged_4     81.7            85.5        88.9

The max-margin-based classifiers (SVM and R-SVM) are particularly effective on small-scale datasets (CAD120, SBU, OA1, OA2, Merged_50) (Table 10.1, Fig. 10.10). On average, the accuracy of R-SVM + LCNN is 6.5% higher than that of Softmax + LCNN and is approximately 1% higher than that of SVM + LCNN. On the Merged_4 dataset, the improvement of R-SVM + LCNN is smaller: it is 1.8% higher than Softmax + LCNN. These results confirm our motivation to incorporate the radius-margin bound into our deep learning framework. Moreover, when the model is learned without pretraining, R-SVM + LCNN gains about 4% and 8% accuracy improvement over Softmax + LCNN and SVM + LCNN, respectively.

Channel Analysis. To evaluate the contribution of the grayscale and depth data, we execute the following experiment on the OA datasets: keeping only one data channel as input. Specifically, we first disable the depth channel and input the grayscale data to perform the training/testing and then disable the grayscale channel and employ the depth channel for training/testing. Table 10.2 shows that depth data can improve the performance by large margins, especially on OA1 and Merged_50. The reason is that large appearance variations exist in the grayscale data. In particular, our testing is performed on new subjects, which further increases the appearance variations. In contrast, the depth data are more reliable and have much smaller variations, which is helpful in capturing salient motion information.
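As a quick arithmetic check of the average gains quoted for Table 10.1 in the Relaxed Radius-margin Bound analysis, the per-dataset differences can be averaged directly. The dictionary below simply copies the table values; the variable and function names are chosen for this illustration.

```python
# Accuracies (%) copied from Table 10.1, ordered as
# (Softmax + LCNN, SVM + LCNN, R-SVM + LCNN).
table_10_1 = {
    "CAD120": (82.7, 89.4, 90.1),
    "SBU": (92.4, 92.8, 94.0),
    "OA1": (60.7, 68.5, 69.3),
    "OA2": (47.0, 53.7, 54.5),
    "Merged_50": (30.3, 36.4, 37.3),
    "Merged_4": (87.1, 88.5, 88.9),
}
small_scale = ["CAD120", "SBU", "OA1", "OA2", "Merged_50"]


def mean_gain(datasets, baseline_index):
    """Mean accuracy gain of R-SVM + LCNN over the chosen baseline column."""
    gains = [table_10_1[name][2] - table_10_1[name][baseline_index]
             for name in datasets]
    return sum(gains) / len(gains)


gain_over_softmax = mean_gain(small_scale, 0)
gain_over_svm = mean_gain(small_scale, 1)
merged4_gain = table_10_1["Merged_4"][2] - table_10_1["Merged_4"][0]
print(round(gain_over_softmax, 2), round(gain_over_svm, 2), round(merged4_gain, 1))
```

The averages come out at roughly 6.4 and 0.9 percentage points on the five small-scale datasets, in line with the approximate figures of 6.5% and 1% in the text, and the Merged_4 gain is 1.8 points.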


References

1. L. Lin, K. Wang, W. Zuo, M. Wang, J. Luo, L. Zhang, A deep structured model with radius-margin bound for 3D human activity recognition. Int. J. Comput. Vis. 118(2), 256–273 (2016)
2. L. Xia, C. Chen, J.K. Aggarwal, View invariant human action recognition using histograms of 3d joints, in CVPRW, pp. 20–27 (2012)
3. O. Oreifej, Z. Liu, Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences, in CVPR, pp. 716–723 (2013)
4. L. Xia, J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera, in CVPR, pp. 2834–2841 (2013)
5. J. Wang, Z. Liu, Y. Wu, J. Yuan, Mining actionlet ensemble for action recognition with depth cameras, in CVPR, pp. 1290–1297 (2012)
6. Y. Wang, G. Mori, Hidden part models for human action recognition: Probabilistic vs. max-margin. IEEE Trans. Pattern Anal. Mach. Intell. 33(7), 1310–1323 (2011)
7. J.M. Chaquet, E.J. Carmona, A. Fernandez-Caballero, A survey of video datasets for human action and activity recognition. Comput. Vis. Image Underst. 117(6), 633–659 (2013)
8. Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, L.D. Jackel et al., Handwritten digit recognition with a back-propagation network. Adv. Neural Inf. Process. Syst. (1990)
9. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
10. P. Wu, S. Hoi, H. Xia, P. Zhao, D. Wang, C. Miao, Online multimodal deep similarity learning with application to image retrieval, in ACM Multimedia, pp. 153–162 (2013)
11. P. Luo, X. Wang, X. Tang, Pedestrian parsing via deep decompositional neural network, in ICCV, pp. 2648–2655 (2013)
12. K. Wang, X. Wang, L. Lin, 3d human activity recognition with reconfigurable convolutional neural networks, in ACM MM (2014)
13. S. Ji, W. Xu, M. Yang, K. Yu, 3d convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013)
14. S. Zhu, D. Mumford, A stochastic grammar of images. Found. Trends Comput. Graph. Vis. 2(4), 259–362 (2007)
15. P.F. Felzenszwalb, R.B. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9), 1627–1645 (2010)
16. M.R. Amer, S. Todorovic, Sum-product networks for modeling activities with stochastic structure, in CVPR, pp. 1314–1321 (2012)
17. L. Lin, X. Wang, W. Yang, J.H. Lai, Discriminatively trained and-or graph models for object shape detection. IEEE Trans. Pattern Anal. Mach. Intell. 37(5), 959–972 (2015)
18. M. Pei, Y. Jia, S. Zhu, Parsing video events with goal inference and intent prediction, in ICCV, pp. 487–494 (2011)
19. A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with convolutional neural networks, in CVPR (2014)
20. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst., pp. 1097–1105 (2012)
21. H.S. Koppula, R. Gupta, A. Saxena, Learning human activities and object affordances from rgb-d videos. Int. J. Robot. Res. (IJRR) 32(8), 951–970 (2013)
22. F.J. Huang, Y. LeCun, Large-scale learning with svm and convolutional nets for generic object categorization, in CVPR, pp. 284–291 (2006)
23. R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)
24. V. Vapnik, Statistical Learning Theory (John Wiley and Sons, New York, 1998)
25. O. Chapelle, V. Vapnik, O. Bousquet, S. Mukherjee, Choosing multiple parameters for support vector machines. Mach. Learn. 46(1–3), 131–159 (2002)
26. H. Do, A. Kalousis, Convex formulations of radius-margin based support vector machines, in ICML (2013)
27. H. Do, A. Kalousis, M. Hilario, Feature weighting using margin and radius based error bound optimization in svms, in Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, vol. 5781 (Springer, Berlin Heidelberg, 2009), pp. 315–329
28. P. Sermanet, K. Kavukcuoglu, S. Chintala, Y. LeCun, Pedestrian detection with unsupervised multi-stage feature learning, in CVPR (2013)
29. K. Yun, J. Honorio, D. Chattopadhyay, T.L. Berg, D. Samaras, Two-person interaction detection using body-pose features and multiple instance learning, in Computer Vision and Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society Conference on, IEEE (2012)