Multimodal analytics for next-generation big data technologies and applications 9783319975979, 9783319975986

English Pages 391 Year 2019

Table of contents :
Intended Audience
Acknowledgments
Contents
List of Contributors
Part I: Introduction
1.1 Introduction
1.2 Sentiment, Affect, and Emotion Analytics for Big Multimodal Data
1.3 Unsupervised Learning Strategies for Big Multimodal Data
1.4 Supervised Learning Strategies for Big Multimodal Data
1.5 Multimodal Big Data Processing and Applications
Part II: Sentiment, Affect and Emotion Analysis for Big Multimodal Data
2.1 Introduction
2.2 Multimodal Affect Recognition
2.2.1 Information Fusion Techniques
2.3 Recent Results
2.3.1 Multimodal Sentiment Analysis
2.3.2 Multimodal Emotion Recognition
2.4 Proposed Method
2.4.1 Textual Features
2.4.3 Visual Features
2.4.5 Contextual LSTM Architecture
2.4.6 Fusion
2.5.2 Speaker-Independent Experiment
2.5.4 Generalizability of the Models
2.5.5 Visualization of the Datasets
2.6 Discussion
References
3.1 Introduction
3.2 Related Works
3.2.1 Text: Emotion and Sentiment Recognition from Textual Data
3.2.2 Image: Emotion Recognition from Visual Image Data
3.2.4 Video: Emotion and Sentiment Recognition from Video-Based Data
3.2.5 Sentiment Recognition from a Combination of Modalities
3.3.1 Data Preprocessing
3.3.2 Feature Extraction
3.4 Simulation
3.4.1 Results and Discussions
References
4.1 Introduction
4.3 Social Emotion Mining
4.4.1 Problem Definition
4.4.2 Network Architecture
4.4.3 Parameter Estimation
4.5 Experiments
4.5.1 Experiment Design
4.5.2 Results and Analysis
4.6 Conclusion
References
Part III: Unsupervised Learning Strategies for Big Multimodal Data
5.1 Introduction
5.2.1 Bi-clusters in Matrices
5.2.2 Bi-clustering Analysis Based on Matrix Decomposition
5.3 Co-clustering Analysis in Tensor Data
5.4.1 High-Order Singular Vector Decomposition
5.4.2 Canonical Polyadic Decomposition
5.5 Co-clustering in Tensors
5.5.1 Linear Grouping in Factor Matrices
5.6 Experiment Results
5.6.1 Noise and Overlapping Effects in Co-cluster Identification Using Synthetic Tensors
5.6.2 Co-clustering of Gene Expression Tensor in Cohort Study
5.7 Conclusion
References
6.1 Introduction
6.2 Bi-cluster Analysis of Data
6.2.1 Bi-cluster Patterns
6.2.2 Multi-objective Optimization
6.2.3 Bi-cluster Validation
6.3.1 AIS-Based Bi-clustering
6.3.2 GA-Based Bi-clustering
6.3.3 Multi-objective Bi-clustering
6.4 Multi-objective SPEA-Based Algorithm
6.5 Bi-clustering Experiments
6.5.1 Gene Expression Dataset
6.5.3 Facebook Dataset
References
7.1 Introduction
7.2 Low Rank Representation on Grassmann Manifolds
7.2.1 Low Rank Representation
7.2.2 Grassmann Manifolds
7.2.3 LRR on Grassmann Manifolds
7.2.4 LRR on Grassmann Manifolds with Gaussian Noise (GLRR-F)
7.2.5 LRR on Grassmann Manifolds with ℓ2/ℓ1 Noise (GLRR-21)
7.2.7 Examples on Video Datasets
7.3.1 An Improved LRR on Grassmann Manifolds
7.3.2 LRR on Grassmann Manifolds with Tangent Space Distance
7.4.1 Weighted Product Grassmann Manifolds
7.4.2 LRR on Weighted Product Grassmann Manifolds
7.4.3 Optimization
7.4.4 Experimental Results
7.5 Dimensionality Reduction for Grassmann Manifolds
7.5.2 LPP for Grassmann Manifolds
7.5.3 Objective Function
7.5.4 GLPP with Normalized Constraint
7.5.5 Optimization
7.5.6 Experimental Results
References
Part IV: Supervised Learning Strategies for Big Multimodal Data
8.1 Introduction
8.2.1 Multi-output Neural Network
8.2.2 Special Loss Function for Missing Observations
8.2.3 Weight Constraints by Special Norm Regularization
8.3 The Optimization Method
8.4.1 Simulation Study for Information Loss in Demand
8.4.2 Simulation Study in Norm Regularization
8.4.3 Empirical Study
8.4.4 Testing Error Without Insignificant Tasks
References
9.1 Introduction
9.2 Most Basic Recurrent Neural Networks
9.3 Long Short-Term Memory
9.4 Gated Recurrent Units
9.6 Nonlinear AutoRegressive eXogenous Inputs Networks
9.7 Echo State Network
9.8 Simple Recurrent Unit
9.9 TRNNs
9.9.1 Tensorial Recurrent Neural Networks
9.9.2 Loss Function
9.9.3 Recurrent BP Algorithm
9.10 Experimental Results
9.10.1 Empirical Study with International Relationship Data
9.10.2 Empirical Study with MSCOCO Data
9.10.3 Simulation Study
9.11 Conclusion
References
10.1 Introduction
10.2.1 Tensors and Our Notations
10.2.4 Coupled Matrix Tensor Factorization
10.3.1 Joint Analysis of Coupled Data
10.3.3 Distributed Factorization
10.4 SMF: Scalable Multimodal Factorization
10.4.2 Block Processing
10.4.3 N Copies of an N-mode Tensor Caching
10.4.4 Optimized Solver
10.4.5 Scaling Up to K Tensors
10.5.1 Observation Scalability
10.5.2 Machine Scalability
10.5.3 Convergence Speed
10.5.5 Optimization
References
Part V: Multimodal Big Data Processing and Applications
11.1 Introduction
11.2.1 Multimodal Visual Data Registration
11.2.2 Feature Detector and Descriptors
11.3.3 Videos
11.3.5 360 Cameras
11.4 System Overview
11.5.1 3D Feature Detector
11.5.2 3D Feature Descriptors
11.5.3 Description Domains
11.5.4 Multi-domain Feature Descriptor
11.5.5 Hybrid Feature Matching and Registration
11.6 Public Multimodal Database
11.7 Experiments
11.7.1 3D Feature Detector
11.7.2 Feature Matching and Registration
References
12.1 Introduction
12.2.1 Type-2 Fuzzy Sets
12.2.2 Fuzzy Logic Classification System Rules Generate
12.2.3 The Big Bang-Big Crunch Optimization Algorithm
12.3 The Proposed Type-2 Fuzzy Logic Scenes Classification System for Football Video in Future Big Data
12.3.1 Video Data and Feature Extraction
12.3.2 Type-1 Fuzzy Sets and Classification System
12.3.3 Type-2 Fuzzy Sets Creation and T2FLCS
12.4.2 Rule Base Optimization and Similarity
12.5 Experiments and Results
12.6 Conclusion
References
13.1 Introduction
13.2.2 Data Fusion Algorithms for Traffic Congestion Estimation
13.2.3 Data Fusion Architecture
13.3.1 Homogeneous Traffic Data Fusion
13.3.2 Heterogeneous Traffic Data Fusion
13.4.1 Measurement and Estimation Errors
13.5 Conclusion
References
14.1 Introduction
14.2 Background and Literature Review
14.2.1 Image Frameworks for Big Data
14.3.1 Inter-Frame Parallelism
14.3.2 Intra-Frame Parallelism
14.4 Parallel Processing of Images Using GPU
14.4.2 Implementation of Gaussian Mixture Model on GPU
14.4.3 Implementation of Morphological Image Operations on GPU
14.4.4 Implementation of Connected Component Labeling on GPU
14.4.5 Experimental Results
14.5 Conclusion
References
Chapter 15: Multimodal Approaches in Analysing and Interpreting Big Social Media Data
15.1 Introduction
15.2 Where Do We Start?
15.3 Multimodal Social Media Data and Visual Analytics
15.3.2 Emoji Analytics
15.3.3 Sentiment Analysis
15.4.1 Social Information Landscapes
15.4.2 Geo-Mapping Topical Data
15.4.3 GPU Accelerated Real-Time Data Processing
15.4.4 Navigating and Examining Social Information in VR
15.5 Conclusion
References



Kah Phooi Seng · Li-minn Ang · Alan Wee-Chung Liew · Junbin Gao
Editors

Multimodal Analytics for Next-Generation Big Data Technologies and Applications


Editors

Kah Phooi Seng
School of Engineering and Information Technology, University of New South Wales, Canberra, ACT, Australia

Li-minn Ang
School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia

Alan Wee-Chung Liew
School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia

Junbin Gao
The University of Sydney Business School, University of Sydney, Sydney, NSW, Australia

ISBN 978-3-319-97597-9
ISBN 978-3-319-97598-6 (eBook)
https://doi.org/10.1007/978-3-319-97598-6

© Springer Nature Switzerland AG 2019

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

To Grace Ang Yi-en, my blessed daughter, for all the joy you bring.
Kah Phooi Seng

To my parents and loved ones, for your unceasing support and love.
Li-minn Ang

To Alana and Nicholas, my two lovely children, who bring me joy and headache.
Alan Liew Wee Chung

To Zhi Chen, my lovely wife, and Henry Yu Gao, my amazing son, for your constant love and support.
Junbin Gao

Preface

The digital age brings modern data acquisition methods which allow the gathering of different types and modes of data in variety and volume (termed in this book as multimodal Big Data) to be used as multiple input sources into increasingly sophisticated computing systems deploying intelligent techniques and machine-learning abilities to find hidden patterns embedded in combined data. Some examples of different types of data modalities include text, speech, image, video, and sensor-based data. The data types could originate from various sources ranging from social media networks to wireless sensor systems and could be in the form of structured or unstructured data. Big Data techniques are targeted toward large system-level problems that cannot be solved by conventional methods and technologies.

The advantage of utilizing multimodal Big Data is that it facilitates a comprehensive view of information and knowledge as it allows data to be integrated, analyzed, modeled, and visualized from multiple facets and viewpoints. While the use of multimodal data gives increased information-rich content for information and knowledge processing, it leads to a number of additional challenges in terms of scalability, decision-making, data fusion, distributed architectures, and predictive analytics. Addressing these challenges requires new approaches for data collection, transmission, storage, and information processing from the multiple data sources.

The aim of this book is to provide a comprehensive guide to multimodal Big Data technologies and analytics and introduce the reader to the current state of multimodal Big Data information processing. We hope that the reader will share our enthusiasm in presenting this volume and will find it useful.

Intended Audience

The target audience for this book is very broad. It includes academic researchers, scientists, lecturers, and advanced and postgraduate students in various disciplines like engineering, information technology, and computer science. It could be an essential support for fellowship programs in Big Data research. This book is also intended for consultants, practitioners, and professionals with expertise in IT, engineering, and business intelligence who build decision support systems. The following groups will also benefit from the content of the book: scientists and researchers, academic and corporate libraries, lecturers and tutors, postgraduates, practitioners and professionals, undergraduates, etc.

ACT, Australia: Kah Phooi Seng
Gold Coast, Australia: Li-minn Ang
Gold Coast, Australia: Alan Wee-Chung Liew
Sydney, Australia: Junbin Gao

Acknowledgments

We express our deepest appreciation to all the people who have helped us in the completion of this book. We thank all the reviewers of the book for their tremendous service by critically reviewing the chapters and offering useful suggestions, particularly Ch’ng Sue Inn, Chew Li Wern, and Md Anisur Rahman. We gratefully acknowledge our Springer editor Ronan Nugent and the publication team, for their diligent efforts and support toward the publication of this book.

Kah Phooi Seng
Li-minn Ang
Alan Wee-Chung Liew
Junbin Gao

Contents

Part I Introduction

1 Multimodal Information Processing and Big Data Analytics in a Digital World
Kah Phooi Seng, Li-minn Ang, Alan Wee-Chung Liew, and Junbin Gao

Part II Sentiment, Affect and Emotion Analysis for Big Multimodal Data

2 Speaker-Independent Multimodal Sentiment Analysis for Big Data
Erik Cambria, Soujanya Poria, and Amir Hussain

3 Multimodal Big Data Affective Analytics
Nusrat Jahan Shoumy, Li-minn Ang, and D. M. Motiur Rahaman

4 Hybrid Feature-Based Sentiment Strength Detection for Big Data Applications
Yanghui Rao, Haoran Xie, Fu Lee Wang, Leonard K. M. Poon, and Endong Zhu

Part III Unsupervised Learning Strategies for Big Multimodal Data

5 Multimodal Co-clustering Analysis of Big Data Based on Matrix and Tensor Decomposition
Hongya Zhao, Zhenghong Wei, and Hong Yan

6 Bi-clustering by Multi-objective Evolutionary Algorithm for Multimodal Analytics and Big Data
Maryam Golchin and Alan Wee-Chung Liew

7 Unsupervised Learning on Grassmann Manifolds for Big Data
Boyue Wang and Junbin Gao

Part IV Supervised Learning Strategies for Big Multimodal Data

8 Multi-product Newsvendor Model in Multi-task Deep Neural Network with Norm Regularization for Big Data
Yanfei Zhang

9 Recurrent Neural Networks for Multimodal Time Series Big Data Analytics
Mingyuan Bai and Boyan Zhang

10 Scalable Multimodal Factorization for Learning from Big Data
Quan Do and Wei Liu

Part V Multimodal Big Data Processing and Applications

11 Big Multimodal Visual Data Registration for Digital Media Production
Hansung Kim and Adrian Hilton

12 A Hybrid Fuzzy Football Scenes Classification System for Big Video Data
Song Wei and Hani Hagras

13 Multimodal Big Data Fusion for Traffic Congestion Prediction
Taiwo Adetiloye and Anjali Awasthi

14 Parallel and Distributed Computing for Processing Big Image and Video Data
Praveen Kumar, Apeksha Bodade, Harshada Kumbhare, Ruchita Ashtankar, Swapnil Arsh, and Vatsal Gosar

15 Multimodal Approaches in Analysing and Interpreting Big Social Media Data
Eugene Ch’ng, Mengdi Li, Ziyang Chen, Jingbo Lang, and Simon See

List of Contributors

Taiwo Adetiloye: Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, Canada
Li-minn Ang: School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia
Ruchita Ashtankar: Visvesvaraya National Institute of Technology (VNIT), Nagpur, Maharashtra, India
Anjali Awasthi: Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, Canada
Mingyuan Bai: University of Sydney Business School, The University of Sydney, Darlington, NSW, Australia
Apeksha Bodade: Visvesvaraya National Institute of Technology (VNIT), Nagpur, Maharashtra, India
Erik Cambria: School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
Eugene Ch’ng: NVIDIA Joint-Lab on Mixed Reality, University of Nottingham Ningbo China, Ningbo, China
Ziyang Chen: NVIDIA Joint-Lab on Mixed Reality, University of Nottingham Ningbo China, Ningbo, China
Quan Do: Advanced Analytics Institute, University of Technology Sydney, Chippendale, NSW, Australia
Junbin Gao: The University of Sydney Business School, University of Sydney, Sydney, NSW, Australia
Maryam Golchin: School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia
Vatsal Gosar: Visvesvaraya National Institute of Technology (VNIT), Nagpur, Maharashtra, India
Hani Hagras: The Computational Intelligence Centre, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
Adrian Hilton: Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Amir Hussain: School of Natural Sciences, University of Stirling, Stirling, UK
Hansung Kim: Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Praveen Kumar: Visvesvaraya National Institute of Technology (VNIT), Nagpur, Maharashtra, India
Harshada Kumbhare: Visvesvaraya National Institute of Technology (VNIT), Nagpur, Maharashtra, India
Jingbo Lang: NVIDIA Joint-Lab on Mixed Reality, University of Nottingham Ningbo China, Ningbo, China
Alan Wee-Chung Liew: School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia
Mengdi Li: International Doctoral Innovation Centre, University of Nottingham Ningbo China, Ningbo, China
Wei Liu: Advanced Analytics Institute, University of Technology Sydney, Chippendale, NSW, Australia
Leonard K. M. Poon: Department of Mathematics and Information Technology, The Education University of Hong Kong, Tai Po, Hong Kong
Soujanya Poria: School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
D. M. Motiur Rahaman: School of Computing and Mathematics, Charles Sturt University, Wagga Wagga, NSW, Australia
Yanghui Rao: School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
Simon See: NVIDIA AI Technology Centre, Singapore, Singapore
Kah Phooi Seng: School of Engineering and Information Technology, University of New South Wales, Canberra, ACT, Australia
Nusrat Jahan Shoumy: School of Computing and Mathematics, Charles Sturt University, Wagga Wagga, NSW, Australia
Boyue Wang: Municipal Key Laboratory of Multimedia and Intelligent Software Technology, Beijing University of Technology, Chaoyang, Beijing, China
Fu Lee Wang: Caritas Institute of Higher Education, Tseung Kwan O, Hong Kong
Song Wei: The Computational Intelligence Centre, School of Computer Science and Electronic Engineering, University of Essex, Colchester, UK
Zhenghong Wei: Department of Statistics, Shenzhen University, Shenzhen, Guangdong, China
Haoran Xie: Department of Mathematics and Information Technology, The Education University of Hong Kong, Tai Po, Hong Kong
Hong Yan: Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong
Boyan Zhang: School of Information Technologies, The University of Sydney, Darlington, NSW, Australia
Yanfei Zhang: University of Sydney Business School, The University of Sydney, Darlington, NSW, Australia
Hongya Zhao: Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong
Endong Zhu: School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China

Part I

Introduction

Chapter 1

Multimodal Information Processing and Big Data Analytics in a Digital World

Kah Phooi Seng, Li-minn Ang, Alan Wee-Chung Liew, and Junbin Gao

K. P. Seng (*), School of Engineering and Information Technology, University of New South Wales, Canberra, ACT, Australia; e-mail: [email protected]
L.-m. Ang · A. W.-C. Liew, School of Information and Communication Technology, Griffith University, Gold Coast, QLD, Australia
J. Gao, The University of Sydney Business School, University of Sydney, Sydney, NSW, Australia

© Springer Nature Switzerland AG 2019. K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_1

Abstract This chapter presents a review of important issues for multimodal information processing and Big data analytics in a digital world, emphasizing the issues brought by the concept of universal machine learning intelligence and data-from-everywhere, and driven by the applications for the future and next-generation technologies. Furthermore, the chapter explains the organization of the book, describing which issues and related technologies are addressed in which chapters of the book.

1.1 Introduction

The first generation of Big data systems was mostly concerned with the processing of text-based data for applications such as website analytics, social media analytics, and credit card fraud detection. It is anticipated that the next generation of Big data systems will focus on the processing of myriad forms of multimodal data such as images, videos, speech data, gestures, facial expressions, location-based data, and gene-based data. In this book, we use the term “Big Multimodal Data” to convey this synergy of Big data systems from heterogeneous sources with universal machine learning approaches for multimodal information processing, with the goal of finding hidden patterns and deriving useful outcomes for society. While the use of multimodal data gives increased information-rich content for information mining and knowledge processing, it leads to a number of additional challenges in terms of scalability, decision-making, data fusion, distributed architectures, and predictive analytics. Addressing these challenges requires new information processing approaches to be developed for data collection, transmission, storage, and processing (analytics).

This book addresses the challenges for Big multimodal data analytics where several aspects of universal machine intelligence, ubiquitous computing, data-from-everywhere, and powerful computational elements are emerging. The objective of this book is to collect together quality research works on multimodal analytics for Big data to serve as a comprehensive source of reference and to play an influential role in setting the scopes and directions of this emerging field of research. The prospective audience would be researchers, professionals, and students in engineering and computer science who engage with speech/audio processing, image/visual processing, multimodal information processing, data science, artificial intelligence, etc. We hope that the book will serve as a source of reference for technologies and applications ranging from single-modality to multimodal data analytics in Big data environments.

We give a short introduction to these issues and explain how the book is organized. Following this introduction, the rest of the book is organized into four parts as follows. Part II covers important emerging technologies for Big multimodal data in the areas of sentiment, affect, and emotion analytics. The research area of affective computing is highly representative and illustrative of the role that Big multimodal data analytics will play in the digital world. The objective of covering this area early in the book is to give readers an overall appreciation of the strengths and issues involved in multimodal Big data analytics, before more advanced techniques are given in the later sections of the book.
Parts III and IV then give more in-depth discussions of advanced machine learning strategies and techniques for Big multimodal data. Part III covers some challenging issues to be addressed for unsupervised learning strategies for Big multimodal data, and Part IV covers the issues for supervised learning strategies for Big multimodal data. The final part of the book, Part V, provides a selection of emerging applications and information processing techniques for Big multimodal data.

1.2 Sentiment, Affect, and Emotion Analytics for Big Multimodal Data

Emotions are strongly characteristic of human beings. In the field of psychology, the term affect is used to describe the experience of feeling or emotion. The area of affective computing can be said to refer to the field of information processing and computation for emotions, affect, and sentiment. This area of affective computing has a growing relevance and influence in the digital world. In a global market, the success of a business in the digital world is heavily dependent on building strong customer relationships and increasing customer satisfaction levels. The determination of customer and product sentiments has a major role to play in this area of affective computing. For example, by exploiting the rapid advance of information and communication technologies (ICT) such as the application of video analytics in contact centers, the customer’s emotions over the full service cycle can be evaluated to improve the satisfaction and retention levels. The use of video analytics is a good illustration of multimodal data, as a video stream contains two data sources (audio and visual components) that collectively contain the overall emotion or affect data to be extracted for the analytics. An audiovisual multimodal emotion recognition system could be developed for the classification and identification of the six universal human emotions (happy, angry, sad, disgust, surprise, and fear) from video data. The detected customer emotions can then be mapped to give the corresponding customer satisfaction level scores. A significant challenge to be overcome in multimodal information processing is to determine the optimal fusion techniques to combine the collective information from the multiple data sources. In our example for video analytics at customer contact centers, the two data sources to be combined or fused are the audio emotion component (e.g., speech data) and the visual emotion component (e.g., facial expressions). The challenges are further exacerbated when all the myriad data sources and interactions (e.g., tweets, emails, phone conversations, video chats, blogs, and other online social media) from the customer have to be considered in the analytics to determine an overall customer emotional profile. The data-from-everywhere for thousands of customers to be profiled leads to a Big data challenge for multimodal information processing and analytics. Part II of this book expounds and further discusses multimodal Big data information processing for sentiment, affect, and emotion analytics.
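The contact-center example above can be made concrete with a minimal sketch of decision-level (late) fusion. It assumes that two separate classifiers, not shown here, have already produced a probability for each of the six emotions from the audio and visual streams. The function names, the fusion weight, and the satisfaction values are hypothetical choices made for illustration; none of them come from the book.

```python
EMOTIONS = ["happy", "angry", "sad", "disgust", "surprise", "fear"]

# Hypothetical satisfaction value per emotion, in [0, 1]; illustrative only.
SATISFACTION = {
    "happy": 1.0, "surprise": 0.6, "sad": 0.3,
    "fear": 0.3, "disgust": 0.1, "angry": 0.0,
}


def late_fusion(audio_probs, visual_probs, audio_weight=0.5):
    """Decision-level fusion: weighted average of per-modality probabilities."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    fused = {
        e: audio_weight * audio_probs[e] + (1.0 - audio_weight) * visual_probs[e]
        for e in EMOTIONS
    }
    total = sum(fused.values())
    # Renormalise so the fused scores form a probability distribution.
    return {e: p / total for e, p in fused.items()}


def predict_emotion(fused_probs):
    """Return the emotion with the highest fused probability."""
    return max(fused_probs, key=fused_probs.get)


def satisfaction_score(fused_probs):
    """Expected satisfaction under the fused emotion distribution."""
    return sum(fused_probs[e] * SATISFACTION[e] for e in EMOTIONS)


# Toy classifier outputs for one video segment (made-up numbers).
audio = {"happy": 0.2, "angry": 0.5, "sad": 0.1,
         "disgust": 0.1, "surprise": 0.05, "fear": 0.05}
visual = {"happy": 0.1, "angry": 0.6, "sad": 0.1,
          "disgust": 0.1, "surprise": 0.05, "fear": 0.05}

fused = late_fusion(audio, visual)
emotion = predict_emotion(fused)
score = satisfaction_score(fused)
print(emotion, round(score, 3))
```

A fixed weighted average is only one of many possible fusion rules. In practice the weight would be tuned on validation data, or replaced by a learned combiner.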
Chapter 2 presents a multimodal sentiment analysis framework that outperforms the state of the art. Sentiment analysis (also known as opinion mining) is a branch of research in affective computing that infers the view, position, or attitude of a user; the goal is to classify the data into positive, negative, or neutral sentiments. In this chapter, the authors propose using a deep convolutional neural network to extract features from visual, audio, and text modalities. The chapter also discusses an important component of a multimodal information processing system, namely the type of information fusion technique (feature-level or early fusion, decision-level or late fusion, hybrid multimodal fusion, model-level fusion, rule-based fusion, classification-based fusion, and estimation-based fusion) to be applied to combine the multimodal data. The authors in Chap. 3 present a review of existing work for multimodal Big data affective analytics focusing on both sentiment and emotion analytics. The review is presented for three forms of multimodal data (text, audio, and visual modalities). The final part of the chapter presents a multimodal sentiment recognition approach for video, speech, and text data that can be implemented on Big multimodal databases, and the approach is validated using the YouTube dataset. Chapter 4 proposes another approach for sentiment analysis using a Hybrid Convolutional Neural Network (HCNN) to predict the sentiment strengths from textual datasets based on semantic and syntactic information. The authors use a hybrid approach to combine the

K. P. Seng et al.

character-level, word-level, and part-of-speech (POS) features to predict the overall intensity over sentiment labels, and validated their work using a real-world corpus from six different sources to show the effectiveness of the proposed model.

1.3 Unsupervised Learning Strategies for Big Multimodal Data

In general, there are two main machine learning strategies and techniques (unsupervised learning and supervised learning) that can be employed for Big multimodal data (other techniques like semi-supervised learning also exist). The choice between the two learning strategies depends heavily on whether the true class labels for the multimodal data are available to perform the analytics. Unsupervised machine learning strategies and techniques do not require the class labels to be available for the analytics. An important problem that falls into the category of unsupervised machine learning is the problem of data clustering, where the objective is to partition a dataset into non-overlapping groups or clusters of similar elements such that each data instance is assigned to only one cluster. Many approaches have been proposed to perform the clustering task (e.g., partition-based clustering, hierarchical clustering, density-based clustering, spectral clustering, grid-based clustering, and correlation-based clustering). The well-known k-means algorithm is an example of the partition-based clustering approach. Part III of this book discusses advanced learning strategies and techniques that are suitable for handling the multiple modalities for Big data from varied sources. We use the term “data-from-everywhere” to convey this explosion of data from multimodal data sources. This part focuses on two major challenges for unsupervised machine learning strategies in dealing with the data-from-everywhere for Big multimodal data. The two challenges are termed the co-clustering (also known as bi-clustering) challenge and the dimensionality reduction challenge. Unlike conventional clustering algorithms, co-clustering aims at the simultaneous clustering of the rows and columns of a matrix. The authors in Chap. 5 discuss the challenging issues for co-clustering multimodal data based on matrix and tensor decomposition. This chapter gives a review and background information for the co-clustering problem, and then presents a systematic co-clustering framework based on canonical decomposition (CP). Their work is validated on a time series dataset for genomic expression from multiple sclerosis patients after injection treatment. Chapter 6 proposes a different approach toward solving the bi-clustering problem by using nature-inspired strategies. This chapter introduces advanced approaches for bi-clustering using multi-objective optimization and search strategy techniques (when several conflicting objectives are being optimized) and proposes an evolutionary algorithm termed PBD-SPEA for

the detection of highly enriched bi-clusters. The authors validated their approach using different types of multimodal data (gene expression data, image data, and data from social media [Facebook]). The second challenge of dimensionality reduction for unsupervised learning for Big multimodal data involves building models to map the data from high-dimensional datasets to lower-dimensional datasets while retaining the most important information for the final clustering. Chapter 7 presents a discussion of this problem using mappings on Grassmann manifolds for Big data clustering. The chapter also gives background information and discusses three major strategies in dealing with learning tasks on manifolds (intrinsic strategy, extrinsic strategy, and embedding strategy). The authors then propose an approach using the embedding strategy and a Low Rank Representation (LRR) model to be used for clustering a set of data points based on the new representation. Their work is validated on data from videos and image sets occurring in computer vision applications.
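As a concrete illustration of the partition-based clustering mentioned in this section, the following is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. The toy points, the function name `kmeans`, and the deterministic seeding from the first k points are assumptions made for illustration; they are not taken from any chapter of the book.

```python
import math

def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on a list of equal-length tuples.

    Assumption: centroids are seeded from the first k points so the run is
    deterministic; practical systems usually use random or k-means++ seeding.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical two-dimensional data with two well-separated groups.
points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
labels, centroids = kmeans(points, k=2)
```

Each point receives exactly one label, which matches the non-overlapping partition described above. Co-clustering differs in that the rows and the columns of the data matrix are grouped at the same time.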

1.4 Supervised Learning Strategies for Big Multimodal Data

Supervised machine learning strategies and techniques do require the class labels to be available for the analytics and the classification tasks. In the literature, many techniques have been proposed for supervised machine learning tasks such as decision trees, logistic regression, statistical and Bayesian approaches, ensemble learning approaches, nature-inspired strategies and evolutionary-based approaches, support vector machines (SVMs), and techniques employing artificial neural networks. A current trend in supervised machine learning strategies is toward performing the classification tasks using deep-learning-based neural networks. Data representation is an important factor to be considered for machine learning tasks in general. This issue is particularly critical for multimodal data that comes from multiple and varied data sources because each source may have a different representation. Conventional techniques use what is often termed a feature-based data representation, where a handcrafted approach is used to find the best data representation among all the possible feature combinations. In contrast, deep learning techniques use a learning-based representation where the network is able to learn the high-level representations from low-level features using a set of nonlinear transformations without needing to handcraft the features. Part IV of this book discusses important issues and challenges to be considered for advanced supervised learning strategies for Big multimodal data. Chapter 8 begins by presenting an approach for multitask learning using deep neural networks. Compared with conventional machine learning algorithms, which only focus on performing a single classification task at a time, the multitask learning problem aims to allow for learning from multiple tasks that share the same goal. The authors propose an approach for the multi-product newsvendor model where

k products are considered in a single training process. In the conventional approach, to deal with multiple products simultaneously, the optimization problem for classification has to be repeated multiple times, leading to inefficiencies. Chapter 9 considers the challenges of using Recurrent Neural Networks (RNNs) for time series forecasting where both the spatial and temporal information contained in multimodal time series data have to be used for accurate forecasting. This chapter discusses the various models for RNNs (Elman RNN, Jordan RNN, Long Short-Term Memory (LSTM), Gated Recurrent Units (GRUs), and others). The authors then propose an approach termed the Tensorial Recurrent Neural Network (TRNN) and show that the TRNN generally outperforms the other models for an image captioning application. The final chapter in this part (Chap. 10) addresses the issue of Scalable Multimodal Factorization (SMF) for learning from Big multimodal data. The authors' proposed SMF model improves the performance of MapReduce-based factorization algorithms, particularly when the observed data becomes bigger. Their approach is validated on real-world datasets and is shown to be effective in terms of convergence speed and accuracy.

1.5 Multimodal Big Data Processing and Applications

Part V of this book focuses on various applications and information processing techniques for Big multimodal data systems. Chapter 11 addresses an important issue for Big multimodal data and presents a framework for multimodal digital media data management that allows the various input modalities to be registered into a unified three-dimensional space. The authors propose an approach based on a hybrid Random Sample Consensus (RANSAC) technique for multimodal visual data registration. Their work is validated on datasets from a multimodal database acquired from various modalities including active (LIDAR) and passive (RGB point clouds, 360° spherical images, and photos) visual sensors. The authors in Chap. 12 present a hybrid fuzzy system to classify football scenes from Big video data. The authors introduce a novel system based on Hybrid Interval Type-2 Fuzzy Logic Classification Systems (IT2FLCS) whose parameters are optimized using the Big Bang-Big Crunch (BB-BC) algorithm. The objective of the system is to classify the football scene into three categories (center scene, player close-up scene, and people scene). Their work is validated using data from over 20 videos selected from various European football leagues. The authors in Chap. 13 address the problem of traffic congestion prediction in transportation systems using fusion models from machine learning algorithms (neural networks, random forests, and deep belief networks), extended Kalman filters, and traffic tweet information from Twitter data sources. Their work is validated on real-world traffic data sources from Canada. The remaining two chapters of the book provide some final discussions on various other facets and aspects to be considered for processing Big multimodal data. Chapter 14 discusses parallel and distributed computing architectures for

processing Big image and video data. The authors describe a Parallel Image Processing Library (ParIPL) and framework that aims to significantly simplify image processing for Apache Hadoop using the Hadoop Image Processing Interface (HIPI) and Graphics Processing Units (GPUs). The final chapter (Chap. 15) discusses multimodal information processing approaches in analyzing and interpreting Big social media data. The authors describe their experiences in acquiring, viewing, and interacting (including using VR headsets and gesture-based navigation) with different data modalities collected from social media networks.

Part II

Sentiment, Affect and Emotion Analysis for Big Multimodal Data

Chapter 2

Speaker-Independent Multimodal Sentiment Analysis for Big Data

Erik Cambria, Soujanya Poria, and Amir Hussain

Abstract In this chapter, we propose a contextual multimodal sentiment analysis framework which outperforms the state of the art. This framework has been evaluated against speaker-dependent and speaker-independent problems. We also address the generalizability issue of the proposed method. This chapter also discusses an important component of a multimodal information processing system, namely the type of information fusion technique to be applied to combine the multimodal data.

2.1 Introduction

In recent years, sentiment analysis [1] has become increasingly popular for processing social media data on online communities, blogs, Wikis, microblogging platforms, and other online collaborative media. Sentiment analysis is a branch of affective computing research [2] that aims to classify text (but sometimes also audio and video [3]) into either positive or negative (but sometimes also neutral [4]). Sentiment analysis systems can be broadly categorized into knowledge-based [5], statistics-based [6], and hybrid [7]. While most works approach it as a simple categorization problem, sentiment analysis is actually a suitcase research problem [8] that requires tackling many NLP tasks, including aspect extraction [9], named entity recognition [10], word polarity disambiguation [11], temporal tagging [12], personality recognition [13], and sarcasm detection [14]. Sentiment analysis has raised growing interest both within the scientific community and in the business world, due to the remarkable benefits to be had from financial forecasting [15] and

E. Cambria (*) · S. Poria
School of Computer Science and Engineering, NTU, Singapore, Singapore

A. Hussain
School of Natural Sciences, University of Stirling, Stirling, UK

© Springer Nature Switzerland AG 2019
K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_2

E. Cambria et al.

political forecasting [16], e-health [17] and e-tourism [18], community detection [19] and user profiling [20], and more. With the advancement of communication technology, the abundance of smartphones, and the rapid rise of social media, a large amount of data is uploaded by users as videos rather than text. For example, consumers tend to record their reviews and opinions on products using a web camera and upload them on social media platforms such as YouTube or Facebook to inform subscribers of their views. These videos often contain comparisons of products from competing brands, the pros and cons of product specifications, etc., which can aid prospective buyers in making an informed decision. The primary advantage of analyzing videos over textual analysis for detecting emotions and sentiment from opinions is the surplus of behavioral cues. Video provides multimodal data in terms of vocal and visual modalities. The vocal modulations and facial expressions in the visual data, along with textual data, provide important cues to better identify the true affective states of the opinion holder. Thus, a combination of text and video data helps create a better emotion and sentiment analysis model. Recently, a number of approaches to multimodal sentiment analysis producing interesting results have been proposed [21–23]. However, there are major issues that remain unaddressed in this field, such as the role of speaker-dependent and speaker-independent models, the impact of each modality across datasets, and the generalization ability of a multimodal sentiment classifier. Not tackling these issues has made it difficult to compare different multimodal sentiment analysis methods effectively. In this chapter, we address some of these issues and, in particular, propose a novel framework that outperforms the state of the art on benchmark datasets by more than 10%. We use a deep convolutional neural network to extract features from visual and text modalities.
The chapter is organized as follows: Sects. 2.2 and 2.3 provide a literature review on multimodal affect recognition and sentiment analysis; Sect. 2.4 describes the proposed method. Experimental results are presented in Sect. 2.5. Finally, Sect. 2.6 discusses the findings and concludes the chapter.

2.2 Multimodal Affect Recognition

Multimodal affect analysis has already created a lot of buzz in the field of affective computing. This field has now become equally important and popular among computer scientists [24]. State-of-the-art unimodal methods use either the visual, audio, or text modality for affect recognition. In this section, we discuss approaches to the multimodal affect recognition problem.

2 Speaker-Independent Multimodal Sentiment Analysis for Big Data

2.2.1 Information Fusion Techniques

Multimodal affect recognition can be seen as the fusion of information coming from different modalities. Multimodal fusion is the process of combining data collected from various modalities for analysis tasks. It has gained increasing attention from researchers in diverse fields, owing to its potential for innumerable applications, including but not limited to sentiment analysis, emotion recognition, semantic concept detection, event detection, human tracking, image segmentation, and video classification. The fusion of multimodal data can provide surplus information with an increase in accuracy [25] of the overall result or decision making. As the data collected from various modalities comes in various forms, it is also necessary to consider the level at which multimodal fusion takes place. To date, there are mainly two levels or types of fusion studied by researchers: feature-level fusion or early fusion, and decision-level fusion or late fusion. These have also been employed by some researchers as part of a hybrid fusion approach. Furthermore, there is “model-level fusion,” a type of multimodal fusion designed by researchers as per their application requirements.

Feature-Level or Early Fusion [26–30] This approach fuses the features extracted from various modalities, such as visual features, text features, audio features, etc., into a general feature vector, and the combined features are sent for analysis. The advantage of feature-level fusion is that the correlation between various multimodal features at an early stage can potentially provide better task accomplishment. The disadvantage of this fusion process is time synchronization: the features obtained belong to diverse modalities and can differ widely in many aspects, so before the fusion process takes place, the features must be brought into the same format.
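The feature-level fusion described above can be sketched as follows. This is a minimal illustration, assuming that the features of all modalities have already been aligned per utterance and that per-feature standardization is an adequate way to bring them "into the same format"; the feature values and the helper names `zscore_columns` and `early_fusion` are hypothetical.

```python
from statistics import mean, pstdev

def zscore_columns(rows):
    """Standardise each column of a list of equal-length feature rows."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return [
        [(x - mu) / sigma for x, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]

def early_fusion(*modalities):
    """Concatenate per-utterance features from several modalities.

    Each argument is a list of feature rows; row i of every modality must
    describe the same utterance (synchronisation is assumed to be done).
    """
    normalised = [zscore_columns(m) for m in modalities]
    return [sum(rows, []) for rows in zip(*normalised)]

# Hypothetical toy features for three utterances.
audio = [[120.0, 0.30], [180.0, 0.10], [150.0, 0.20]]  # e.g., pitch, energy
visual = [[0.9], [0.1], [0.5]]                          # e.g., smile intensity
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]             # e.g., unigram flags

fused = early_fusion(audio, visual, text)  # one 5-dimensional vector per utterance
```

The fused vectors would then be passed to a single classifier, which is what allows cross-modal correlations to be exploited at an early stage.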
Decision-Level or Late Fusion [31–35] In this fusion process, the features of each modality are examined and classified independently and the results are fused as a decision vector to obtain the final decision. The advantage of decision-level fusion is that fusing the decisions obtained from various modalities is easier than feature-level fusion, since the decisions resulting from multiple modalities usually have the same form of data. Another advantage of this fusion process is that every modality can utilize its most suitable classifier or model to learn its features. The disadvantage is that, as different classifiers are used for the analysis task, the learning process of all these classifiers at the decision-level fusion stage becomes tedious and time consuming. Our survey of fusion methods used to date has shown that, more recently, researchers have tended to prefer decision-level fusion over feature-level fusion. Among the notable decision-level fusion methods, the Kalman filter has been proposed in [34] as a method to fuse classifiers. They treated video as a time series, and the prediction scores (between 0 and 1) of the base classifiers were fused using a Kalman filter. On the other hand, Dobrišek et al. [35] employed the weighted sum and weighted product rules for fusion. On the eNTERFACE dataset, the weighted product rule (accuracy: 77.20%) gave a better result than the weighted sum approach (accuracy: 75.90%).
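The weighted sum and weighted product rules mentioned above can be sketched as follows. This is an illustrative example under stated assumptions, not the implementation used in [35]: the per-modality class probabilities and the weights are invented, and the function names are hypothetical.

```python
def weighted_sum(posteriors, weights):
    """Fuse per-modality class probabilities with a weighted sum."""
    classes = posteriors[0].keys()
    return {c: sum(w * p[c] for p, w in zip(posteriors, weights)) for c in classes}

def weighted_product(posteriors, weights):
    """Fuse per-modality class probabilities with a weighted product."""
    classes = posteriors[0].keys()
    scores = {c: 1.0 for c in classes}
    for p, w in zip(posteriors, weights):
        for c in classes:
            scores[c] *= p[c] ** w
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}  # renormalise to sum to 1

def decide(scores):
    """Return the class with the highest fused score."""
    return max(scores, key=scores.get)

# Hypothetical outputs of three independently trained classifiers.
audio = {"positive": 0.60, "negative": 0.40}
visual = {"positive": 0.30, "negative": 0.70}
text = {"positive": 0.80, "negative": 0.20}
weights = [0.3, 0.3, 0.4]  # assumed modality weights, normalised to sum to 1

sum_scores = weighted_sum([audio, visual, text], weights)
prod_scores = weighted_product([audio, visual, text], weights)
```

Because every classifier emits scores in the same form, the fusion step reduces to simple arithmetic over the decision vector, which is the ease of combination noted above.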

Hybrid Multimodal Fusion [23, 36, 37] This type of fusion is the combination of both feature-level and decision-level fusion methods. In an attempt to exploit the advantages of both feature- and decision-level fusion strategies and overcome the disadvantages of each, researchers opt for hybrid fusion.

Model-Level Fusion [38–42] This is a technique that uses the correlation between data observed under different modalities, with a relaxed fusion of the data. Researchers built models satisfying their research needs and the problem space. Song et al. [43] used a tripled Hidden Markov Model (HMM) to model the correlation properties of three component HMMs based on audiovisual streams. Zeng et al. [44] proposed a Multistream Fused Hidden Markov Model (MFHMM) for audiovisual affect recognition. The MFHMM builds an optimal connection between various streams based on the maximum entropy and maximum mutual information principles. Caridakis et al. [45] and Petridis et al. [46] proposed neural networks to combine audio and visual modalities for emotion recognition. Sebe et al. [47] proposed a Bayesian network topology to recognize emotions from audiovisual modalities, by combining the two modalities in a probabilistic manner.

According to Atrey et al. [48], fusion can be classified into three categories: rule-based, classification-based, and estimation-based methods. The categorization is based on the basic nature of the methods and the problem space, as outlined next.

Rule-Based Fusion Methods [49, 50] As the name suggests, multimodal information is fused by statistical rule-based methods such as linear weighted fusion, majority voting, and custom-defined rules. The linear weighted fusion method uses sum or product operators to fuse features obtained from different modalities or decisions obtained from classifiers. Before the fusion of multimodal information takes place, normalized weights are assigned to every modality under consideration. The linear weighted fusion method is computationally less expensive than other methods; however, the weights need to be normalized appropriately for optimal execution. The drawback is that the method is sensitive to outliers. Majority voting fusion is based on the decision obtained by a majority of the classifiers. Custom-defined rules are application specific, in that the rules are created depending on the information collected from various modalities and the final outcome expected in order to achieve optimized decisions.

Classification-Based Fusion Methods [51, 52] In this method, a range of classification algorithms are used to classify the multimodal information into predefined classes. Various methods used under this category include Support Vector Machines (SVMs), Bayesian inference, Dempster–Shafer theory, dynamic Bayesian networks, neural networks, and maximum entropy models. SVM is probably the most widely used supervised learning method for data classification tasks. In this method, input data vectors are classified into predefined learned classes, thus solving the pattern classification problem in view of multimodal fusion. The method is usually applicable for decision-level and hybrid fusion. The Bayesian inference fusion method fuses multimodal information based on rules of probability theory. In this method, the features from various modalities or the decisions obtained from

various classifiers are combined and their joint probability is inferred. The Dempster–Shafer evidence theory generalizes the Bayesian theory of subjective probability. This theory allows the union of classes and also represents both uncertainty and imprecision, through the definition of belief and plausibility functions. The Dempster–Shafer theory is a statistical method and is concerned with fusing independent sets of probability assignments to form a single class, thus relaxing the disadvantage of the Bayesian inference method. The Dynamic Bayesian Network (DBN) is an extension of the Bayesian inference method to a network of graphs, where the nodes represent different modalities and the edges denote their probabilistic dependencies. The DBN is known by different names in the literature, such as Probabilistic Generative Models, graphical models, etc. The advantage of this network over other methods is that the temporal dynamics of multimodal data can easily be integrated. The most popular form of DBN is the Hidden Markov Model (HMM). The maximum entropy model is a statistical model classifier that follows an information-theoretic approach and provides the probability of the observed classes. Finally, the other widely used method is neural networks. A typical neural network model consists of input, hidden, and output nodes or neurons. The input to the network can be features from different modalities or decisions from various classifiers. The output provides the fusion of the data under consideration. The hidden layer of neurons provides activation functions to produce the expected output, and the number of hidden layers and neurons are chosen to obtain the desired accuracy of results. The connections between neurons have specific weights that can be appropriately tuned during the learning process of the neural network, to achieve the target performance accuracy.
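To make the Dempster–Shafer idea of fusing independent probability assignments more concrete, the following is a minimal sketch of Dempster's rule of combination for two sources. The mass values, the two-class frame, and the function name `dempster_combine` are assumptions made for illustration; mass placed on the union of classes expresses that a source cannot tell the classes apart.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of class labels to its mass;
    the masses of each source are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            conflict += mass_a * mass_b  # evidence that contradicts itself
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    # Redistribute the conflicting mass over the consistent hypotheses.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

POS, NEG = frozenset({"positive"}), frozenset({"negative"})
EITHER = POS | NEG  # the union of classes: "cannot tell"

# Hypothetical mass assignments from an audio and a visual classifier.
audio = {POS: 0.6, NEG: 0.1, EITHER: 0.3}
visual = {POS: 0.5, NEG: 0.3, EITHER: 0.2}
fused = dempster_combine(audio, visual)
```

With these numbers the conflicting mass is 0.23, so the remaining mass (0.57 for positive, 0.14 for negative, 0.06 for the union) is divided by 0.77.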
Estimation-Based Fusion Methods [53, 54] This category includes Kalman filter, extended Kalman filter, and particle-filter-based fusion methods. These methods are usually employed to estimate the state of a moving object using multimodal information, especially audio and video. The Kalman filter is used for real-time dynamic, low-level data and provides state estimates for the system. This model does not require storage of the past of the object under observation, as the model only needs the state estimate of the previous time stamp. However, the Kalman filter model is restricted to linear systems. Thus, for systems with nonlinear characteristics, the extended Kalman filter is used. Particle filters, also known as Sequential Monte Carlo methods, are simulation-based methods used to obtain the state distribution of nonlinear and non-Gaussian state-space models.
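The property that a Kalman filter needs only the previous state estimate can be seen in a scalar sketch such as the one below. It assumes a random-walk state model and made-up noise variances, and it smooths a hypothetical sequence of classifier scores between 0 and 1; it is not the fusion scheme of [34].

```python
def kalman_1d(measurements, process_var, measurement_var, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk state model.

    Only the previous estimate x and its variance p are carried forward;
    no history of past measurements is stored.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = p / (p + measurement_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates

# Hypothetical per-frame prediction scores of a base classifier.
scores = [0.62, 0.58, 0.71, 0.65, 0.60]
smoothed = kalman_1d(scores, process_var=1e-3, measurement_var=0.05, x0=0.5)
```

Both the state transition and the measurement model here are linear, which is the restriction noted above; the extended Kalman filter linearizes nonlinear models around the current estimate.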

2.3 Recent Results

In this section, we describe recent key works in multimodal affect recognition. We summarize state-of-the-art methods and their results, and categorize the works by the multimodal dataset used.

2.3.1 Multimodal Sentiment Analysis

MOUD Dataset The work by Perez et al. [21] focuses on multimodal sentiment analysis using the MOUD dataset based on visual, audio, and textual modalities. FACS and AUs were used as visual features and openEAR was used for extracting acoustic and prosodic features. Simple unigrams were used for textual feature construction. The combination of these features was then fed to an SVM for fusion and 74.66% accuracy was obtained. In 2015, Poria et al. [23] proposed a novel method for extraction of features from short texts using a deep convolutional neural network (CNN). The method was used for detection of sentiment polarity with all three modalities (audio, video, and text) under consideration in short video clips of a person uttering a sentence. In that work, a deep CNN was trained; however, instead of using it as a classifier, values from its hidden layer were used as features for input to a second-stage classifier, leading to further improvements in accuracy. The main novelty of that work was using a deep CNN to extract features from text and multiple kernel learning for classification of the multimodal heterogeneous fused feature vectors. For the visual modality, CLM-Z based features were used and openEAR was employed on the audio data for feature extraction.

YouTube Dataset Morency et al. [55] extracted facial features like smile detection and duration, look away, and audio features like pause duration for sentiment analysis on the YouTube dataset. As textual features, two lexicons containing positive and negative words were developed from the MPQA corpus distribution. They fused and fed those features to a Hidden Markov Model (HMM) for final sentiment classification. However, the accuracy was relatively low (55.33%). Possible future work would be to use more advanced classifiers, such as SVM or CNN, coupled with the use of complex features. Poria et al. [37] proposed a similar approach where they extracted FCPs using CLM-Z, and used the distances between those FCPs as features. Additionally, they used GAVAM to extract head movement and other rotational features. For audio feature extraction, the state-of-the-art openEAR was employed. Concept-based methods and resources like SenticNet [5] were used to extract textual features. Both feature-level and decision-level fusion were then used to obtain the final classification result.

ICT-MMMO Dataset Wollmer et al. [42] used the same mechanism as [55] for audiovisual feature extraction. In particular, OKAO vision was used to extract visual features, which were then fed to CFS for feature selection. In the case of audio feature extraction, they used openEAR. Simple bag-of-words (BOW) features were utilized as text features. Audiovisual features were fed to a Bidirectional-LSTM (BLSTM) for early feature-level fusion and an SVM was used to obtain the class label of the textual modality. Finally, the outputs of the BLSTM and SVM were fused at the decision level, using a weighted summing technique.

2.3.2 Multimodal Emotion Recognition

Recent Works on the SEMAINE Dataset Gunes et al. [56] used visual cues to predict dimensional emotions from spontaneous head gestures. Automatic detection of head nods and shakes is based on two-dimensional (2D) global head motion estimation. In order to determine the magnitude and direction of the 2D head motion, optical flow was computed between two consecutive frames. It was applied to a refined region (i.e., resized and smoothed) within the detected facial area to exclude irrelevant background information. Directional code words were generated by the visual feature extraction module, and fed into an HMM for training a nodHMM and a shakeHMM. A Support Vector Machine for Regression (SVR) was used for dimensional emotion prediction from head gestures. The final feature set was scaled to the range [−1, +1]. The parameters of the SVR for each coder-dimension combination were optimized using tenfold cross-validation on a subset of the data at hand. The MSE for detection of valence, arousal, and other axes was found to be 0.1 on average, as opposed to 0.115 resulting from human annotators.

Valstar et al. [57] focused on FACS Action Unit detection and intensity estimation and derived their datasets from the SEMAINE and BP4D-Spontaneous databases. The training partition (SEMAINE database) consisted of 16 sessions, the development partition of 15 sessions, and the test partition of 12 sessions. There were a total of 48,000 images in the training partition, 45,000 in development, and 37,695 in testing (130,695 frames in total). For SEMAINE, 1-min segments of the most facially expressive part of each selected interaction were coded. For the baseline system for this task, two types of features were extracted: two-layer appearance features (Local Binary Gabor Patterns) and geometric features derived from tracked facial point locations, which were then fed into a linear SVM. The average MSE on AU in BP4D datasets was around 0.8, while similar techniques were not applied on SEMAINE. Both [56] and [57] took into consideration all the frames of the videos, which in turn made the training more time-consuming.

Nicolaou et al. [58] developed an algorithm for automatically segmenting videos into data frames, in order to show the transition of emotions. To ensure one-to-one correspondence between timestamps across coders, annotations were binned according to video frames. The crossing over from one emotional state to the other was detected by examining the valence values and identifying the points where the sign changed. The crossovers were then matched across coders. The crossover frame decision was made and the start frame of the video segment decided. The ground truth values for valence were retrieved by incrementing the initial frame number where each crossover was detected by the coders. The procedure of determining combined average values continued until the valence value crossed again to a nonnegative valence value. The endpoint of the audiovisual segment was then set to the frame including the offset, after crossing back to a nonnegative valence value. Discerning dimensional emotions from head gestures proposed a string-based audiovisual fusion, which achieved better results for dimensions valence and expectation as compared to feature-based fusion. This approach added video-based events like

20

E. Cambria et al.

facial expression action units, head nods, and shakes as “words” to a string of acoustic events. The nonverbal visual events were extracted similarly to the unimodal analysis illustrated in [56] (use of nodHMM and shakeHMM). For detection of facial action units, a local binary patterns descriptor was used and tested on the MMI Facial Expression Database. For verbal and nonverbal acoustic events, emotionally relevant keywords derived from automatic speech recognition (ASR) transcripts of SEMAINE were used. Keywords were detected using the multistream large vocabulary continuous speech recognition (LVCSR) engine on the recognizer’s output, rather than ground truth labels. Finally, an SVR with linear kernel was trained. The event fusion was performed at the string level per segment, by joining all events where more than half of the event overlapped with the segment in a single string. The events could thus be seen as “words.” The resulting strings were converted to a feature vector representation through a binary bag-of-words (BOW) approach. This led to an average correlation coefficient of 0.70 on Activation, Valence, and Intensity, which nearly matches human accuracy for the same task. Recent Works on the HUMAINE Dataset Chang et al. [59] worked on the vocal part of the HUMAINE dataset to analyze emotion, mood, and mental state, eventually packaging the approach into AMMON, a low-footprint C library for phones. Sound processing starts with segmenting the audio stream from the microphone into frames with fixed duration (200 ms) and fixed stepping duration (80 ms). The features selected were low-level descriptors (LLDs) (ZCR, RMS, MFCC, etc.) and functionals (mean, SD, skewness). AMMON was developed by extending an ETSI (European Telecommunications Standards Institute) front-end feature extraction library. It included features to describe glottal vibrational cycles, which is a promising feature for monitoring depression. 
They performed a two-way classiﬁcation task to separate clips with positive emotions from those with negative emotions. A feature vector was extracted from each clip using AMMON without glottal timings. Finally, the use of SVM with these feature vectors produced 75% accuracy on BELFAST (The naturalistic dataset of HUMAINE). Castellano et al. [60] aimed to integrate information from facial expressions, body movement, gestures and speech, for recognition of eight basic emotions. The facial features were extracted by generating feature masks, which were then used to extract feature points, comparing them to a neutral frame to produce face animation parameters (FAPs) as in the previous research. Body tracking was performed using the EyesWeb platform, which tracked silhouettes and blobs, extracting motion and ﬂuidity as main expressive cues. The speech feature extraction focuses on intensity, pitch, MFCC, BSB, and pause length. These were then independently fed into a Bayesian classiﬁer and integrated at decision-level fusion. While the unimodal analysis led to an average of 55% accuracy, feature-level fusion produced a significantly higher accuracy of 78.3%. Decision-level fusion results did not vary much over feature-level fusion. Another interesting work in [61] aims to present a novel approach to online emotion recognition from visual, speech, and text data. For video labeling, temporal information was exploited, which is known to be an important issue, i.e., one

utterance at time (t) depends on the utterance at time t. The audio features used in the study include: signal energy, pitch, voice quality, MFCC, spectral energy, and time signal, which were then modeled using an LSTM. The spoken content knowledge was incorporated at frame level via early fusion, wherein negative keywords were used for activation, and positive for valence. Subsequently, frame-based emotion recognition with unimodal and bimodal feature sets, and turn-based emotion recognition with an acoustic feature set were performed as evaluations. Finally, whilst an SVR was found to outperform an RNN in recognizing activation features, the RNN performed better in recognition of valence from frame-based models. The inclusion of linguistic features produced no monotonic trend in the system. Recent Works on the eNTERFACE Dataset eNTERFACE is one of the most widely used datasets in multimodal emotion recognition. Though in this discussion we mainly focus on multimodalities, we also explain some of the notable unimodal works that have impacted this research ﬁeld radically. Among unimodal experiments reported on this dataset, one of the notable works was carried out by Eyben et al. [62]. They pioneered the openEAR, a toolkit to extract speech-related features for affect recognition. Several LLDs like Signal Energy, FFT-spectrum, MFCC, Pitch and their functionals were used as features. Multiple Data Sinks were used in the feature extractor, feeding data to different classiﬁers (K-Nearest Neighbor, Bayes, and Support-Vector-based classiﬁcation and regression using the freely available LibSVM). The experiments produced a benchmark accuracy of 75% on the eNTERFACE dataset. The study by Chetty et al. [63] aims to develop an audiovisual fusion approach at multiple levels to resolve the misclassiﬁcation of emotions that occur at unimodal level. The method was tested on two different acted corpora, DaFEx and eNTERFACE. 
Facial deformation features were identiﬁed using singular value decomposition (SVD) values (positive for expansion and negative for contraction) and were used to determine movement of facial regions. Marker-based audiovisual features were obtained by dividing the face into several sectors, and making the nose marker the local center for each frame. PCA was used to reduce the number of features per frame to a ten-dimensional vector for each area. LDA optimized SVDF and VDF feature vectors and an SVM classiﬁer was used for evaluating expression quantiﬁcation, as High, Medium, and Low. The unimodal implementation of audio features led to an overall performance accuracy of around 70% on DaFEx and 60% on eNTERFACE corpus, but the sadness–neutral pair and happiness–anger pair were confused signiﬁcantly. The overall performance accuracy for visual only features was found to be around 82% for the eNTERFACE corpus and only slightly higher on the DaFEx corpus; however, a signiﬁcant confusion value on neutral– happiness and sadness–anger pairs was found. Audiovisual fusion led to an improvement of 10% on both corpora, signiﬁcantly decreasing the misclassiﬁcation probability. Another attempt [64] at merging audiovisual entities led to 66.5% accuracy on the eNTERFACE dataset (Anger being the highest at 81%). They adopted local binary pattern (LBP) for facial image representations for facial expression recognition. The

process of LBP feature extraction generally consists of three steps: first, a facial image is divided into several nonoverlapping blocks. Second, LBP histograms are computed for each block. Finally, the block LBP histograms are concatenated into a single vector. As a result, the facial image is represented by the LBP code. For audio features, prosody features like pitch and intensity, and quality features like HNR, jitter, and MFCC, were extracted. These features were fed into an SVM with the radial basis function kernel. While unimodal analysis produced an accuracy of 55% (visual at 63%), multimodal analysis increased this to 66.51%, demonstrating support for the convergence idea. While the previous two papers focused on late fusion-based emotion recognition, SAMMI [65] was built to focus on real-time extraction, taking into account low-quality videos and noise. A module called “Dynamic Control” was used to adapt the various fusion algorithms and content-based concept extractors to the quality of input signals. For example, if sound quality was detected to be low, the relevance of the vocal emotional estimation, with respect to video emotional estimation, was reduced. This was an important step to make the system more reliable and relax some constraints. The visual part was tested on two approaches: (a) absolute movements of facial feature points (FP); (b) relative movements of pairs of facial FP. To keep the computational cost low, the authors used the Tomasi implementation of the Lucas–Kanade (LK) algorithm (embedded in the Intel OpenCV library). The vocal expressions extracted were similar to those reported in other papers (HNR, jitter, intensity, etc.). The features were computed over 1 s window intervals and fed into two classifiers: an SVM and a conventional Neural Network (NN). Finally, SAMMI performed fusion between estimations resulting from the different classifiers or modalities. The output of such a module significantly enhanced the system performance. 
Since the classiﬁcation step is computationally efﬁcient with both NN and SVM classiﬁers, multiple classiﬁers can be employed at the same time without adversely impacting the system performance. Though the NN was found to improve the CR+ value in fear and sadness, an overall Bayesian network performed equally well with a CR+ of 0.430. Poria et al. [28] proposed an intelligent multimodal emotion recognition framework that adopts an ensemble feature extraction by exploiting the joint use of text, audio, and video features. They trained visual classiﬁer on CK++ dataset, textual classiﬁer on ISEAR dataset and tested on the eNTERFACE dataset. Audio features were extracted using openAudio and cross-validated on the eNTERFACE dataset. Training on the CK++ and ISEAR datasets improved the generalization capability of the corresponding classiﬁer through cross-validated performance on both datasets. Finally, we used feature-level fusion for evaluation and an 87.95% accuracy was achieved, which exceeded all earlier benchmarks. Recent Works on the IEMOCAP Dataset In multimodal emotion recognition, IEMOCAP dataset is the most popular dataset and numerous works have reported its use as a benchmark. Below, we outline some of the recent key works. Rehman and Busso [66] developed a personalized emotion recognition system using an unsupervised feature adaption scheme by exploiting the audio modality. The OpenSMILE toolkit with the INTERSPEECH 2009 Emotion Challenge feature set was used to extract a set of common acoustic and prosodic features. A linear kernel

SVM with sequential minimal optimization (SMO) was used as the emotion detector. The purpose of normalizing acoustic features was to reduce speaker variability, while preserving the discrimination between emotional classes. The iterative feature normalization (IFN) approach iteratively estimated the normalizing parameters from an unseen speaker. It thus served as a suitable framework for a personalized emotion recognition system. In the IFN scheme, an emotion recognition system was used to iteratively identify neutral speech of the unseen speaker. Next, it estimated the normalization parameters using only this subset (relying on the detected labels). These normalization parameters were then applied to the entire data, including the emotional samples. To estimate the performance, the study used leave-one-speaker-out, tenfold cross-validation. The results on the IEMOCAP database indicated that the accuracy of the proposed system was 2% (absolute) higher than the one achieved by the baseline, without the feature adaptation scheme. The results on uncontrolled recordings (i.e., speech downloaded from a video-sharing website) revealed that the feature adaptation scheme significantly improved the unweighted and weighted accuracies of the emotion recognition system. While most papers have focused on audiovisual fusion, Qio Jio [67] reported emotion recognition with acoustic and lexical features. For acoustic features, low-level acoustic features were extracted at frame level on each utterance and used to generate a feature representation of the entire dataset, using the OpenSMILE toolkit. The features extracted were grouped into three categories: continuous, qualitative, and cepstral. Low-level feature vectors were then turned into a static feature vector. For each emotional utterance, a GMM was built via MAP adaptation using the features extracted in the same utterance. 
Top 600 words from each of the four emotion classes respectively were selected and merged to form a basic word vocabulary of size 2000. A new lexicon for each emotion class (in which each word has a weight indicating its inclination for expressing this emotion) was constructed. The new emotion lexicon not only collected words that appeared in one emotion class but also assigned a weight indicating its inclination for expressing this emotion. This emotion lexicon was then used to generate a vector feature representation for each utterance. Two types of fusion schemes were experimented with: early fusion (feature concatenation) and late fusion (classiﬁcation score fusion). The SVM with linear kernel was used as emotion classiﬁer. The system based on early fusion of Cepstral-BoW and GSV-mean acoustic features combined with ACO-based system, Cepstrum-based system, Lex-BoWbased system, and Lex-eVector-based system through late fusion achieves the best weighted emotion recognition accuracy of 69.2%. Continuing with bimodal systems, Metallinou et al. [68] carried out emotion recognition using audiovisual modalities by exploiting Gaussian Mixture Models (GMMs). Markers were placed on the faces of actors to collect spatial information of these markers for each video frame in IEMOCAP. Facial markers were separated into six blocks, each of which deﬁned a different facial region. A GMM was trained for each of the emotional states examined; angry (ANG), happy (HAP), neutral (NEU), and sad (SAD). The marker point coordinates were used as features for the training of Gaussian mixture models. The frame rate of the markers was 8.3 ms. The

feature vector for each facial region consisted of three-dimensional coordinates of the markers belonging to that region plus their ﬁrst and second derivatives. GMM with 64 mixtures was chosen as it was shown to achieve good performance. MFCCs are used for vocal analysis. The feature vector comprised 12 MFCCs and energy, their ﬁrst and second derivatives, constituting a 39-dimensional feature vector. The window length for the MFCC extraction was 50 ms and the overlap set to 25 ms, to match the window of the facial data extraction. Similar to facial analysis, a GMM was trained for each emotion along with an extra one for background noise. Here, a GMM with 32 mixtures was chosen. Two different classiﬁer combination techniques were explored: the ﬁrst a Bayesian approach for multiple cue combination, and the second an ad hoc method utilizing SVMs with radial basis kernels that used post classiﬁcation accuracies as features. Anger and happiness were found to have better recognition accuracies in the face-based classiﬁer compared to emotional states with lower levels of activation, such as sadness and neutrality; while anger and sadness demonstrated good accuracy in voice-based classiﬁers. A support vector classiﬁer (SVC) was used to combine the separate face and voice model decisions. The Bayesian classiﬁer and SVC classiﬁers were found to perform comparably, with neutral being the worst recognized emotional state, and anger/sadness being the best. While previous works focused on bimodality, the work in [69] aims to classify emotions using audio, visual, and textual information by attaching probabilities to each category based on automatically generated trees, with SVMs acting as nodes. There were several acoustic features used, ranging from jitter and shimmer for negative emotions to intensity and voicing statistics per frame. 
Instead of representing the nonstationary MFCC features using statistical functionals as in previous works, they used a set of model-based features obtained by scoring all MFCC vectors in a sentence using emotion-dependent Gaussian mixture models (GMM). The lexical features were summarized using LIWC and GI systems represented by bag-of-word stems. The visual features encapsulated facial animation parameters representing nose, mouth and chin markers, eyebrow angle, etc. A randomized tree was generated using the set of all classifiers whose performance was above a threshold parameter. The experiments were conducted in leave-one-speaker-out fashion. The unimodal feature set achieved an accuracy of around 63% whereas their combination led to an increase of around 8%. Other Multimodal Cognitive Research DeVault et al. [70] introduced SimSensei Kiosk, a virtual human interviewer named Ellie, for automatic assessment of distress indicators among humans. Distress indicators are verbal and nonverbal behaviors correlated with depression, anxiety, or posttraumatic stress disorder (PTSD). The SimSensei Kiosk was developed so that the user feels comfortable talking and sharing information, thus providing clinicians with an automatic assessment of psychological distress in a person. The evaluation of the kiosk was carried out with a Wizard-of-Oz prototype system, which had two human operators deciding verbal and nonverbal responses. The development of the SimSensei Kiosk was carried out over a period of 2 years with 351 participants, out of which 217 were male, 132 were female, and 2 did not report their gender. In this work, termed the Multi-sense
framework, a multimodal real-time sensing system was used for synchronized capture of different modalities, real-time tracking, and the fusion process. The multimodal system was also integrated with the GAVAM head tracker, CLM-Z face tracker, SHORE face detector, and more. The SimSensei Kiosk uses four statistically trained utterance classifiers to capture the utterance meaning of the users and Cerebella, a research platform for realization of the relation between mental states and human behavior. Alam et al. [31] proposed an automatic personality trait recognition framework using the YouTube personality dataset. The dataset consists of videos by 404 YouTube bloggers (194 male and 210 female). The features used for this task are linguistic, psycholinguistic, emotional, and audiovisual features. Automatic recognition of personality traits is an important topic in the field of NLP, particularly aimed at processing the interaction between human and virtual agents. High-dimensional features were selected using the relief algorithm and classification models were generated using SMO for the SVM. At the final stage, decision-level fusion for classification of personality traits was used. Other notable work in personality recognition was carried out by Sarkar et al. [27], who used the YouTube personality dataset and a logistic regression model with ridge estimator for classification purposes. They divided features into five categories, i.e., audiovisual features, text features, word statistics features, sentiment features, and gender features. A total of 1079 features were used, with 25 audiovisual features, 3 word statistics features, 5 sentiment features, 1 demographic feature, and 1045 text features. In conclusion, their in-depth feature analysis showcased helpful insights for solving the multimodal personality recognition task. Siddiquie et al. 
[71] introduced the task of exploiting multimodal affect and semantics for automatic classiﬁcation of politically persuasive web videos. Rallying A Crowd (RAC) dataset was used for experimentation with 230 videos. The approach was executed by extraction of audio, visual and textual features to capture affect and semantics in the audio-video content and sentiment in the viewers’ comments. For the audio domain, several grades of speech arousal and related semantic categories such as crowd reaction and music were detected. For the visual domain, visual sentiment and semantic content were detected. The research employs both feature-level and decision-level fusion methods. In the case of decision-level fusion, the author used both conventional- and learning-based decision fusion approaches to enhance the overall classiﬁcation performance.

2.4

Proposed Method

Overall, the method consists of three major steps. At ﬁrst, the textual, audio, and visual features are extracted, which is followed by a step that focuses on the fusion of these heterogeneous features. The last step of this method is about classifying the sentiment coming from fused multimodal signal.

2.4.1

Textual Features

For feature extraction from textual data, we used a convolutional neural network (CNN). The idea behind convolution is to take the dot product of a vector of k weights $w_k$, known as the kernel vector, with each k-gram in the sentence $s(t)$ to obtain another sequence of features $c(t) = (c_1(t), c_2(t), \ldots, c_L(t))$:

$$c_j = w_k^{T} x_{i:i+k-1} \tag{2.1}$$

We then apply a max pooling operation over the feature map and take the maximum value $\hat{c}(t) = \max\{c(t)\}$ as the feature corresponding to this particular kernel vector. We used varying kernel vectors and window sizes to obtain multiple features. For each word $x_i(t)$ in the vocabulary, a d-dimensional vector representation called word embedding was given in a lookup table that had been learned from the data [72]. The vector representation of a sentence was a concatenation of the vectors for individual words. The convolution kernels are then applied to word vectors instead of individual words. Similarly, one can have lookup tables for features other than words if these features are deemed helpful. We used these features to train higher layers of the CNN to represent bigger groups of words in sentences. We denote the feature learned at a hidden neuron h in layer l as $F_h^l$. Multiple features are learned in parallel at the same CNN layer. The features learned at each layer are used to train the next layer:

$$F^{l} = \sum_{h=1}^{n_h} w_k^{h} * F^{l-1} \tag{2.2}$$

where * denotes convolution, $w_k^h$ is a weight kernel for hidden neuron h, and $n_h$ is the total number of hidden neurons. The CNN sentence model preserves the order of words by adopting convolution kernels of gradually increasing sizes, which span an increasing number of words and ultimately the entire sentence. Each word in a sentence was represented using word embeddings. We employed the publicly available word2vec vectors, which were trained on 100 billion words from Google News. The vectors were of dimensionality d = 300, trained using the continuous bag-of-words architecture [72]. Words not present in the set of pretrained words were initialized randomly. Each sentence was wrapped to a window of 50 words. Our CNN had two convolution layers. Kernels of size 3 and 4, each with 50 feature maps, were used in the first convolution layer, and a kernel of size 2 with 100 feature maps in the second one. We used ReLU as the nonlinear activation function of the network. The convolution layers were interleaved with pooling layers of dimension 2. We used the activation values of the 500-dimensional fully connected layer of the network as our feature vector in the final fusion process.
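To make Eq. 2.1 and the max pooling step concrete, the following is a minimal NumPy sketch rather than the authors' implementation. The function name `conv_max_feature`, the random stand-in embeddings, and the randomly initialized kernels are assumptions made for illustration; only the kernel sizes (3 and 4), the 50-word window, and d = 300 come from the text.

```python
import numpy as np

def conv_max_feature(sentence_vectors, kernel):
    """Apply one kernel to every k-gram of a sentence, then max-pool.

    sentence_vectors: (n_words, d) matrix of word embeddings
    kernel:           (k, d) weight matrix spanning k consecutive words
    """
    k = kernel.shape[0]
    n_words = sentence_vectors.shape[0]
    # One feature per k-gram: dot product of the kernel with the k stacked word vectors
    feature_map = np.array([
        np.sum(kernel * sentence_vectors[j:j + k])
        for j in range(n_words - k + 1)
    ])
    # Max pooling keeps a single value per kernel
    return feature_map.max()

rng = np.random.default_rng(0)
d = 300                                   # embedding size stated in the text
sentence = rng.normal(size=(50, d))       # hypothetical stand-in for 50 embedded words
kernels = [rng.normal(size=(k, d)) for k in (3, 4)]  # hypothetical random kernels
features = np.array([conv_max_feature(sentence, w) for w in kernels])
```

Each kernel contributes one pooled value, so using many kernels of several sizes yields the multi-feature representation the paragraph describes.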

2.4.2

Audio Features

We automatically extracted audio features from each annotated segment of the videos. Audio features were extracted at a 30 Hz frame rate, using a sliding window of 100 ms. To compute the features, we used the open-source software openSMILE [73]. This toolkit automatically extracts pitch and voice intensity. Voice normalization was performed using Z-standardization, and voice intensity was thresholded to identify samples with and without voice. The features extracted by openSMILE consist of several low-level descriptors (LLD) and their statistical functionals. Some of the functionals are amplitude mean, arithmetic mean, root quadratic mean, etc. Taking into account all functionals of each LLD, we obtained 6373 features.
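The normalization, voicing threshold, and functionals mentioned above can be illustrated with a small NumPy sketch. It does not call openSMILE and is not the toolkit's implementation; the function names, the toy input, and the choice of only two functionals are assumptions for illustration.

```python
import numpy as np

def z_standardize(lld):
    """Scale each low-level descriptor (column) to zero mean and unit variance."""
    mean = lld.mean(axis=0)
    std = lld.std(axis=0)
    std = np.where(std == 0, 1.0, std)    # avoid division by zero for constant columns
    return (lld - mean) / std

def voiced_mask(intensity, threshold):
    """Mark frames whose voice intensity exceeds a (hypothetical) threshold."""
    return intensity > threshold

def functionals(lld):
    """Collapse a (frames, descriptors) matrix into one fixed-length vector."""
    arithmetic_mean = lld.mean(axis=0)
    root_quadratic_mean = np.sqrt((lld ** 2).mean(axis=0))
    return np.concatenate([arithmetic_mean, root_quadratic_mean])

# Toy example: 2 frames, 2 descriptors
frames = np.array([[1.0, 10.0],
                   [3.0, 10.0]])
normalized = z_standardize(frames)
segment_vector = functionals(frames)
```

Applying every functional to every descriptor is what turns a variable-length frame sequence into a fixed-length vector per segment.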

2.4.3

Visual Features

Since the video data is very large, we only consider every tenth frame in our training videos. The Constrained Local Model (CLM) was used to ﬁnd the outline of the face in each frame [74]. The cropped frame size was further reduced by scaling down to a lower resolution. In this way we could drastically reduce the amount of training video data. The input is a sequence of images in a video. To capture the temporal dependence, we transform each pair of consecutive images at t and t + 1 into a single image. We use kernels of varying dimensions—Kernel 1, 2, and 3—to learn Layer-1 2D features shown in Fig. 2.1 from the transformed input. Similarly, the second layer also uses kernels of varying dimensions to learn 2D features. Up-sampling layer transformed features of different kernel sizes into uniform 2D features. Next, a logistic layer of neurons was used.

Neuron with Highly Activated Features of Eyes and Ear

Fig. 2.1 Top image segments activated at two feature detectors in the ﬁrst layer of deep CNN

Preprocessing involves scaling all video frames to half the resolution. Each pair of consecutive video frames was converted into a single frame to achieve temporal convolution features. All the frames were standardized to 250 × 500 pixels by padding with zeros. The first convolution layer contained 100 kernels of size 10 × 20; the next convolution layer had 100 kernels of size 20 × 30; this layer was followed by a logistic layer of 300 neurons and a recurrent layer of 50 neurons. The convolution layers were interleaved with pooling layers of dimension 2 × 2.
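A rough NumPy sketch of this preprocessing is given below. The text does not say how a pair of consecutive frames is merged or where the zero padding is placed, so the vertical stacking, the bottom/right padding, and the subsampling used to halve the resolution are all assumptions; the function names are hypothetical as well.

```python
import numpy as np

def pad_to(frame, height=250, width=500):
    """Zero-pad a 2D grayscale frame at the bottom and right (assumed placement)."""
    h, w = frame.shape
    if h > height or w > width:
        raise ValueError("frame is larger than the target size")
    padded = np.zeros((height, width), dtype=frame.dtype)
    padded[:h, :w] = frame
    return padded

def preprocess(video, step=10):
    """video: array of shape (n_frames, h, w) holding cropped grayscale faces."""
    sampled = video[::step]              # keep every tenth frame
    halved = sampled[:, ::2, ::2]        # half resolution by subsampling (assumption)
    # Assumption: frames t and t+1 are merged by stacking them vertically
    pairs = [np.vstack([halved[t], halved[t + 1]])
             for t in range(len(halved) - 1)]
    return np.stack([pad_to(p) for p in pairs])

video = np.ones((35, 200, 400))          # hypothetical clip: 35 frames of 200 x 400
inputs = preprocess(video)               # 4 sampled frames give 3 merged pairs
```

Whatever merge rule is used, the key point is that each network input carries two time steps, which lets 2D kernels pick up temporal change.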

2.4.4

Context-Dependent Feature Extraction

In sequence classiﬁcation, the classiﬁcation of each member is dependent on the other members. Utterances in a video maintain a sequence. We hypothesize that within a video there is a high probability of inter-utterance dependency with respect to their sentimental clues. In particular, when classifying one utterance, other utterances can provide important contextual information. This calls for a model that takes into account such interdependencies and the effect these might have on the current utterance. To capture this ﬂow of informational triggers across utterances, we use an LSTM-based recurrent network scheme [75]. Long Short-Term Memory (LSTM) [76] is a kind of recurrent neural network (RNN), an extension of conventional feed-forward neural network. Speciﬁcally, LSTM cells are capable of modeling long-range dependencies, which other traditional RNNs fail to do given the vanishing gradient issue. Each LSTM cell consists of an input gate i, an output gate o, and a forget gate f, to control the ﬂow of information. Current research [77] indicates the beneﬁt of using such networks to incorporate contextual information in the classiﬁcation process. In our case, the LSTM network serves the purpose of context-dependent feature extraction by modeling relations among utterances. We term our architecture “contextual LSTM.” We propose several architectural variants of it later in the chapter.

2.4.5

Contextual LSTM Architecture

Let unimodal features have dimension k; each utterance is thus represented by a feature vector $x_{i,t} \in \mathbb{R}^{k}$, where t represents the tth utterance of the video i. For a video, we collect the vectors for all the utterances in it, to get $X_i = [x_{i,1}, x_{i,2}, \ldots, x_{i,L_i}] \in \mathbb{R}^{L_i \times k}$, where $L_i$ represents the number of utterances in the video. This matrix $X_i$ serves as the input to the LSTM. Figure 2.2 demonstrates the functioning of this LSTM module.

Fig. 2.2 Contextual LSTM network: input features are passed through a unidirectional LSTM layer, followed by a dense and then a softmax layer. The dense layer activations serve as the output features

In the procedure getLstmFeatures($X_i$) of Algorithm 2.1, each utterance $x_{i,t}$ is passed through an LSTM cell using the equations mentioned in lines 32–37. The output of the LSTM cell $h_{i,t}$ is then fed into a dense layer and finally into a softmax layer (lines 38–39). The activations of the dense layer $z_{i,t}$ are used as the context-dependent features of the contextual LSTM.

Algorithm 2.1 Proposed architecture (Table 2.1)

Table 2.1 Summary of notations used in Algorithm 2.1

Weight                                                   Bias
$W_i, W_f, W_c, W_o \in \mathbb{R}^{d \times k}$         $b_i, b_f, b_c, b_o \in \mathbb{R}^{d}$
$P_i, P_f, P_c, P_o, V_o \in \mathbb{R}^{d \times d}$    $b_z \in \mathbb{R}^{m}$
$W_z \in \mathbb{R}^{m \times d}$                        $b_{sft} \in \mathbb{R}^{c}$
$W_{sft} \in \mathbb{R}^{c \times m}$

d dimension of hidden unit, k dimension of input vectors to LSTM layer, c number of classes
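The shapes in Table 2.1 are enough to sketch the forward pass of getLstmFeatures in NumPy. This is an illustrative sketch, not Algorithm 2.1 itself: the gate equations below are the standard LSTM form with an output-gate term on the cell state (suggested by the presence of $V_o$), the ReLU on the dense layer and the treatment of m as the dense-layer size are assumptions, and the random initialization is purely for demonstration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def init_params(k, d, m, c, seed=0):
    """Random parameters with the shapes listed in Table 2.1."""
    rng = np.random.default_rng(seed)
    p = {}
    for gate in "ifco":
        p["W" + gate] = rng.normal(scale=0.1, size=(d, k))   # input weights
        p["P" + gate] = rng.normal(scale=0.1, size=(d, d))   # recurrent weights
        p["b" + gate] = np.zeros(d)
    p["Vo"] = rng.normal(scale=0.1, size=(d, d))             # cell-to-output-gate weights
    p["Wz"], p["bz"] = rng.normal(scale=0.1, size=(m, d)), np.zeros(m)
    p["Wsft"], p["bsft"] = rng.normal(scale=0.1, size=(c, m)), np.zeros(c)
    return p

def get_lstm_features(X, p):
    """X: (L_i, k) utterance features of one video, processed in order."""
    d = p["bi"].shape[0]
    h, cell = np.zeros(d), np.zeros(d)
    dense_out, class_probs = [], []
    for x in X:
        i = sigmoid(p["Wi"] @ x + p["Pi"] @ h + p["bi"])        # input gate
        f = sigmoid(p["Wf"] @ x + p["Pf"] @ h + p["bf"])        # forget gate
        candidate = np.tanh(p["Wc"] @ x + p["Pc"] @ h + p["bc"])
        cell = f * cell + i * candidate
        o = sigmoid(p["Wo"] @ x + p["Po"] @ h + p["Vo"] @ cell + p["bo"])
        h = o * np.tanh(cell)
        z = np.maximum(0.0, p["Wz"] @ h + p["bz"])              # dense layer (ReLU assumed)
        dense_out.append(z)
        class_probs.append(softmax(p["Wsft"] @ z + p["bsft"]))
    return np.array(dense_out), np.array(class_probs)

params = init_params(k=8, d=5, m=4, c=2)
Z, probs = get_lstm_features(np.ones((6, 8)), params)   # 6 utterances, 8 features each
```

The rows of `Z` play the role of the context-dependent features $z_{i,t}$: each depends on the current utterance and, through `h` and `cell`, on the utterances before it.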

The training of the LSTM network is performed using categorical cross-entropy on each utterance’s softmax output per video, i.e.,

$$\text{loss} = -\frac{1}{\sum_{i=1}^{M} L_i} \sum_{i=1}^{M} \sum_{j=1}^{L_i} \sum_{c=1}^{C} y_{i,c}^{j} \log_2\left(\hat{y}_{i,c}^{j}\right) \tag{2.3}$$

where M is the total number of videos, $L_i$ is the number of utterances for the ith video, C is the number of classes, $y_{i,c}^{j}$ is the original output of class c, and $\hat{y}_{i,c}^{j}$ is the predicted output for the jth utterance of the ith video. As a regularization method, dropout between the LSTM cell and dense layer is introduced to avoid overfitting. As the videos do not have the same number of utterances, padding is introduced to serve as neutral utterances. To avoid the proliferation of noise within the network, bit masking is done on these padded utterances to eliminate their effect in the network. Hyperparameter tuning is done on the train set by splitting it into train and validation components with an 80/20% split. RMSprop has been used as the optimizer, which is known to resolve Adagrad’s radically diminishing learning rates [78]. After feeding the train set to the network, the test set is passed through it to generate their context-dependent features. These features are finally passed through an SVM for the final classification.
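The interaction between the loss in Eq. 2.3 and the masking of padded utterances can be sketched as follows. The function name, the array layout, and the small `eps` guard against log(0) are assumptions for illustration; this is not the training code used in the chapter.

```python
import numpy as np

def masked_cross_entropy(y_true, y_pred, lengths):
    """Categorical cross-entropy (base 2) averaged over real utterances only.

    y_true, y_pred: (M, L_max, C) one-hot targets and softmax outputs, padded
    lengths:        true number of utterances L_i for each of the M videos
    """
    _, l_max, _ = y_true.shape
    # mask[i, j] is True for real utterances and False for padding
    mask = np.arange(l_max)[None, :] < np.asarray(lengths)[:, None]
    eps = 1e-12                                   # numerical guard, not part of Eq. 2.3
    per_utterance = -(y_true * np.log2(y_pred + eps)).sum(axis=2)
    return (per_utterance * mask).sum() / mask.sum()

# One video with two real utterances and one padded slot
y_true = np.array([[[1, 0], [0, 1], [0, 0]]], dtype=float)
y_pred = np.array([[[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]])
loss = masked_cross_entropy(y_true, y_pred, lengths=[2])
```

Because the mask removes padded positions from both the numerator and the normalizer, the padded slot has no influence on the value regardless of what the network outputs there.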

2.4.6

Fusion

In order to fuse the information extracted from each modality, we concatenated feature vectors extracted from each modality and sent the combined vector to the contextual framework for the final decision. This scheme of fusion is called feature-level fusion. We discuss the results of this fusion in Sect. 2.5.
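Feature-level fusion as described here amounts to a per-utterance concatenation. The sketch below uses the textual (500) and audio (6373) dimensionalities given earlier; the visual dimensionality of 300 and the function name are assumptions for illustration only.

```python
import numpy as np

def feature_level_fusion(text, audio, video):
    """Concatenate per-utterance features along the feature axis."""
    return np.concatenate([text, audio, video], axis=1)

n_utterances = 4                                   # hypothetical video with 4 utterances
text = np.zeros((n_utterances, 500))               # CNN fully connected layer activations
audio = np.zeros((n_utterances, 6373))             # openSMILE functionals
video = np.zeros((n_utterances, 300))              # assumed visual feature size
fused = feature_level_fusion(text, audio, video)   # one row per utterance
```

The fused matrix then takes the place of the unimodal matrix as input to the contextual LSTM.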

2.5

Experiments and Observations

2.5.1

Datasets

Multimodal Sentiment Analysis Datasets For our experiments, we use the MOUD dataset, developed by Perez-Rosas et al. [21]. They collected 80 product review and recommendation videos from YouTube. Each video was segmented into its utterances and each utterance was labeled with a sentiment (positive, negative, or neutral). On average, each video has six utterances; each utterance is 5 s long. The dataset contains 498 utterances labeled positive, negative, or neutral. In our experiment we did not consider neutral labels, which led to a final dataset of 448 utterances. In a similar fashion, Zadeh et al. [79] constructed a multimodal sentiment analysis dataset called Multimodal Opinion-Level Sentiment Intensity (MOSI), which is bigger than MOUD, consisting of 2199 opinionated utterances in 93 videos by 89 speakers. The videos address a large array of topics, such as movies, books, and products. In the experiment to address the generalizability issues, we trained a model on MOSI and tested on MOUD. Multimodal Emotion Recognition Dataset The USC IEMOCAP database [80] was collected for the purpose of studying multimodal expressive dyadic interactions. This dataset contains 12 h of video data split into 5 min of dyadic interaction between professional male and female actors. Each interaction session was split into spoken utterances. At least three annotators assigned to each utterance one emotion category: happy, sad, neutral, angry, surprised, excited, frustration, disgust, fear, and other. In this work, we considered only the utterances with majority agreement (i.e., at least two out of three annotators labeled the same emotion) in the emotion classes of angry, happy, sad, and neutral. Figure 2.3 presents the visualization of MOSI, MOUD, and IEMOCAP datasets.
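The majority-agreement filter for IEMOCAP can be sketched in a few lines of standard-library Python. The function name and the label strings are hypothetical; the rule itself (at least two matching annotations, restricted to four classes) follows the description above.

```python
from collections import Counter

KEPT_CLASSES = {"angry", "happy", "sad", "neutral"}

def majority_label(annotations, min_votes=2):
    """Return the agreed emotion, or None if the utterance should be dropped."""
    label, votes = Counter(annotations).most_common(1)[0]
    if votes >= min_votes and label in KEPT_CLASSES:
        return label
    return None

kept = majority_label(["happy", "happy", "sad"])        # two of three agree
dropped = majority_label(["happy", "sad", "angry"])     # no agreement
excluded = majority_label(["excited", "excited", "sad"])  # agreed, but class not used
```

With exactly three annotators a label that reaches two votes is unique; with more annotators ties become possible and would need an explicit tie-breaking rule, which this sketch does not define.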

2.5.2

Speaker-Independent Experiment

Most research in multimodal sentiment analysis is performed on datasets with speaker overlap between the train and test splits. Because each individual expresses emotions and sentiments in a unique way, finding generic, person-independent features for sentiment analysis is very important. However, given this overlap, where the model has already seen the behavior of a certain individual, the results do not reflect true generalization. In real-world applications, the model should be robust to person variance. Thus, we performed person-independent experiments to emulate unseen conditions. This time, the train/test splits of the datasets were completely disjoint with respect to speakers. During testing, our models had to classify emotions and sentiments from utterances by speakers they had never seen before. Below we list the procedure of this speaker-independent experiment:

– IEMOCAP: As this dataset contains ten speakers, we performed a tenfold speaker-independent test, where in each round one of the speakers was in the test set.
– MOUD: This dataset contains videos of about 80 people reviewing various products. Here, reviewers review products in Spanish. Each utterance in the video has been labeled as positive, negative, or neutral. In our experiments, we consider only the positive and negative sentiment labels. The speakers were divided into five groups and a fivefold person-independent experiment was run, where in every fold one of the five groups was in the test set. Finally, we took the average of the macro F_score to summarize the results (see Table 2.2).
– MOSI: The MOSI dataset is rich in sentimental expressions; in it, 93 people review topics in English. The videos are segmented, and each segment's sentiment label is scored between −3 and +3 by 5 annotators. We took the average of these labels as the sentiment polarity, thus considering two classes, positive and negative, as sentiment labels. As with MOUD, speakers were divided into five groups and a fivefold person-independent experiment was run. During each fold, around 75 people were in the train set and the remaining people were in the test set. The train set was shuffled and further split randomly 80–20% into train and validation splits for parameter tuning.

Fig. 2.3 Visualization of MOSI and IEMOCAP datasets when unimodal features and multimodal features are used. (Plots omitted; panels: Text, Audio, Video, and All, with classes Positive and Negative for MOSI and Happy, Sad, Neutral, and Anger for IEMOCAP.)

Comparison with the Speaker-Dependent Experiment

Compared with the speaker-dependent experiment, performance in the speaker-independent experiment is poor. This is due to the lack of knowledge about the speakers in the dataset. Table 2.3 shows the performance obtained in the speaker-dependent experiment. It can be seen that the audio modality consistently performs better than the visual modality on both the MOSI and IEMOCAP datasets. The text modality plays the most important role in both emotion recognition and sentiment analysis. The fusion of the modalities has more impact on emotion recognition than on sentiment analysis. The RMSE and TP rate of the experiments using different modalities on the IEMOCAP and MOSI datasets are shown in Fig. 2.4.
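The speaker-disjoint splits used in this section can be sketched as follows. The round-robin assignment of speakers to groups is an assumption made for illustration, since the chapter does not say how the five groups were formed, and the function name `speaker_independent_folds` is hypothetical. Setting the number of folds to the number of speakers gives the leave-one-speaker-out protocol used for IEMOCAP.

```python
from typing import List, Sequence, Tuple


def speaker_independent_folds(
    speakers: Sequence[str], n_folds: int
) -> List[Tuple[List[int], List[int]]]:
    """Build (train_indices, test_indices) pairs with no speaker overlap.

    `speakers[i]` is the speaker ID of utterance i. Speakers (not
    utterances) are assigned to folds, so all utterances of one speaker
    always land on the same side of a split.
    """
    unique = sorted(set(speakers))
    group_of = {spk: i % n_folds for i, spk in enumerate(unique)}
    folds = []
    for fold in range(n_folds):
        test_idx = [i for i, s in enumerate(speakers) if group_of[s] == fold]
        train_idx = [i for i, s in enumerate(speakers) if group_of[s] != fold]
        folds.append((train_idx, test_idx))
    return folds


# Toy example: ten utterances from seven speakers, five folds.
utt_speakers = ["s1", "s1", "s2", "s3", "s3", "s4", "s5", "s5", "s6", "s7"]
for train_idx, test_idx in speaker_independent_folds(utt_speakers, n_folds=5):
    train_spk = {utt_speakers[i] for i in train_idx}
    test_spk = {utt_speakers[i] for i in test_idx}
    print(sorted(test_spk), "overlap:", train_spk & test_spk)
```

A validation split for parameter tuning, as described for MOSI, would then be carved out of each fold's training indices.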

Table 2.2 Macro F_score reported for speaker-independent classification

Modality     Source                  IEMOCAP   MOUD    MOSI
Unimodal     Audio                   51.52     53.70   57.14
             Video                   41.79     47.68   58.46
             Text                    65.13     48.40   75.16
Bimodal      Text + Audio            70.79     57.10   75.72
             Text + Video            68.55     49.22   75.06
             Audio + Video           52.15     62.88   62.4
Multimodal   Text + Audio + Video    71.59     67.90   76.66

IEMOCAP: tenfold speaker-independent average. MOUD: fivefold speaker-independent average. MOSI: fivefold speaker-independent average. Notes: A stands for Audio, V for Video, T for Text
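For readers who want to reproduce the kind of summary reported in Table 2.2, the sketch below computes a macro F_score per fold and averages it over folds. It is a plain-Python illustration under the usual definition of macro F1 (unweighted mean of per-class F1); the chapter does not spell out its exact computation, and the toy labels are invented for the example.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores: List[float] = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def average_over_folds(fold_scores: Sequence[float]) -> float:
    """Summarize a k-fold experiment by the mean of its per-fold scores."""
    return sum(fold_scores) / len(fold_scores)


# Two toy folds with invented predictions.
fold_a = macro_f1(["pos", "pos", "neg", "neg"], ["pos", "neg", "neg", "neg"])
fold_b = macro_f1(["pos", "neg"], ["pos", "neg"])
print(round(100 * average_over_folds([fold_a, fold_b]), 2))
```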

Table 2.3 Tenfold cross-validation results on IEMOCAP dataset and fivefold CV results (macro F_Score) on MOSI dataset

Modality           Source                  IEMOCAP   MOSI
Unimodal           Audio                   66.20     64.00
                   Video                   60.30     62.11
                   Text                    67.90     78.00
Bimodal            Text + Audio            78.20     76.60
                   Text + Video            76.30     78.80
                   Audio + Video           73.90     66.65
Multimodal         Text + Audio + Video    81.70     78.80
State of the art   Text + Audio + Video    69.35a    73.55b

a By [69]
b By [23]

2.5.3

Contributions of the Modalities

As expected, in all kinds of experiments, bimodal and trimodal models perform better than unimodal models. Overall, the audio modality performed better than the visual modality on all the datasets. Except on the MOUD dataset, the unimodal performance of the text modality is notably better than that of the other two modalities (Fig. 2.5). Table 2.3 also presents the comparison with the state of the art. The present method outperformed the state of the art by 12% and 5% on the IEMOCAP and MOSI datasets, respectively (see footnote 1). The method proposed by Poria et al. is similar to ours, except that they used a standard CLM-based facial feature extraction method. This suggests that our CNN-based visual feature extraction algorithm helped to outperform the method by Poria et al.
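As a quick check of the reported margins, the arithmetic below compares the trimodal scores in Table 2.3 with the state-of-the-art rows. It suggests that the 12% and 5% figures correspond to absolute differences in F_score points; the relative gains are larger. The dictionary and function names are illustrative only.

```python
from typing import Dict, Tuple

# Trimodal results and state-of-the-art baselines, copied from Table 2.3.
ours: Dict[str, float] = {"IEMOCAP": 81.70, "MOSI": 78.80}
baseline: Dict[str, float] = {"IEMOCAP": 69.35, "MOSI": 73.55}


def gains(new: float, old: float) -> Tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - old
    relative = 100.0 * absolute / old
    return absolute, relative


for dataset in ours:
    absolute, relative = gains(ours[dataset], baseline[dataset])
    print(f"{dataset}: +{absolute:.2f} points, +{relative:.2f}% relative")
```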

2.5.4

Generalizability of the Models

To test the generalization ability of the models, we trained the framework on the MOSI dataset in a speaker-independent fashion and tested it on the MOUD dataset. Table 2.4 shows that the model trained on MOSI performed poorly on MOUD. Investigating the reasons, we found two major issues. First, the reviews in the MOUD dataset were recorded in Spanish, whereas the MOSI dataset contains reviews in English, so the audio modality fails badly in recognition. Second, the text modality also performed very poorly, for the same reason.

1 We have reimplemented the method by Poria et al. [23].

Fig. 2.4 Experiments on IEMOCAP and MOSI datasets. The top-left figure shows the root mean square error (RMSE) of the models on IEMOCAP and MOSI. The top-right figure shows the dataset distribution. The bottom-left and bottom-right figures present the TP rate of the models on the IEMOCAP and MOSI datasets, respectively. (Plots omitted; the model axis lists A, V, T, T+A, T+V, A+V, and A+V+T; the TP-rate panels cover Happy, Sad, Neutral, and Anger for IEMOCAP and Positive and Negative for MOSI.)
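The two quantities plotted in Fig. 2.4 can be computed as in the sketch below, which uses the standard definitions of root mean square error and true-positive rate (per-class recall). How the chapter maps class labels to numbers for the RMSE is not stated, so the numeric encoding in the example is an assumption, as are the function names.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between numeric targets and predictions."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def tp_rate(y_true: Sequence[str], y_pred: Sequence[str], cls: str) -> float:
    """True-positive rate of one class: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Toy binary sentiment example, with positive encoded as 1 and negative as 0.
print(rmse([1, 0, 1, 1], [1, 1, 1, 0]))
print(tp_rate(["happy", "happy", "sad"], ["happy", "sad", "sad"], "happy"))
```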

Fig. 2.5 Performance of the modalities on the datasets. Red line indicates the median of the F_score. (Plot omitted; title: Modality Comparison; x-axis: modality (Text, Audio, Visual); y-axis: F_score.)

Table 2.4 Cross-dataset results

Modality     Source                  Macro F_Score (%)
Unimodal     Audio                   41.60
             Video                   45.50
             Text                    50.89
Bimodal      Text + Audio            51.70
             Text + Video            52.12
             Audio + Video           46.35
Multimodal   Text + Audio + Video    52.44

Model (with previous configurations) trained on MOSI dataset and tested on MOUD dataset

2.5.5

Visualization of the Datasets

The MOSI visualizations (Fig. 2.3) show how the data are distributed under unimodal and multimodal features. For the textual and audio modalities, the classes form broad clusters with substantial overlap. The video and combined (All) features separate the clusters in a more structured way, but the overlap is reduced only with the multimodal features. This gives an intuitive explanation of the improved performance of the multimodal model.


The IEMOCAP visualizations (Fig. 2.3) provide insight into the four-class distribution under unimodal and multimodal features. The multimodal distribution clearly has the least overlap (the red and blue points are more clearly separated from the rest), and its sparse distribution aids the classification process.

2.6

Discussion

Considering context dependency (see Sect. 2.1) is of prime importance for utterance-level sentiment classification. For example, in the utterance “What would have been a better name for the movie,” the speaker is attempting to comment on the movie by suggesting an appropriate name. However, the sentiment is expressed implicitly and requires contextual knowledge about the mood of the speaker and his or her opinion about the film. The baseline unimodal SVM and the state of the art fail to classify it correctly (see footnote 2). However, information from neighboring utterances, such as (1) “And I really enjoyed it” and (2) “The countryside which they showed while going through Ireland was astoundingly beautiful,” indicates a positive context and helps our contextual model to classify it correctly. Such contextual relationships are prevalent throughout the dataset.

To better understand the roles of the modalities in the overall classification, we also carried out some qualitative analysis. For example, the utterance “who doesn’t have any presence or greatness at all” was classified as positive by the audio classifier (“doesn’t” was spoken normally by the speaker, but “presence and greatness at all” was spoken with enthusiasm). However, the textual modality caught the negation induced by “doesn’t” and classified the utterance correctly. In another utterance, “amazing special effects,” there was no trace of enthusiasm in the speaker’s voice or face, so the audiovisual classifier failed to identify the positivity of this utterance. The textual classifier, on the other hand, correctly detected the polarity as positive. Conversely, the textual classifier classified the utterance “that like to see comic book characters treated responsibly” as positive, possibly because of the presence of positive phrases such as “like to see” and “responsibly.” However, the high pitch of anger in the person’s voice and the frowning face help identify this as a negative utterance.

In some cases, the predictions of the proposed method are wrong because of the difficulty of recognizing the face and the noisy audio signal in the utterances. Also, in cases where the sentiment is very weak and noncontextual, the proposed approach shows some bias toward the surrounding utterances, which leads to wrong predictions. This can be solved by developing a context-aware attention mechanism.

2 RNTN classifies it as neutral. It can be seen here: http://nlp.stanford.edu:8080/sentiment/rntnDemo.html

2.7

Conclusion

We have presented a framework for multimodal sentiment analysis and multimodal emotion recognition, which outperforms the state of the art in both tasks by a significant margin. We have also discussed some major aspects of the multimodal sentiment analysis problem, such as the performance of speaker-independent models and the cross-dataset performance of the models. Our future work will focus on extracting semantics from the visual features, on the relatedness of the cross-modal features, and on their fusion.

References

1. Cambria, E., Das, D., Bandyopadhyay, S., Feraco, A.: A Practical Guide to Sentiment Analysis. Springer, Cham (2017)
2. Poria, S., Cambria, E., Bajpai, R., Hussain, A.: A review of affective computing: from unimodal analysis to multimodal fusion. Inf. Fusion. 37, 98–125 (2017)
3. Poria, S., Cambria, E., Hazarika, D., Majumder, N., Zadeh, A., Morency, L.-P.: Context-dependent sentiment analysis in user-generated videos. ACL. 2, 873–883 (2017)
4. Chaturvedi, I., Ragusa, E., Gastaldo, P., Zunino, R., Cambria, E.: Bayesian network based extreme learning machine for subjectivity detection. J. Franklin Inst. 355(4), 1780–1797 (2018)
5. Cambria, E., Poria, S., Hazarika, D., Kwok, K.: SenticNet 5: discovering conceptual primitives for sentiment analysis by means of context embeddings. In: AAAI, pp. 1795–1802 (2018)
6. Oneto, L., Bisio, F., Cambria, E., Anguita, D.: Statistical learning theory & ELM for big social data analysis. IEEE Comput. Intell. Mag. 11(3), 45–55 (2016)
7. Cambria, E., Hussain, A.: Sentic Computing: A Common-Sense-Based Framework for Concept-Level Sentiment Analysis. Springer, Cham (2015)
8. Cambria, E., Poria, S., Gelbukh, A., Thelwall, M.: Sentiment analysis is a big suitcase. IEEE Intell. Syst. 32(6), 74–80 (2017)
9. Poria, S., Chaturvedi, I., Cambria, E., Bisio, F.: Sentic LDA: improving on LDA with semantic similarity for aspect-based sentiment analysis. In: IJCNN, pp. 4465–4473 (2016)
10. Ma, Y., Cambria, E., Gao, S.: Label embedding for zero-shot fine-grained named entity typing. In: COLING, pp. 171–180 (2016)
11. Xia, Y., Cambria, E., Hussain, A., Zhao, H.: Word polarity disambiguation using bayesian model & opinion-level features. Cogn. Comput. 7(3), 369–380 (2015)
12. Zhong, X., Sun, A., Cambria, E.: Time expression analysis and recognition using syntactic token types and general heuristic rules. In: ACL, pp. 420–429 (2017)
13. Majumder, N., Poria, S., Gelbukh, A., Cambria, E.: Deep learning-based document modeling for personality detection from text. IEEE Intell. Syst. 32(2), 74–79 (2017)
14. Poria, S., Cambria, E., Hazarika, D., Vij, P.: A deeper look into sarcastic tweets using deep convolutional neural networks. In: COLING, pp. 1601–1612 (2016)
15. Xing, F., Cambria, E., Welsch, R.: Natural language based financial forecasting: a survey. Artif. Intell. Rev. 50(1), 49–73 (2018)
16. Ebrahimi, M., Hossein, A., Sheth, A.: Challenges of sentiment analysis for dynamic events. IEEE Intell. Syst. 32(5), 70–75 (2017)
17. Cambria, E., Hussain, A., Durrani, T., Havasi, C., Eckl, C., Munro, J.: Sentic computing for patient centered application. In: IEEE ICSP, pp. 1279–1282 (2010)
18. Valdivia, A., Luzon, V., Herrera, F.: Sentiment analysis in tripadvisor. IEEE Intell. Syst. 32(4), 72–77 (2017)


19. Cavallari, S., Zheng, V., Cai, H., Chang, K., Cambria, E.: Learning community embedding with community detection and node embedding on graphs. In: CIKM, pp. 377–386 (2017)
20. Mihalcea, R., Garimella, A.: What men say, what women hear: finding gender-specific meaning shades. IEEE Intell. Syst. 31(4), 62–67 (2016)
21. Pérez-Rosas, V., Mihalcea, R., Morency, L.-P.: Utterance-level multimodal sentiment analysis. ACL. 1, 973–982 (2013)
22. Wöllmer, M., Weninger, F., Knaup, T., Schuller, B., Sun, C., Sagae, K., Morency, L.-P.: Youtube movie reviews: Sentiment analysis in an audio-visual context. IEEE Intell. Syst. 28(3), 46–53 (2013)
23. Poria, S., Cambria, E., Gelbukh, A.: Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In: Proceedings of EMNLP, pp. 2539–2544 (2015)
24. Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods: audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 31(1), 39–58 (2009)
25. D’mello, S.K., Kory, J.: A review and meta-analysis of multimodal affect detection systems. ACM Comput. Surv. 47(3), 43–79 (2015)
26. Rosas, V., Mihalcea, R., Morency, L.-P.: Multimodal sentiment analysis of spanish online videos. IEEE Intell. Syst. 28(3), 38–45 (2013)
27. Sarkar, C., Bhatia, S., Agarwal, A., Li, J.: Feature analysis for computational personality recognition using youtube personality data set. In: Proceedings of the 2014 ACM Multi Media on Workshop on Computational Personality Recognition, pp. 11–14. ACM (2014)
28. Poria, S., Cambria, E., Hussain, A., Huang, G.-B.: Towards an intelligent framework for multimodal affective data analysis. Neural Netw. 63, 104–116 (2015)
29. Monkaresi, H., Sazzad Hussain, M., Calvo, R.A.: Classification of affects using head movement, skin color features and physiological signals. In: Systems, Man, and Cybernetics (SMC), 2012 IEEE International Conference on, pp. 2664–2669. IEEE (2012)
30. Wang, S., Zhu, Y., Wu, G., Ji, Q.: Hybrid video emotional tagging using users’ eeg & video content. Multimed. Tools Appl. 72(2), 1257–1283 (2014)
31. Alam, F., Riccardi, G.: Predicting personality traits using multimodal information. In: Proceedings of the 2014 ACM Multi Media on Workshop on Computational Personality Recognition, pp. 15–18. ACM (2014)
32. Cai, G., Xia, B.: Convolutional neural networks for multimedia sentiment analysis. In: National CCF Conference on Natural Language Processing and Chinese Computing, pp. 159–167. Springer (2015)
33. Yamasaki, T., Fukushima, Y., Furuta, R., Sun, L., Aizawa, K., Bollegala, D.: Prediction of user ratings of oral presentations using label relations. In: Proceedings of the 1st International Workshop on Affect & Sentiment in Multimedia, pp. 33–38. ACM (2015)
34. Glodek, M., Reuter, S., Schels, M., Dietmayer, K., Schwenker, F.: Kalman filter based classifier fusion for affective state recognition. In: Multiple Classifier Systems, pp. 85–94. Springer (2013)
35. Dobrišek, S., Gajšek, R., Mihelič, F., Pavešić, N., Štruc, V.: Towards efficient multi-modal emotion recognition. Int. J. Adv. Rob. Syst. 10, 53 (2013)
36. Mansoorizadeh, M., Charkari, N.M.: Multimodal information fusion application to human emotion recognition from face and speech. Multimed. Tools Appl. 49(2), 277–297 (2010)
37. Poria, S., Cambria, E., Howard, N., Huang, G.-B., Hussain, A.: Fusing audio, visual and textual clues for sentiment analysis from multimodal content. Neurocomputing. 174, 50–59 (2016)
38. Lin, J.-C., Wu, C.-H., Wei, W.-L.: Error weighted semi-coupled hidden markov model for audio-visual emotion recognition. IEEE Trans. Multimed. 14(1), 142–156 (2012)
39. Lu, K., Jia, Y.: Audio-visual emotion recognition with boosted coupled hmm. In: 21st International Conference on Pattern Recognition (ICPR), IEEE 2012, pp. 1148–1151 (2012)


40. Metallinou, A., Wöllmer, M., Katsamanis, A., Eyben, F., Schuller, B., Narayanan, S.: Context-sensitive learning for enhanced audiovisual emotion classification. IEEE Trans. Affect. Comput. 3(2), 184–198 (2012)
41. Baltrusaitis, T., Banda, N., Robinson, P.: Dimensional affect recognition using continuous conditional random fields. In: Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pp. 1–8. IEEE (2013)
42. Wöllmer, M., Kaiser, M., Eyben, F., Schuller, B., Rigoll, G.: Lstm-modeling of continuous emotions in an audiovisual affect recognition framework. Image Vis. Comput. 31(2), 153–163 (2013)
43. Song, M., Jiajun, B., Chen, C., Li, N.: Audio-visual based emotion recognition-a new approach. Comput. Vis. Pattern Recognit. 2, II–1020 (2004)
44. Zeng, Z., Hu, Y., Liu, M., Fu, Y., Huang, T.S.: Training combination strategy of multi-stream fused hidden markov model for audio-visual affect recognition. In: Proceedings of the 14th Annual ACM International Conference on Multimedia, pp. 65–68. ACM (2006)
45. Caridakis, G., Malatesta, L., Kessous, L., Amir, N., Raouzaiou, A., Karpouzis, K.: Modeling naturalistic affective states via facial & vocal expressions recognition. In: Proceedings of the 8th International Conference on Multimodal Interfaces, pp. 146–154. ACM (2006)
46. Petridis, S., Pantic, M.: Audiovisual discrimination between laughter and speech. In: International Conference on Acoustics, Speech and Signal Processing, ICASSP 2008. IEEE, pp. 5117–5120 (2008)
47. Sebe, N., Cohen, I., Gevers, T., Huang, T.S.: Emotion recognition based on joint visual and audio cues. In: 18th International Conference on Pattern Recognition, ICPR 2006, IEEE, vol. 1, pp. 1136–1139 (2006)
48. Atrey, P.K., Anwar Hossain, M., Saddik, A.E., Kankanhalli, M.S.: Multimodal fusion for multimedia analysis: a survey. Multimed. Syst. 16(6), 345–379 (2010)
49. Corradini, A., Mehta, M., Bernsen, N.O., Martin, J., Abrilian, S.: Multimodal input fusion in human-computer interaction. Comput. Syst. Sci. 198, 223 (2005)
50. Iyengar, G., Nock, H.J., Neti, C.: Audio-visual synchrony for detection of monologues in video archives. In: Proceedings of International Conference on Multimedia and Expo, ICME’03, IEEE, vol. 1, pp. 772–775 (2003)
51. Adams, W.H., Iyengar, G., Lin, C.-Y., Naphade, M.R., Neti, C., Nock, H.J., Smith, J.R.: Semantic indexing of multimedia content using visual, audio & text cues. EURASIP J. Adv. Signal Process. 2003(2), 1–16 (2003)
52. Nefian, A.V., Liang, L., Pi, X., Liu, X., Murphy, K.: Dynamic Bayesian networks for audiovisual speech recognition. EURASIP J. Adv. Signal Process. 2002(11), 1–15 (2002)
53. Nickel, K., Gehrig, T., Stiefelhagen, R., McDonough, J.: A joint particle filter for audio-visual speaker tracking. In: Proceedings of the 7th International Conference on Multimodal Interfaces, pp. 61–68. ACM (2005)
54. Potamitis, I., Chen, H., Tremoulis, G.: Tracking of multiple moving speakers with multiple microphone arrays. IEEE Trans. Speech Audio Process. 12(5), 520–529 (2004)
55. Morency, L.-P., Mihalcea, R., Doshi, P.: Towards multimodal sentiment analysis: harvesting opinions from the web. In: Proceedings of the 13th International Conference on Multimodal Interfaces, pp. 169–176. ACM (2011)
56. Gunes, H., Pantic, M.: Dimensional emotion prediction from spontaneous head gestures for interaction with sensitive artificial listeners. In: International Conference on Intelligent Virtual Agents, pp. 371–377 (2010)
57. Valstar, M.F., Almaev, T., Girard, J.M., McKeown, G., Mehu, M., Yin, L., Pantic, M., Cohn, J.F.: Fera 2015-second facial expression recognition and analysis challenge. Automat. Face Gesture Recognit. 6, 1–8 (2015)
58. Nicolaou, M.A., Gunes, H., Pantic, M.: Automatic segmentation of spontaneous data using dimensional labels from multiple coders. In: Proceedings of LREC Int’l Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, pp. 43–48 (2010)


59. Chang, K.-H., Fisher, D., Canny, J.: Ammon: a speech analysis library for analyzing affect, stress & mental health on mobile phones. In: Proceedings of PhoneSense (2011)
60. Castellano, G., Kessous, L., Caridakis, G.: Emotion recognition through multiple modalities: face, body gesture, speech. In: Peter, C., Beale, R. (eds.) Affect and Emotion in Human-Computer Interaction, pp. 92–103. Springer, Heidelberg (2008)
61. Eyben, F., Wöllmer, M., Graves, A., Schuller, B., Douglas-Cowie, E., Cowie, R.: On-line emotion recognition in a 3-d activation-valence-time continuum using acoustic and linguistic cues. J. Multimodal User Interfaces. 3(1–2), 7–19 (2010)
62. Eyben, F., Wöllmer, M., Schuller, B.: Openear: introducing the Munich open-source emotion and affect recognition toolkit. In: 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops 2009, pp. 1–6. IEEE (2009)
63. Chetty, G., Wagner, M., Goecke, R.: A multilevel fusion approach for audiovisual emotion recognition. In: AVSP, pp. 115–120 (2008)
64. Zhang, S., Li, L., Zhao, Z.: Audio-visual emotion recognition based on facial expression and affective speech. In: Multimedia and Signal Processing, pp. 46–52. Springer (2012)
65. Paleari, M., Benmokhtar, R., Huet, B.: Evidence theory-based multimodal emotion recognition. In: International Conference on Multimedia Modeling, pp. 435–446 (2009)
66. Rahman, T., Busso, C.: A personalized emotion recognition system using an unsupervised feature adaptation scheme. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5117–5120. IEEE (2012)
67. Jin, Q., Li, C., Chen, S., Wu, H.: Speech emotion recognition with acoustic and lexical features. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015, pp. 4749–4753. IEEE (2015)
68. Metallinou, A., Lee, S., Narayanan, S.: Audio-visual emotion recognition using Gaussian mixture models for face and voice. In: 10th IEEE International Symposium on ISM 2008, pp. 250–257. IEEE (2008)
69. Rozgić, V., Ananthakrishnan, S., Saleem, S., Kumar, R., Prasad, R.: Ensemble of svm trees for multimodal emotion recognition. In: Signal & Information Processing Association Annual Summit and Conference (APSIPA ASC), pp. 1–4. IEEE (2012)
70. DeVault, D., Artstein, R., Benn, G., Dey, T., Fast, E., Gainer, A., Georgila, K., Gratch, J., Hartholt, A., Lhommet, M., et al.: Simsensei kiosk: a virtual human interviewer for healthcare decision support. In: Proceedings of the 2014 International Conference on Autonomous Agents and Multi-agent Systems, pp. 1061–1068 (2014)
71. Siddiquie, B., Chisholm, D., Divakaran, A.: Exploiting multimodal affect and semantics to identify politically persuasive web videos. In: Proceedings of the 2015 ACM on International Conference on Multimodal Interaction. ACM, pp. 203–210 (2015)
72. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)
73. Eyben, F., Wöllmer, M., Schuller, B.: Opensmile: the Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia. ACM, pp. 1459–1462 (2010)
74. Baltrušaitis, T., Robinson, P., Morency, L.-P.: 3d constrained local model for rigid and non-rigid facial tracking. In: Computer Vision and Pattern Recognition (CVPR), pp. 2610–2617. IEEE (2012)
75. Gers, F.: Long Short-Term Memory in Recurrent Neural Networks, Ph.D. thesis, Universität Hannover (2001)
76. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
77. Zhou, P., Shi, W., Tian, J., Qi, Z., Li, B., Hao, H., Xu, B.: Attention-based bidirectional long short-term memory networks for relation classification. In: The 54th Annual Meeting of the Association for Computational Linguistics, pp. 207–213 (2016)
78. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning & stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)


79. Zadeh, A., Zellers, R., Pincus, E., Morency, L.-P.: Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages. IEEE Intell. Syst. 31(6), 82–88 (2016)
80. Busso, C., Bulut, M., Lee, C.-C., Kazemzadeh, A., Mower, E., Kim, S., Chang, J.N., Lee, S., Narayanan, S.S.: Iemocap: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008)

Chapter 3

Multimodal Big Data Affective Analytics

Nusrat Jahan Shoumy, Li-minn Ang, and D. M. Motiur Rahaman

Abstract Big data has generated significant interest in various fields including healthcare, affective analytics, customer service, and satellite imaging. Among these fields, affective analytics has been a particularly interesting direction of research. Affective analytics refers to the automatic recognition of emotion; it aims to mine opinions, sentiments, and emotions toward different events, issues, services, or other such interests, based on observations of people's actions that can be captured from their writing, facial expressions, speech, movements, and so on. In the past, researchers focused on investigating a single modality in the form of text, speech, or facial images. However, with the advancement of computer processing power and the development of sophisticated sensors, multimodal approaches that provide more accurate and detailed results can now be used for emotion recognition. Affective analytics is important in Big data applications due to its numerous uses in streamlining products, services, etc. This chapter presents a review of existing work on Big data affective analytics. We also propose a multimodal automatic sentiment recognition approach for video, speech, and text data that can be implemented on Big databases, and we validate our approach using the YouTube dataset.

3.1

Introduction

Since the emergence and development of the Internet-based social media, the amount of user-generated content and opinioned data has experienced an incredible exponential growth. With the assistance of the current web applications, it has turned

N. J. Shoumy (*) · D. M. Motiur Rahaman School of Computing and Mathematics, Charles Sturt University, Wagga Wagga, NSW, Australia e-mail: [email protected]; [email protected] L.-m. Ang School of Information and Communication Technology, Grifﬁth University, Gold Coast, QLD, Australia e-mail: [email protected] © Springer Nature Switzerland AG 2019 K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_3

45

46

N. J. Shoumy et al.

out to be very easy to get limitless user emotions or sentiments or opinions on anything. Individuals utilizing the web are always welcomed to impart their insights and preferences with the rest of the world, which has prompted a blast of opinionated online journals, reviews of products and services, and remarks on anything virtually [1, 2]. This sort of online web content is increasingly perceived as a wellspring of information that has an included incentive for numerous application areas. Big data can be characterized in terms of three aspects: massive amount of collected data (volume), the data cannot be categorized into any regular database (variety), and the quick ﬂow of data that are generated and captured (velocity) [3]. Most of these data are contributed by web users, through social media such as Twitter and Facebook, blogs such as Tumblr and Blogger, online communities, and other such sites. These sites and mediums have been increasingly sharing information with the rest of the world. The information being shared includes a variety of topics related to everyday life areas such as health, commerce, education, and tourism. This opportunity to capture and save (in a database) the emotions or sentiment or opinions of the general public about social events, political movements, company strategies, marketing campaigns, and product preferences has raised growing interest both within the scientiﬁc community (leading to many exciting open challenges), as well as in the business world (due to the remarkable beneﬁts to be had from marketing and ﬁnancial market prediction). However, Big data are stored in a huge unstructured mass and sorting them out in a structural way (for use in research in this area and business purposes) is an extremely difﬁcult task for any machine. In line with this, new technologies have been developed to conduct Big data analytics on various applications. 
Some of these new techniques include Hadoop Distributed File Systems (HDFS) [4], Cloud technology [5], and Hive database [4]. One interesting area of research in the ﬁeld of Big data analytics is affective analysis. Affective analysis refers to user emotion or sentiment that can be captured from the general public about social events, political issues, marketing campaigns, and product preferences and then analyzed to predict the direction of user preference [6]. A survey of past research reveals that a huge amount of research has been done on affective analytics with single modality (unimodal), such as on the topic of sentiment analysis of text reviews [7] or emotion analysis of facial expressions [8]. There are also some available sentiment analysis tools [9–11], which can produce graphical pattern of emotions and feelings found in a set of data. Most of these available commercial tools consist of a very limited set of emotions features or modalities and hence have limitation to classify emotion properly. However, as one of the important features of Big data is the variety of data [3], it is important to focus on multimodality for data processing by including videos, audio, text, etc. The basic modalities and their methods of fusion for affective analytics, sentiment analysis, have been investigated and presented in this chapter. Most of the employed methods are unimodal and bimodal, with very few multimodal methods [12]. It also presents a general categorization of multimodal sentiment analysis and fusion techniques that have been developed so far. For all cases, the respective advances and disadvantages have been analyzed to ﬁnd out the possible research gaps in this

3 Multimodal Big Data Affective Analytics

area that need to be addressed in future research. The analysis shows that the main shortcoming of existing multimodal data analysis is efficiency, among other factors. Hence, a basic multimodal (sentiment) analysis method is proposed here and its suitability is assessed in terms of general performance. The model can also be generalized for further enhancement in the near future.

This chapter demonstrates a multimodal sentiment analysis method that combines three types of data: text, audio, and video. Combining these multimodal inputs gives a clearer picture of the emotion expressed by an individual. The approach considered in this investigation includes a combination of data processing, feature extraction, supervised classification, and polarity detection. The findings show that multimodal data are able to produce significantly better results than unimodal or bimodal data.

This chapter is organized in five sections. Section 3.2 reviews previous related research together with an analysis of the respective pros and cons. Section 3.3 presents the proposed multimodal affective analysis model in detail, including the development steps of data preprocessing and the neural network module. Section 3.4 presents the experiments, results, discussion, and comparative analysis. Finally, Sect. 3.5 concludes the chapter with some possible future research directions.

3.2

Related Works

This section reviews existing research on Big data affective analysis, covering sentiment and emotion detection with single and multiple modalities. The literature on automatic sentiment/emotion recognition in conjunction with Big data has grown dramatically in recent years, owing to advances in machine learning, computer vision, and text, visual, and speech analysis. However, Big data processing is still inefficient, and progress is needed in reducing computational time and memory requirements. For sentiment analysis, spontaneous emotional expressions pose a major challenge: they are difficult to collect because they are relatively rare and short-lived. Most studies of automatic emotion recognition are based on Ekman’s study [13] of six basic facial expressions (happiness, sadness, anger, fear, surprise, and disgust), which he indicates are universal across different races and cultures. Nevertheless, the range of human emotion is far more complex, and most of the emotional expressions that occur in human-to-human conversation are nonbasic emotions [14]. In addition, most current automatic emotion recognition approaches are unimodal, with the Big data being processed limited to either text mining [15–20], visual analysis [21–26], or speech analysis [27–30]. Little research has been done on Big data multimodal affect analysis. Therefore, this section reviews efforts toward both unimodal and multimodal Big data affective analytics, covering the individual modalities (text, image, audio, and video) and combinations of modalities.

3.2.1

N. J. Shoumy et al.

Text: Emotion and Sentiment Recognition from Textual Data

A large proportion of big social data is textual, as can be seen on platforms such as Facebook and Twitter. Affective analysis in the form of sentiment or emotion recognition from text has been studied since the 1950s, when an interest in understanding text originally developed [31].

Wang et al. [32] proposed a method of harnessing big social data from Twitter for automatic emotion identification. In their research, five million tweets were collected and categorized into seven emotion groups (joy, sadness, anger, love, fear, thankfulness, and surprise). To find effective features for emotion identification, a wide variety of features were explored, including n-grams, emotion lexicons, part-of-speech (POS) tags, and n-gram positions, using two machine learning algorithms: LIBLINEAR (A Library for Large Linear Classification) and Multinomial Naive Bayes. The best results were obtained by combining unigram, bigram, sentiment and emotion lexicon, and POS features, with a highest accuracy of about 65.57% using about two million tweets as training data.

Tedeschi et al. [33] proposed a Cloud-based Big data sentiment analysis application for brand monitoring through Twitter. They developed an application prototype named Social Brand Monitoring (SBM), based on a Client–Server architecture and written in Java. With Twitter as the input platform, any enterprise can use this tool to monitor its own brand and image, as well as those of its competitors, by analyzing user-generated data. The research reported favorable results from a practical application trial of SBM.

Qazi et al. [34] proposed assessing consumers’ opinions through online surveys. Their focus was on users’ sentiment toward multiple types of opinion and the effect of sentiment words on customer satisfaction. They collected data through a survey distributed via LinkedIn and university mail servers. They used expectancy disconfirmation theory (EDT), a set of seven hypotheses, confirmatory factor analysis (CFA), and structural equation modeling (SEM) to analyze the data and evaluate their research model. Based on the results, they concluded that regular, comparative, and suggestive opinions have a positive effect in raising users’ expectations. In addition, sentiment words were found to accurately predict user satisfaction with purchased items.

Lo et al. [35] proposed a sparse overlapping user lasso model for mining opinions from social reviews. They solved the resulting optimization problem using the Alternating Direction Method of Multipliers (ADMM) to perform sentiment classification and opinion identification on social reviews from various databases. The system performed well in terms of sentiment classification, opinion identification, and reduced running time. However, it has two drawbacks: (1) the system is not generalized and can only be used with limited input data, and (2) the iteration time is not optimal and needs to be reduced further.

Kherwa et al. [36] proposed mining opinions from popular social media sites using textual data subject to grammatical and lexicon-based constraints. The system classifies the extracted sentiments by polarity as either positive or negative. However, this polarity detection categorizes reviews too broadly to be useful for product manufacturers or governments in gauging response. Hence, the efficiency and/or accuracy of the system falls short of expectations.

Ha et al. [37] proposed a system in which the frequency of sentiment words in movie reviews was computed. A heatmap visualization was then used to discover the main emotions effectively, and finally a Sentiment-Movie Network combining a Multidimensional Scaling (MDS) map and a social network was formed. The research classified the movie review database according to 36 sentiment words in 7 emotion categories (happy, surprise, boring, sad, anger, disgust, and fear). A sentiment-movie network was constructed, and k-means clustering was applied to classify the cluster characteristics of each node (where each node is a feature on the network). The approach yielded faster reaction times; however, participants need to explore the nodes of a certain number of movies in order to infer the characteristics of the clusters. Hence, the system requires further enhancement.

Guzman et al. [38] proposed a system for analyzing app reviews. The textual review data of apps were mined to extract fine-grained and coarse-grained features, and an automated sentiment analysis technique was presented for aggregating the sentiment associated with a given feature. They reported a precision of about 0.59 and a recall of about 0.51 for their system. The extracted features were found to be coherent and relevant to requirements evolution tasks. This approach could help app developers to systematically analyze user opinions on single features and filter out irrelevant reviews. However, features that are not mentioned frequently are often not detected; hence, the system needs to be enhanced and generalized.

Pang et al. [39] proposed a method of classifying sentiment in movie reviews through machine learning. Three techniques, Naive Bayes, Maximum Entropy, and Support Vector Machines (SVM), were applied and compared in terms of performance. The performance of the machine learning techniques was acceptable for a single modality, with the best and worst results obtained by SVM and Naive Bayes, respectively. However, the approach needs further enhancement for better efficiency and acceptability. Socher et al. [40] proposed a recursive deep model trained on the Stanford Sentiment Treebank. They used deep learning to categorize text into positive or negative sentiment, with an accuracy of around 80.70%. However, the method needs to be generalized for wider acceptability. Luyckx et al. [41] proposed a thresholding approach to multi-label classification for emotion detection in suicide notes. They used the 2011 medical NLP challenge database with SVM to classify the text into positive and negative sentiments, with an accuracy of 86.40%.

SVM was used by Hasan et al. [42] and Sawakoshi et al. [43] to determine the sentiment found in tweets and customer travel reviews, respectively. Amazon review data were used by both Taboada et al. [44] and Hu et al. [45], with dictionary-based and deep learning methods, respectively. Finally, Li et al. [46] applied a Support Vector Regression (SVR) method to data from micro-blogging websites. However, none of the proposed systems based on machine learning (SVM/SVR) is generalized, and all need further enhancement in terms of both generalization and efficiency.

Table 3.1 summarizes the related work on the textual modality in terms of the modality variety (opinion mining or emotion mining), database used, classification technique, number of classified emotions, and outcome accuracy.
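The combination of n-gram features with Multinomial Naive Bayes that Wang et al. [32] explored can be illustrated with a minimal sketch. This is not the cited authors' implementation: the class `MultinomialNB`, the helper `ngram_features`, the whitespace tokenizer, the add-one smoothing, and the four toy sentences are all assumptions made for illustration, and only unigram and bigram features are included (no lexicon or POS features).

```python
import math
from collections import Counter, defaultdict


def ngram_features(text):
    """Return lowercase unigrams plus bigrams for one document."""
    tokens = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # feature counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            feats = ngram_features(text)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)    # log prior
            for feat in ngram_features(text):
                if feat in self.vocab:               # skip features unseen in training
                    score += math.log((counts[feat] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy data; the cited study used millions of labelled tweets.
train_texts = [
    "so happy with this great day",
    "what a great and joyful surprise",
    "i am sad and lonely today",
    "this is a sad terrible loss",
]
train_labels = ["joy", "joy", "sadness", "sadness"]
model = MultinomialNB().fit(train_texts, train_labels)
prediction = model.predict("a great day")
```

Working in log space avoids numerical underflow when many feature probabilities are multiplied, which matters once the vocabulary grows to the size of a real tweet corpus.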

3.2.2

Image: Emotion Recognition from Visual Image Data

Facial expression and visual data analysis have been under investigation since 1970; initially, however, only six basic emotions (anger, sadness, surprise, fear, disgust, and joy) were considered [47]. A seventh emotion, “contempt,” was later added to create a more rounded facial expression database [48]. An important development reported in [21] is the Facial Action Coding System (FACS), which codes facial expressions using action units of specific facial muscles. With its help, decoding facial expressions has become much simpler than before. In addition, owing to the popularity of social networks, images have become a convenient carrier of information among online users. Current approaches to emotion or sentiment analysis of visual Big social data include the use of low-level features [49, 50], facial expression detection [51], user intent, and attribute learning [52].

You et al. [53] proposed a method for non-facial image recognition in which a convolutional neural network was applied to images collected from Flickr and Twitter. They overcame the challenge of extracting image sentiment from noisy image samples in large databases. The model showed an average accuracy of 75%; however, a number of images were misclassified, especially image samples with low-level features. Zhao et al. [54] proposed a visual data classification method in which personalized emotion perceptions of images were handled using factors such as visual content, social context, temporal evolution, and location influence. Their method for image sentiment classification was Rolling Multi-Task Hypergraph learning (RMTHG), in which the classified features/factors were combined and a learning algorithm was designed for automatic feature optimization. This research also developed a large-scale Flickr image dataset for experimentation and testing. The proposed method outperformed previous baseline methods of feature extraction.
However, its limitations are the low accuracy with which negative emotions are modeled and the rather low overall precision. Guntuku et al. [55] proposed a system to model the personality traits of users from the collection of images each user likes on Flickr. This research used a machine learning approach to model users’ personality based on semantic features extracted from the images and achieved results with up to 15% improvement.

Table 3.1 Overview of textual sentiment and emotion analysis

| Research | Modality | Database | Classification technique | No. of emotions | Outcome |
|---|---|---|---|---|---|
| W. Wang et al. [32] | Text (Emotion Mining) | Tweets | Unigram, bigram, sentiment and emotion lexicon, and part-of-speech features with machine learning algorithms: LIBLINEAR and Multinomial Naive Bayes | 7 | Highest accuracy achieved: 65.57% |
| Tedeschi et al. [33] | Text (Opinion Mining) | Tweets | Social Brand Monitoring based on a Client–Server architecture developed in Java | 7 | – |
| Qazi et al. [34] | Text (Opinion Mining) | Surveys through LinkedIn and the university mail servers | Partial least squares (PLS) | 5 | – |
| W. Lo et al. [35] | Text (Opinion Mining) | Dianping, Douban, IMDB | Sparse overlapping user lasso (SOUL) model | 2 | Average accuracy: 70% |
| Kherwa et al. [36] | Text (Opinion Mining) | SentiWordNet | Multimap structure | 2 | – |
| H. Ha et al. [37] | Text (Opinion Mining) | Movie Review (NAVER) data | Multidimensional Scaling (MDS) Map | 7 | – |
| E. Guzman et al. [38] | Text (Emotion Mining) | User Review (Apple App Store and Google Play) | – | 5 | – |
| B. Pang et al. [39] | Text (Opinion Mining) | Movie Reviews | SVM | 2 | Accuracy: 86.40% |
| R. Socher et al. [40] | Text (Emotion Mining) | Stanford Sentiment TreeBank | Deep Learning | 2 | Accuracy: 80.70% |
| K. Luyckx et al. [41] | Text (Opinion Mining) | 2011 medical NLP challenge | SVM | 2 | Accuracy: 86.40% |
| M. Hasan et al. [42] | Text (Opinion Mining) | Tweets | SVM and KNN | 2 | – |
| Taboada et al. [44] | Text (Opinion Mining) | Amazon | Dictionary | 2 | – |
| Sawakoshi et al. [43] | Text (Emotion Mining) | Customer Travel Review | SVM | 2 | – |
| Hu et al. [45] | Text (Opinion Mining) | Client Reviews on TripAdvisor and Amazon | Deep Learning | 2 | Accuracy: 87.50% |
| W. Li et al. [46] | Text (Emotion Mining) | Micro-blogging website | Support Vector Regression | 2 | – |

An empirical study was conducted by Setchi et al. [56], followed by a proposed model to validate the idea that users’ experience can be directly linked to the specific images used (in the model) when dealing with a product. However, the linguistic resources and sentiment analysis techniques used were very basic, which is the limitation of this work. Hence, more advanced sentiment analysis techniques and linguistic resources need to be considered for further enhancement. Frome et al. [57] proposed a deep visual–semantic embedding model using a deep neural network. The model was trained to identify visual objects using both labeled image data and unannotated text. It performs quite well when trained and evaluated on flat 1-of-N metrics, and it can predict object category labels for unseen categories. However, it has some inaccuracies that need to be addressed in future. Wang et al. [58] focused on micro-expression recognition using local binary patterns on three orthogonal planes (LBP-TOP) in Tensor Independent Color Space (TICS). They used the Chinese Academy of Sciences Micro-Expression (CASME) databases to detect seven emotions (happiness, surprise, disgust, fear, sadness, repression, and tense) with a maximum accuracy of about 58.6366% in the TICS color space.

Table 3.2 summarizes the previous related work on the image modality in terms of the modality, database used, classification technique, number of classified emotions, and outcome accuracy.

Table 3.2 Overview of image sentiment and emotion analysis

| Research | Modality | Database | Classification technique | No. of emotions | Outcome |
|---|---|---|---|---|---|
| You et al. [53] | Image | Flickr images | Convolutional Neural Network | 2 | – |
| Zhao et al. [54] | Image | Flickr images | Hypergraph Learning | 2 | – |
| Guntuku et al. [55] | Image | Flickr images | Feature Selection Based Ordinal Regression | 2 | Accuracy: 80% (average) |
| Setchi et al. [56] | Image | Case study | Novel algorithm using image schemas and lexicon-based approach | 2 | – |
| Frome et al. [57] | Image | 1000-class ImageNet object recognition challenge | Deep visual–semantic embedding model | 4 | – |
| Wang et al. [58] | Image | Chinese Academy of Sciences Micro-Expression (CASME) and CASME 2 | LBP-TOP on Tensor Independent Color Space (TICS) | 7 | Best accuracy achieved: 58.6366% |
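The LBP-TOP descriptor mentioned for Wang et al. [58] builds on the basic local binary pattern (LBP), which can be sketched in a few lines. The code below is a simplified illustration rather than the cited method: the function names, the clockwise neighbour ordering, and the toy 3×3 patch are assumptions, every slice of the video cube is used (published LBP-TOP variants treat borders and normalization differently), and the TICS color transform is not included.

```python
def lbp_code(image, row, col):
    """8-bit local binary pattern of the interior pixel at (row, col)."""
    center = image[row][col]
    # Neighbours visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


def lbp_top(volume):
    """Concatenate LBP histograms from the XY, XT and YT planes.

    volume[t][y][x] is a grey-level video cube.
    """
    n_t, n_y, n_x = len(volume), len(volume[0]), len(volume[0][0])
    planes = {
        "XY": [volume[t] for t in range(n_t)],
        "XT": [[volume[t][y] for t in range(n_t)] for y in range(n_y)],
        "YT": [[[volume[t][y][x] for y in range(n_y)] for t in range(n_t)]
               for x in range(n_x)],
    }
    feature = []
    for name in ("XY", "XT", "YT"):
        hist = [0] * 256
        for plane in planes[name]:
            for code, count in enumerate(lbp_histogram(plane)):
                hist[code] += count
        feature.extend(hist)
    return feature


# Hypothetical 3x3 grey-level patch repeated over three frames.
patch = [[5, 9, 1],
         [4, 6, 7],
         [2, 3, 8]]
center_code = lbp_code(patch, 1, 1)
descriptor = lbp_top([patch, patch, patch])
```

The XY plane captures spatial texture, while the XT and YT planes capture how that texture changes over time, which is why the descriptor suits brief facial movements such as micro-expressions.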

3.2.3

Audio: Emotion and Sentiment Recognition from Speech-Based Data

Emotion recognition from speech was first conducted around the mid-1980s using the statistical properties of certain acoustic features [59]. Nowadays, the evolution of computer architectures makes it possible to implement more complex emotion recognition algorithms. The automatic speech recognition requirements of Big data affective analysis services can therefore now be addressed conveniently.

In recent research by Deb et al. [27], a method was proposed for speech emotion classification using vowel-like regions (VLRs) and non-vowel-like regions (non-VLRs), employing a region-switching-based classification technique. Three databases (EMODB [60], IEMOCAP [61], and FAU AIBO [62]) were used to test the method, with average accuracies of 85.1%, 64.2%, and 45.2%, respectively. However, the variety of features was not sufficient to improve the accuracy of the system over other similar research. Sawata et al. [29] proposed using kernel discriminative locality preserving canonical correlation analysis (KDLPCCA)-based correlation with electroencephalogram (EEG) features for favorite music classification. The average accuracy of this system was reported as around 81.4%. However, improvements are needed in the selection of music and participants. Mairesse et al. [63] conducted an experiment on short spoken reviews that were collected manually and processed with the openEAR/openSMILE toolkit. The reviews were categorized as positive or negative with an accuracy of 72.9%. Caridakis et al. [64] proposed a multimodal emotion recognition system that focuses on faces, body gestures, and speech. Data from the HUMAINE EU-IST project [65] were used in conjunction with a Bayesian classifier to obtain an average accuracy of 80%. Deng et al. [66] proposed a method for speech emotion recognition using sparse autoencoder-based feature transfer learning. The FAU AIBO database [67] was used with an artificial neural network (ANN) module to categorize sentiment as positive or negative. All of these methods have their respective drawbacks and need further enhancement.

Table 3.3 summarizes the previous related work on the audio modality in terms of the modality, database used, classification technique, number of classified emotions, and outcome accuracy.
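The idea of describing an utterance by statistical properties of frame-level acoustic features can be sketched as follows. The choice of features (short-time energy and zero-crossing rate), the frame and hop lengths, the function names, and the synthetic sine tone are all assumptions made for illustration; they are not taken from [59] or from any toolkit cited above.

```python
import math
import statistics


def frame_signal(samples, frame_len, hop):
    """Split a signal into overlapping frames of equal length."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def utterance_features(samples, frame_len=400, hop=160):
    """Summarize frame-level features by their mean and standard deviation."""
    frames = frame_signal(samples, frame_len, hop)
    energies = [short_time_energy(f) for f in frames]
    zcrs = [zero_crossing_rate(f) for f in frames]
    return {
        "energy_mean": statistics.fmean(energies),
        "energy_std": statistics.pstdev(energies),
        "zcr_mean": statistics.fmean(zcrs),
        "zcr_std": statistics.pstdev(zcrs),
    }


# Synthetic stand-in for speech: one second of a 200 Hz tone at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate)
        for n in range(sample_rate)]
features = utterance_features(tone)
```

The resulting fixed-length dictionary can be turned into a feature vector for any of the classifiers reviewed in this section, regardless of how long the original recording was.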

3.2.4

Video: Emotion and Sentiment Recognition from Video-Based Data

Social media users often share text messages with accompanying images or video, which contribute additional information about the user’s sentiment. The amount of video data collected over time through video platforms such as YouTube or Facebook creates Big databases that may be exploited for user sentiment/emotion analysis. Compared to still images, video sequences also provide more information about how objects and scenes change over time and therefore allow more reliable emotion/sentiment recognition.

Table 3.3 Overview of audio sentiment and emotion analysis

| Research | Modality | Database | Classification technique | No. of emotions | Outcome |
|---|---|---|---|---|---|
| Deb et al. [27] | Audio | EMODB, IEMOCAP, and FAU AIBO database | Region switching based method | 2 | Accuracy: 85.1% |
| Sawata et al. [29] | Audio | Multiple Features Database (MFD) | Kernel Discriminative Locality Preserving Canonical Correlation Analysis (KDLPCCA) | 2 | Accuracy: 81.4% |
| Mairesse et al. [63] | Audio | Spoken Reviews (manual collection) | OpenEAR/openSMILE toolkit | 2 | Accuracy: 72.9% |
| Caridakis et al. [64] | Audio | HUMAINE EU-IST project | Bayesian classifier | 8 | Accuracy: 80% (average) |
| Deng et al. [66] | Audio | FAU AIBO database | Sparse autoencoder (ANN) | 2 | – |

Rangaswamy et al. [68] proposed the extraction and classification of YouTube videos, in which certain aspects of metadata were retrieved from the video content. The video datasets were then categorized as positive, negative, or neutral. However, the dataset was quite small for a comprehensive study. In recent work by Gupta et al. [69], a database of 6.5 million video clips of labeled facial expressions and 2777 videos labeled for seven emotions was created. The videos were classified using a semi-supervised spatiotemporal Convolutional Neural Network (CNN). This was done by combining the CNN with an autoencoder and a classification loss function, which were then trained in parallel. Finally, the videos were classified according to seven emotions. However, further research is needed to make this method more robust by taking into account video transition boundaries and mixed facial expressions, along with other features. In another recent work, Oveneke et al. [22] proposed a framework for continuously estimating the human affective state using a Kalman filter as the estimator and a multiple instance sparse Gaussian process as the sensor model. It is an automated system based on Bayesian filtering that estimates human affective states from an incoming stream of image sequences. However, there is room for further enhancement of the system toward reliable affective state estimation. Chen et al. [23] proposed multiple feature fusion for video emotion analysis on the Cohn–Kanade (CK+) and Acted Facial Expressions in the Wild (AFEW) 4.0 databases. The system used Histogram of Oriented Gradients from Three Orthogonal Planes (HOG-TOP) and a multiple kernel support vector machine (SVM) to obtain an accuracy of 89.6%.

Soleymani et al. [70] proposed a technique for video classification using electroencephalogram (EEG) signals, pupillary response, and gaze distance. Three affective labels (unpleasant, neutral, and pleasant) were determined by classifying bodily responses using an SVM classifier with a radial basis function (RBF) kernel. Xu et al. [71] proposed a method of heterogeneous knowledge transfer for video emotion recognition using YouTube video datasets. The videos were classified into eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) using SVM and a convolutional neural network (CNN). In recent research by Zhu et al. [24], a depression diagnosis method based on deep neural networks was proposed. It used the AVEC2013 and AVEC2014 databases in conjunction with deep convolutional neural networks (DCNN) to classify emotions into different ranges of depression severity. Another recent paper by Kaya et al. [72] proposed video-based emotion recognition using deep transfer learning and score fusion. This research used the Emotion Recognition in the Wild (EmotiW) 2015/2016 datasets with DCNN to distinguish seven emotion classes (angry, disgust, fear, happy, neutral, sad, and surprise) and achieved an accuracy of 52.11%.

Table 3.4 summarizes the previous related research on the video modality in terms of the modality, database used, classification technique, number of classified emotions, and outcome accuracy.
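The role of a Kalman filter as an estimator of a continuously changing affective state, as in the framework of Oveneke et al. [22], can be illustrated with a scalar filter. This sketch is not the cited system: it assumes a random-walk state model, replaces the Gaussian-process sensor model with a list of noisy per-frame valence scores, and uses hypothetical variance values and function names.

```python
def kalman_smooth(observations, process_var=1e-3, obs_var=0.25,
                  init_mean=0.0, init_var=1.0):
    """Filter noisy scalar observations with a random-walk Kalman filter."""
    mean, var = init_mean, init_var
    estimates = []
    for z in observations:
        # Predict: the random-walk model keeps the mean and adds uncertainty.
        var += process_var
        # Update: blend prediction and observation using the Kalman gain.
        gain = var / (var + obs_var)
        mean += gain * (z - mean)
        var *= 1.0 - gain
        estimates.append(mean)
    return estimates


# Hypothetical per-frame valence scores: true value 0.6 with alternating
# errors of +0.3 and -0.3 standing in for sensor noise.
noisy_valence = [0.6 + 0.3 * (-1) ** i for i in range(50)]
smoothed = kalman_smooth(noisy_valence)
```

A larger `obs_var` makes the filter trust individual frames less and smooth more, while a larger `process_var` lets the estimate follow genuine changes in affect more quickly.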

3.2.5

Sentiment Recognition from a Combination of Modalities

In the field of affective/sentiment analysis within the Big data domain, information such as service and product reviews and social media content is gradually shifting from a single modality (unimodal) to multiple modalities. Researchers understandably find it increasingly difficult to keep up with this deluge of multimodal content; hence, it is essential to organize it. A recently developed method for processing such multimodal Big data is multimodal/hybrid fusion [73], which is the integration of multiple media, their associated features, or the intermediate decisions in order to perform a certain analytical task. Existing research on multimodal Big data fusion can be categorized by the classification method used and by the type of fusion (feature, decision, or hybrid). Feature-level fusion [74], the most widely used method, fuses the information extracted at the feature level. Decision-level fusion [75] fuses multiple modalities in the semantic space. Combining feature-level and decision-level fusion produces the hybrid-level fusion approach [76].

Table 3.4 Overview of video sentiment and emotion analysis

| Research | Modality | Database | Classification technique | No. of emotions | Outcome |
|---|---|---|---|---|---|
| Rangaswamy et al. [68] | Video | YouTube dataset | Metadata extraction using source code developed in Python | 3 | – |
| Gupta et al. [69] | Video | Cohn–Kanade dataset, MMI dataset | NN | 7 | Accuracy: 94.18% and 66.15% for the respective databases |
| Oveneke et al. [22] | Video | Audio–Visual Emotion Challenge (AVEC 2012) and AVEC 2014 | Kalman filtering (KF) with multiple instance sparse Gaussian process (MI-SGP) | 2 | – |
| Soleymani et al. [70] | Video | Case Study | SVM | 3 | – |
| Xu et al. [71] | Video | YouTube emotion datasets | SVM and CNN | 8 | – |
| Zhu et al. [24] | Video | AVEC2013 and AVEC2014 | NN | 2 | – |
| Kaya et al. [72] | Video | EmotiW 2015 and EmotiW 2016 | CNN | 7 | Accuracy: 52.11% |

Nicolaou et al. [77] proposed a hybrid output-associative fusion method for dimensional and continuous prediction of emotions in valence and arousal space by integrating facial expression, shoulder gesture, and audio cues. They collected data from the Sensitive Artificial Listener Database (SAL-DB) for emotion prediction and used Bidirectional Long Short-Term Memory Neural Networks (BLSTM-NNs) and Support Vector Regression (SVR) for classification. They showed that BLSTM-NNs outperform SVR and that the proposed hybrid output-associative fusion performs significantly better than the individual feature-level and model-level fusion. In other research, Nicolaou et al. [78] proposed an audiovisual emotion classification method using model-level fusion in the likelihood space by integrating facial expression, shoulder, and audio cues. They collected data from the Sensitive Artificial Listener (SAL) database and used Maximum Likelihood Classification, Hidden Markov Models (HMM), and Support Vector Machines (SVM) for emotion classification, with a classification accuracy of around 94.01%. Kanluan et al. [79] proposed a method for estimating spontaneously expressed emotions in audiovisual data. This research used the audiovisual Vera am Mittag (VAM) corpus, recorded from the German TV talk show of the same name, to describe sentiment in terms of three emotion primitives: (1) valence (from negative to positive), (2) activation (from calm to excited), and (3) dominance (from weak to strong). Support Vector Regression (SVR) and decision-level fusion were used to achieve an average performance gain of 17.6% and 12.7% over the individual audio and visual emotion estimation, respectively. Chen et al. [23] proposed multiple feature fusion for video emotion analysis on the Cohn–Kanade (CK+) and Acted Facial Expressions in the Wild (AFEW) 4.0 databases. The system used Histogram of Oriented Gradients from Three Orthogonal Planes (HOG-TOP) and a multiple kernel Support Vector Machine (SVM) to obtain an accuracy of 89.6%.

Poria et al. [80–82] conducted multimodal data analysis of text, audio, and video using the YouTube dataset, the Multimodal Opinion Utterances Dataset (MOUD), the USC IEMOCAP database, the International Survey of Emotion Antecedents and Reactions (ISEAR) dataset, and the CK++ dataset. A combination of feature- and decision-level fusion was used in their research. For emotion classification, they used a Deep Convolutional Neural Network (DCNN) based on multiple kernel learning, distributed time-delayed dependence using DCNN with multiple kernel learning, k-Nearest Neighbor (KNN), Artificial Neural Network (ANN), Extreme Learning Machine (ELM), and Support Vector Machine (SVM). Perez-Rosas et al. [83] proposed a method of speech, visual, and text analysis to identify the sentiment expressed in video reviews. They used the Multimodal Opinion Utterances Dataset (MOUD) and feature-level fusion of data in their research. For sentiment classification, they used an SVM classifier with tenfold cross-validation, obtaining an accuracy of around 74.66%. Paleari et al. [84] proposed an architecture that extracts affective information from an audio–video database and attaches the obtained semantic information to the data. The eNTERFACE database was used, and features were extracted through Semantic Affect-enhanced MultiMedia Indexing (SAMMI). The features were classified by neural networks (NN) and SVM together with feature- and decision-level fusion. The average recognition rates of feature fusion and decision fusion were around 35% and 40%, respectively. Mansoorizadeh et al. [85] proposed multimodal emotion recognition from facial expression and speech using an asynchronous feature-level fusion approach that creates a unified hybrid feature space out of the individual signal measurements. The TMU-EMODB database and the eNTERFACE database were used in their research. The audio and video features were extracted with principal component analysis (PCA) and linear discriminant analysis (LDA), respectively. An adaptive classifier together with a Kalman filter state estimator was used for classification. The model used both feature- and decision-level fusion. Feature fusion and decision fusion yielded accuracies of 75% and 65%, respectively, on the TMU-EMODB database, and 70% and 38%, respectively, on the eNTERFACE database.

So far, all the methods proposed or developed for processing multimodal Big data with multimodal fusion toward affective analysis and emotion prediction have their own drawbacks in terms of reliability, accuracy, efficiency, presentation, etc. Another major challenge in handling Big multimodal data is reducing computational time and memory consumption, which is not addressed in most of the research works. Hence, there is room for further research and enhancement in this area. Table 3.5 compares the related research on multimodal sentiment analysis in terms of the modality variety, database used, fusion technique, classification technique, number of classified emotions, and outcome.
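The three fusion types described in this section (feature, decision, and hybrid) can be contrasted in a minimal sketch. All function names, weights, feature values, and class scores below are hypothetical; in particular, treating the output of an early-fusion classifier as one more vote in late fusion is only one possible reading of hybrid fusion, not the scheme of any specific cited work.

```python
def concat_features(*modalities):
    """Feature-level (early) fusion: join per-modality vectors into one."""
    fused = []
    for vector in modalities:
        fused.extend(vector)
    return fused


def fuse_decisions(score_dicts, weights):
    """Decision-level (late) fusion: weighted average of class scores."""
    total = sum(weights)
    labels = score_dicts[0].keys()
    return {
        label: sum(w * s[label] for w, s in zip(weights, score_dicts)) / total
        for label in labels
    }


def hybrid_fusion(fused_scores, unimodal_scores, weights):
    """Hybrid fusion: add the early-fusion classifier's scores as a vote."""
    return fuse_decisions([fused_scores] + unimodal_scores, weights)


# Hypothetical per-modality feature vectors for one video segment.
text_vec, audio_vec, video_vec = [0.2, 0.9], [0.5], [0.1, 0.4]
fused_vector = concat_features(text_vec, audio_vec, video_vec)

# Hypothetical class scores from three unimodal classifiers.
text_scores = {"positive": 0.7, "negative": 0.3}
audio_scores = {"positive": 0.4, "negative": 0.6}
video_scores = {"positive": 0.6, "negative": 0.4}
late = fuse_decisions([text_scores, audio_scores, video_scores], [1, 1, 1])

# Hypothetical scores from a classifier trained on the fused vector.
early_scores = {"positive": 0.8, "negative": 0.2}
hybrid = hybrid_fusion(early_scores,
                       [text_scores, audio_scores, video_scores],
                       [2, 1, 1, 1])
label = max(hybrid, key=hybrid.get)
```

Feature-level fusion lets a single classifier learn cross-modal interactions but requires aligned features, whereas decision-level fusion tolerates missing or asynchronous modalities at the cost of ignoring those interactions.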

3.3

Proposed Big Multimodal Affective Analytics System

This section proposes a multimodal affective analytics system for Big data analysis. The system classifies videos from a known database into the sentiment conveyed by the person in the video, using multiple modalities: text, audio, and video. Traditional approaches to Big data processing are inefficient and impractical, with high computational time and memory requirements. Machine learning methods, with their strong mathematical background and accurate information extraction, are a promising technique for handling such data [86]. In particular, feature extraction and dimension reduction techniques such as principal component analysis (PCA), which can deal with high-dimensional data, are well suited to processing Big data [87]. A general framework for processing Big data using PCA is shown in Fig. 3.1. In line with the description of Big data in terms of the 3Vs (volume, variety, velocity), volume and variety are both present in YouTube data, since millions of videos with different content are posted on the YouTube platform every day. Therefore, a suitable YouTube database [88] is chosen for the proposed system model. It contains multimodal data in the form of video, audio, and text extracted from 47 videos from the YouTube social media website. The visual and audio features are extracted with dedicated feature extraction software (OKAO Vision and OpenEAR, respectively), the textual features are extracted with the MPQA opinion corpus, and all of them are combined to form a final fused feature vector. Then, supervised classifiers are applied to the fused vector to obtain the sentiments of the video segments. The system overview is shown in Fig. 3.2 and described in the following subsections.
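The PCA step of the general framework can be sketched as follows. This is a minimal illustration in Python with NumPy, not the MATLAB implementation used in the chapter; the function name `pca_reduce`, the random stand-in data, and the choice of 20 components are assumptions made only for the example.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    # Centre each feature so the components describe variance, not the mean.
    X_centered = X - X.mean(axis=0)
    # SVD of the centred data: the rows of Vt are the principal directions,
    # ordered by decreasing singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of the total variance captured by each component.
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return X_centered @ components.T, explained[:n_components]

# Hypothetical stand-in for a high-dimensional feature matrix:
# 276 segments (rows) by 500 features (columns). Real features would replace it.
rng = np.random.default_rng(0)
X = rng.normal(size=(276, 500))
X_reduced, explained = pca_reduce(X, n_components=20)
print(X_reduced.shape)  # (276, 20)
```

The reduced matrix, rather than the raw high-dimensional data, would then be passed to the classifier, which is the point of the framework: fewer dimensions mean less computation and memory per sample.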

Fig. 3.1 General framework of PCA analysis carried out on Big data
[Figure not reproduced. Blocks: Very High Dimensional Raw Data; PCA Analysis (Feature Extraction, Dimension Reduction); Classifier.]

Fig. 3.2 Overview of multimodal affective analytics system
[Figure not reproduced. Blocks: Big Data Analysis; Big Multimodal Data; YouTube Platform; Data Processing (PCA); Data Pre-processing (raw video database; videos preprocessed and segmented); Feature Extraction Module (visual data: OKAO Vision, facial features vector extraction; audio data: OpenEAR, audio features vector extraction; textual data, obtained by transcription: MPQA opinion corpus, textual features vector extraction); Fusion Module (combined extracted features vector); Classifier Module (supervised classifier); Sentiment Detection (Polarity); Outcome.]

Table 3.5 Overview of multimodal sentiment analysis

Research: Nicolaou et al. [77]
Modality: Audio, video
Database: Sensitive Artificial Listener Database
Fusion: Feature, decision, output-associative fusion
Technique: SVR and BLSTM-NNs
No. of emotions: 2
Outcome: Over leave-one-sequence-out cross-validation, the best result was attained by fusion of face, shoulder, and audio cues: RMSE = 0.15 and COR = 0.796 for valence, and RMSE = 0.21 and COR = 0.642 for arousal

Research: Nicolaou et al. [78]
Modality: Audio, video
Database: Sensitive Artificial Listener Database
Fusion: Model
Technique: HMM and Likelihood Space via SVM
No. of emotions: 2
Outcome: Over tenfold cross-validation, the best mono-cue result was approximately 91.76%, from facial expressions. The best fusion result was around 94%, obtained by fusing facial expressions, shoulder, and audio cues together

Research: Kanluan et al. [79]
Modality: Audio, video
Database: VAM corpus recorded from the German TV talk show Vera am Mittag
Fusion: Model
Technique: Support Vector Regression for three continuous dimensions
No. of emotions: 3
Outcome: Average estimation errors of the acoustic and visual modalities were 17.6% and 12.7%, respectively; the respective correlation between the prediction and ground truth was increased by 12.3% and 9.0%

Research: Chen et al. [23]
Modality: Audio, video
Database: Extended Cohn–Kanade dataset, GEMEP-FERA 2011 dataset, and the Acted Facial Expression in Wild (AFEW) 4.0 dataset
Fusion: Feature
Technique: Histograms of Oriented Gradients (HOG) extended to temporal Three Orthogonal Planes (TOP), inspired by a temporal extension of Local Binary Patterns, LBP-TOP
No. of emotions: 7
Outcome: Overall classification accuracies obtained using HOG-TOP on the CK+, GEMEP-FERA 2011, and AFEW 4.0 databases were about 89.6%, 54.2%, and 35.8%, respectively, while the respective accuracies for LBP-TOP were around 89.3%, 53.6%, and 30.6%

Research: Poria et al. [80]
Modality: Text, audio, video
Database: YouTube Dataset
Fusion: Feature and decision
Technique: Deep convolutional neural network (CNN) based on multiple kernel learning
No. of emotions: 2
Outcome: Accuracies of each modality separately, with feature-level fusion, and with decision-level fusion were 87.89%, 88.60%, and 86.27%, respectively

Research: Poria et al. [81]
Modality: Text, audio, video
Database: Multimodal Opinion Utterances Dataset (MOUD) and USC IEMOCAP database
Fusion: Feature
Technique: Distributed time-delayed dependence using deep convolutional neural network (CNN) with multiple kernel learning
No. of emotions: 4
Outcome: Accuracies obtained for the four emotions angry, happy, sad, and neutral were approximately 79.20%, 72.22%, 75.63%, and 80.35%, respectively

Research: Poria et al. [82]
Modality: Text, audio, video
Database: International Survey of Emotion Antecedents and Reactions (ISEAR) dataset, CK++ dataset, and the eNTERFACE dataset
Fusion: Feature
Technique: k-Nearest Neighbor (KNN), Artificial Neural Network (ANN), Extreme Learning Machine (ELM), and Support Vector Machine (SVM)
No. of emotions: 7
Outcome: The overall accuracy was around 87.95%, which outperformed the best state-of-the-art system by more than 10%, or in relative terms, a 56% reduction in error rate

Research: Perez-Rosas et al. [83]
Modality: Text, audio, video
Database: Multimodal Opinion Utterances Dataset (MOUD)
Fusion: Feature
Technique: SVM classifier in tenfold cross-validation
No. of emotions: 2
Outcome: The accuracy of multiple modalities with feature-level fusion was approximately 74.66%, with error rate reductions of up to 10.5% compared to the use of a single modality at a time

Research: Paleari et al. [84]
Modality: Audio, video
Database: The eNTERFACE database
Fusion: Feature and decision
Technique: Extraction of features through Semantic Affect-enhanced MultiMedia Indexing (SAMMI), classified by neural networks (NN) and support vector machines (SVM)
No. of emotions: 6
Outcome: The average recognition rate of feature fusion and decision fusion was around 35% and 40%, respectively

Research: Mansoorizadeh et al. [85]
Modality: Audio, video
Database: The TMU-EMODB database and eNTERFACE database
Fusion: Feature and decision
Technique: Principal component analysis (PCA) and linear discriminant analysis (LDA) used for feature extraction. Adaptive classifier together with a Kalman filter state estimator used for classification
No. of emotions: 6
Outcome: The feature fusion and decision fusion for the TMU-EMODB database yielded an accuracy of about 75% and 65%, respectively, whereas the accuracy for the eNTERFACE database was nearly 70% and 38%, respectively

3.3.1

Data Preprocessing

The considered YouTube dataset contains 47 randomly chosen videos covering a wide range of topics, from politics to product reviews. It includes 20 female and 27 male speakers of various ages and ethnic backgrounds. Each video contains a single speaker giving their opinions on a particular subject or product in English while facing the camera for most of the time. The length of each video clip ranges from 2 to 5 min. The videos were first converted to mp4 format with a standard size of 360 × 480 pixel resolution. They were then preprocessed by removing the first 30 s, which mostly contained introductory titles and other nonrelevant material. Each video was then manually segmented according to the spoken utterances, and each utterance was annotated with its sentiment (0 for neutral utterances, −1 for negative utterances, and +1 for positive utterances). Finally, using a MATLAB function, each video was converted to image frames according to its frame rate. The audio track was also extracted from the video for separate processing. To obtain the textual data, the spoken words in each video were transcribed manually, and transcription software was used to extract the utterances from the audio track; each video contained 3–11 utterances. Each video was then annotated with a sentiment label: positive, negative, or neutral. This annotation was carried out manually by three annotators with the goal of associating each video with the sentiment label that best summarizes the opinion expressed in it. Finally, based on majority voting among the annotators, 13 of the 47 video clips were labeled as positive, 22 as neutral, and 12 as negative.
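The majority-vote labeling described above can be sketched in a few lines of Python. The video identifiers and votes below are invented for illustration, and the handling of a three-way tie (returning `None`) is an assumption, since the chapter does not say how such cases were resolved.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by most annotators, or None on a tie."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no majority; tie handling is not specified in the source
    return counts[0][0]

# Hypothetical annotations: three annotators per video, labels in {-1, 0, +1}.
annotations = {
    "video_01": [1, 1, 0],
    "video_02": [-1, -1, -1],
    "video_03": [0, 1, 0],
}
video_labels = {vid: majority_label(v) for vid, v in annotations.items()}
```

With three annotators and three classes, a tie can only occur when all three disagree, so most videos receive a label directly.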

3.3.2

Feature Extraction

The software OKAO Vision [11] was used for facial feature extraction. OKAO was chosen because, for each image frame, it can detect the face and the facial features and extrapolate some basic facial expressions as well as the eye gaze direction. One of the main facial expressions recognized by this software is the smile. The following features were extracted from each image frame of a video clip:

1. Four points that define the rectangle where the face is, and the confidence of the four points
2. Facial pose (left face pose, frontal face, right face pose)
3. Thirty-eight facial feature point coordinates (x, y) with their respective confidence
4. Whether the face is up or down, left or right, or rolled, and the confidence
5. Whether the eye gaze is up or down, left or right, and the confidence
6. The openness of the left eye, right eye, and mouth, and the corresponding confidences
7. The smile level

Therefore, in total, 134 features (5 + 1 + 38 × 3 + 4 + 3 + 6 + 1 = 134) were obtained from each frame of a video clip. These features were normalized by the total number of frames during each utterance.

For audio feature extraction, the open source software OpenEAR [10] was used for speech recognition and processing. This software was chosen for its ability to automatically compute the pitch and voice intensity from an audio data file. In total, 6373 features were extracted from the audio data files, including Spectral Centroid, Spectral Flux, Strongest Beat, Beat Sum, Voice Quality, and Pause Duration.

For textual feature extraction, the MPQA opinion corpus [89] was used to automatically identify the linguistic cues of the sentiment present in the transcribed text. MPQA has two lexicons of words labeled as “positive” and “negative.” These words were identified with their respective polarities in the textual data. A lexicon of valence shifters was used in cases where the polarity of a word changes. For example, in the phrase “not good,” “good” has a positive polarity; however, “not” is a valence shifter and therefore changes the polarity of “good” from positive to negative. Hence, using these lexicons, the total polarity of the text in each utterance was calculated.
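The lexicon-based polarity calculation can be sketched as below. The three word sets are tiny invented stand-ins for the MPQA lexicons and the valence-shifter lexicon, and the rule that a shifter flips only the word immediately after it is an assumption made for this example.

```python
# Toy lexicons invented for illustration; the chapter uses the MPQA lexicons.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "boring", "hate"}
VALENCE_SHIFTERS = {"not", "never", "hardly"}

def utterance_polarity(text):
    """Sum word polarities, flipping a word's sign if a shifter precedes it."""
    total = 0
    flip = False
    for word in text.lower().split():
        token = word.strip(".,!?;:\"'")
        if token in VALENCE_SHIFTERS:
            flip = True
            continue
        score = 1 if token in POSITIVE else -1 if token in NEGATIVE else 0
        if score != 0:
            total += -score if flip else score
        # Assumption: a shifter only affects the very next word.
        flip = False
    return total

print(utterance_polarity("not good"))  # -1
```

A positive total suggests a positive utterance, a negative total a negative one, and zero a neutral one; how the chapter thresholds the total is not stated, so that mapping is left out of the sketch.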

3.3.3

Data Fusion Module

The three feature vectors (video, audio, and text) were then fused using the two main fusion techniques: feature-level fusion and decision-level fusion. In feature-level fusion, discriminant correlation analysis (DCA) [90] was used to fuse the three feature vectors into a single long feature vector. This final feature vector was then used to classify each video segment into its respective polarity. In decision-level fusion, the unimodal feature vectors were obtained first and classified separately into their respective sentiment classes. The separate sentiment scores were then combined to assign a single sentiment polarity.
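The difference between the two fusion levels can be illustrated with a short Python sketch. Note the simplifications: plain concatenation stands in for DCA, which additionally transforms the vectors before joining them, and score averaging is only one possible decision rule, since the chapter does not specify how the separate scores were combined. All numbers are invented.

```python
def feature_level_fusion(text_vec, audio_vec, video_vec):
    """Join the unimodal vectors into one long vector (plain concatenation,
    used here as a simplified stand-in for DCA)."""
    return list(text_vec) + list(audio_vec) + list(video_vec)

def decision_level_fusion(scores_per_modality):
    """Average the per-class scores of separately trained unimodal
    classifiers and return the winning class label (assumed rule)."""
    labels = scores_per_modality[0].keys()
    n = len(scores_per_modality)
    mean_scores = {
        label: sum(s[label] for s in scores_per_modality) / n for label in labels
    }
    return max(mean_scores, key=mean_scores.get)

fused = feature_level_fusion([0.2], [0.5, 0.1], [0.9, 0.3, 0.7])
decision = decision_level_fusion([
    {"negative": 0.2, "neutral": 0.3, "positive": 0.5},  # text classifier
    {"negative": 0.4, "neutral": 0.4, "positive": 0.2},  # audio classifier
    {"negative": 0.1, "neutral": 0.2, "positive": 0.7},  # video classifier
])
```

In feature-level fusion a single classifier sees all modalities at once, whereas in decision-level fusion each modality is judged on its own and only the verdicts are merged.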

3.3.4

Classiﬁer Module

For classification, three supervised classifiers, Artificial Neural Network (ANN), Extreme Learning Machine (ELM), and Hidden Markov Model (HMM), were developed using MATLAB functions, following the development procedure and steps described in [82, 88]. These classifiers were chosen because they showed good accuracy and recall time when obtaining sentiment from multimodal data using fused multiple feature vectors in [82, 88].
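For readers unfamiliar with ELM, the idea can be sketched in Python with NumPy: the hidden-layer weights are drawn at random and left untrained, and only the output weights are fitted, in closed form, by least squares. This is a generic textbook-style sketch, not the MATLAB classifier developed for the chapter; the class name, the tanh activation, the 50 hidden units, and the synthetic data are all assumptions.

```python
import numpy as np

class ExtremeLearningMachine:
    """Single-hidden-layer network with random, untrained hidden weights."""

    def __init__(self, n_hidden=50, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        return np.tanh(X @ self.W + self.b)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Hidden weights and biases are drawn once and never updated.
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        # One-hot targets, one column per class.
        T = (y[:, None] == self.classes_[None, :]).astype(float)
        # Output weights solve the least-squares problem H @ beta = T.
        self.beta = np.linalg.pinv(self._hidden(X)) @ T
        return self

    def predict(self, X):
        return self.classes_[np.argmax(self._hidden(X) @ self.beta, axis=1)]

# Synthetic three-class data (negative, neutral, positive) for illustration.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(30, 2)) for c in centers])
y = np.repeat(np.array([-1, 0, 1]), 30)
model = ExtremeLearningMachine().fit(X, y)
train_accuracy = np.mean(model.predict(X) == y)
```

Because training reduces to one pseudo-inverse, fitting is fast compared with iterative backpropagation, which is one commonly cited reason for choosing ELM on large feature sets.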

3.4

Simulation

The sentiment classification was executed on the YouTube dataset introduced in Sect. 3.3.1. The 47 videos in the dataset were preprocessed and segmented into 276 short video segments, one per utterance. Of these segments, 84 were labeled as positive, 109 as negative, and 83 as neutral. The textual, audio, and video features were then extracted from each segment as described in Sect. 3.3.2 and combined through DCA feature-level fusion. Through this approach, the feature vectors from all three modalities were combined into a single feature vector. This resulted in one feature vector per video segment, which is used to determine the sentiment orientation of that particular segment. After that, several comparative experiments were executed as follows:

1. With each single modality (text, audio, and video separately)
2. With bimodalities (text and audio, text and video, audio and video)
3. With multimodality, all three modalities integrated together (text, audio, and video)

For all three cases, the features extracted in Sect. 3.3.2 were fused (as per the procedure in Sect. 3.3.3) to form a single long feature vector, which was used as input to the MATLAB functions for simulation. Leave-one-out testing was performed, where one video (4 video segments) was left out for testing and the remaining 46 videos (272 video segments) were used for training and validation. The set of 272 segments was used to run the MATLAB simulation for each of the three classifiers (ANN, HMM, and ELM), with cross-validation performed for each of them.
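The leave-one-video-out protocol described above can be sketched in Python as a generator of index splits. The segment-to-video mapping below is a small invented example; in the chapter's setting there would be 47 video identifiers spread over 276 segments.

```python
def leave_one_video_out(segment_video_ids):
    """Yield (held_out_video, train_indices, test_indices) for each video."""
    videos = sorted(set(segment_video_ids))
    for held_out in videos:
        # All segments of the held-out video go to the test set together,
        # so no speaker appears in both training and test data.
        test_idx = [i for i, v in enumerate(segment_video_ids) if v == held_out]
        train_idx = [i for i, v in enumerate(segment_video_ids) if v != held_out]
        yield held_out, train_idx, test_idx

# Hypothetical mapping from each segment to the video it came from.
segment_video_ids = ["v1", "v1", "v1", "v2", "v2", "v3", "v3", "v3", "v3"]
splits = list(leave_one_video_out(segment_video_ids))
```

Splitting by video rather than by segment matters because segments from the same speaker are strongly correlated; holding out whole videos gives a more honest estimate of performance on unseen speakers.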

3.4.1

Results and Discussions

The performance parameters considered are as per [65], both to determine the system performance and to compare results with theirs. The parameters are defined as follows:

Precision: Measure of exactness of classification. $\mathrm{Precision} = \frac{TP}{TP + FP}$

Recall: Measure of correctly classified data for the class the user is interested in (positive class). $\mathrm{Recall} = \frac{TP}{TP + FN}$

Here, TP is the number of correct classifications of the positive examples (true positive), FN is the number of incorrect classifications of positive examples (false negative), FP is the number of negative examples that are incorrectly classified as positive (false positive), and TN is the number of correct classifications of negative examples (true negative). Through simulation, multimodal sentiment analysis is found to perform efficiently on multimodal data fused together. Table 3.6 shows the results obtained from feature-level classification of the fused multimodal feature vector using ANN, HMM, and ELM. It shows that ELM performs better than ANN and HMM at classifying the fused vector to identify the sentiment of the video segment. In terms of accuracy, ELM outperforms ANN by 12% and HMM by 16%. It can also be seen that the integrated multimodal data (text, audio, and video) produce significantly better results (at least 10%) in terms of accuracy and recall compared to single modalities and/or bimodalities.
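The precision and recall definitions given above can be computed directly from predicted and true labels, as in the following Python sketch. The labels are invented, and treating one class as "positive" against the rest is an assumption; the chapter does not state how the three sentiment classes were combined into the reported figures.

```python
def precision_recall(y_true, y_pred, positive_label):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented labels: +1 positive, 0 neutral, -1 negative.
y_true = [1, 1, 1, 0, -1, -1, 0, 1]
y_pred = [1, 0, 1, 0, -1, 1, 1, 1]
precision, recall = precision_recall(y_true, y_pred, positive_label=1)
print(precision, recall)  # 0.6 0.75
```

Here three of the five segments predicted as positive really are positive (precision 0.6), and three of the four truly positive segments are found (recall 0.75).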


Table 3.6 Results obtained from feature-level fusion of different modalities

Modality                         ANN Precision  ANN Recall  HMM Precision  HMM Recall  ELM Precision  ELM Recall
Textual only                     0.544          0.544       0.431          0.430       0.619          0.59
Audio only                       0.573          0.577       0.449          0.430       0.652          0.671
Video only                       0.599          0.592       0.408          0.429       0.681          0.676
Text and audio                   0.6261         0.6262      0.4766         0.4765      0.7115         0.7102
Text and video                   0.637          0.624       0.5085         0.5035      0.7245         0.7185
Audio and video                  0.6442         0.6434      0.5149         0.5148      0.7321         0.7312
Proposed: Text, audio and video  0.688          0.678       0.543          0.564       0.782          0.771

3.5

Conclusion

This chapter has investigated and presented the basic modalities (uni-, bi-, and multimodal) and their methods of fusion for Big data affective analytics and sentiment analysis. It has also presented a general categorization of the multimodal sentiment analysis and fusion techniques developed so far. A multimodal Big data sentiment analysis system has then been proposed that combines three modalities: text, audio, and video. The proposed approach combines data processing, feature extraction, data fusion, supervised classification, and polarity detection. Extreme Learning Machine (ELM), Artificial Neural Network (ANN), and Hidden Markov Model (HMM) classifiers have been considered to assess their suitability for multimodal Big data classification. With multimodal data, this approach produces significantly better results (at least 10%) in terms of emotion classification accuracy and recall than uni- and/or bimodalities. In addition, the ELM classifier performed at least 12% better than ANN and HMM. However, despite the progress made in the field of multimodal sentiment analysis, there is still a long way to go research-wise. Human emotions manifest not only in the face, voice, or words but in the whole body, through hand gestures, body movement, pupil dilation, etc. For a computer to fully understand human emotion, much more research needs to be carried out. Future work should use a much larger scale of multimodal data and explore different domains of emotion analysis to generalize the proposed system.

References

1. Fong, B., Member, S., Westerink, J.: Affective computing in consumer electronics. IEEE Trans. Affect. Comput. 3(2), 129–131 (2012)
2. Abbasi, A., Chen, H., Salem, A.: Sentiment analysis in multiple languages. ACM Trans. Inf. Syst. 26(3), 1–34 (2008)
3. Laney, D.: 3D data management: controlling data volume, velocity, and variety. Appl. Deliv. Strateg. 949, 4 (2001)
4. White, T.: Hadoop: The Definitive Guide (2010)
5. Carlin, S., Curran, K.: Cloud computing technologies. Int. J. Cloud Comput. Serv. Sci. 1(2), 59–65 (2012)
6. Yadollahi, A.L.I., Shahraki, A.G., Zaiane, O.R.: Current state of text sentiment analysis from opinion to emotion mining. ACM Comput. Surv. 50, 2 (2017)
7. Speriosu, M., Sudan, N., Upadhyay, S., Baldridge, J.: Twitter polarity classification with label propagation over lexical links and the follower graph. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 53–56 (2011)
8. Bartlett, M.S., Littlewort, G., Frank, M., Lainscsek, C., Fasel, I., Movellan, J.: Recognizing facial expression: machine learning and application to spontaneous behavior. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 2, 568–573 (2005)
9. Saragih, J.M., Lucey, S., Cohn, J.F.: Face alignment through subspace constrained mean-shifts. In: 2009 IEEE 12th International Conference on Computer Vision, Clm, pp. 1034–1041 (2009)
10. Eyben, F., Wöllmer, M., Schuller, B.: OpenEAR – Introducing the Munich open-source emotion and affect recognition toolkit. In: Proceedings of 2009 3rd International Conference on Affective Computing and Intelligent Interaction Work. ACII 2009 (2009)
11. Lao, S., Kawade, M.: Vision-Based Face Understanding Technologies and Their Applications 2 The Key Technologies of Vision-Based Face Understanding. In: Sinobiometrics, pp. 339–348 (2004)
12. Soleymani, M., Garcia, D., Jou, B., Schuller, B., Chang, S.-F., Pantic, M.: A survey of multimodal sentiment analysis. Image Vis. Comput. 65, 3–14 (2017)
13. Ekman, P.: Emotion in the Human Face, 2nd edn. Cambridge University Press, New York (1982)
14. Russell, J.A., Bachorowski, J.-A., Fernández-Dols, J.-M.: Facial and vocal expressions of emotion. Annu. Rev. Psychol. 54(1), 329–349 (2003)
15. Fang, X., Zhan, J.: Sentiment analysis using product review data. J. Big Data 2, 1 (2015)
16. Bikel, D.M., Sorensen, J.: If we want your opinion. In: International Conference on Semantic Computing, ICSC 2007, pp. 493–500 (2007)
17. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150 (2011)
18. Turney, P.D.: Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 417–424 (2002)
19. Mejova, Y.A.: Sentiment analysis within and across social media streams (2012)
20. Nair, D.S., Jayan, J.P., Rajeev, R.R., Sherly, E.: Sentiment analysis of Malayalam Film review using machine learning techniques, pp. 2381–2384 (2015)
21. Ma, B., Yao, J., Yan, R., Zhang, B.: Facial expression parameter extraction with Cohn-Kanade based database. Int. J. Electr. Energy. 2(2), 103–106 (2014)
22. Oveneke, M., Gonzalez, I., Enescu, V., Jiang, D., Sahli, H.: Leveraging the Bayesian filtering paradigm for vision-based facial affective state estimation. IEEE Trans. Affect. Comput. 14(8), 1–1 (2017)
23. Chen, J., Chen, Z., Chi, Z., Fu, H.: Facial expression recognition in video with multiple feature fusion. IEEE Trans. Affect. Comput. 3045(c), 1–1 (2016)
24. Zhu, Y., Shang, Y., Shao, Z., Guo, G.: Automated depression diagnosis based on deep networks to encode facial appearance and dynamics. IEEE Trans. Affect. Comput. X(X), 1–1 (2017)
25. Khan, M.M., Ward, R., Ingleby, M.: Toward use of facial thermal features in dynamic assessment of affect and arousal level. IEEE Trans. Affect. Comput. 3045(c), 1–1 (2016)
26. Wang, S., Chen, H.-L., Yan, W., Chen, Y., Fu, X.: Face recognition and micro-expression recognition based on discriminant tensor subspace analysis plus extreme learning machine. Neural Process. Lett. 39(1), 25–43 (2014)

27. Deb, S., Dandapat, S.: Emotion classification using segmentation of vowel-like and non-vowel-like regions. IEEE Trans. Affect. Comput. 3045(c), 1–1 (2017)
28. Watanabe, K., Greenberg, Y., Sagisaka, Y.: Sentiment analysis of color attributes derived from vowel sound impression for multimodal expression. In: 2014 Asia-Pacific Signal and Information Processing Association, 2014 Annual Summit and Conference, APSIPA 2014, pp. 0–4 (2014)
29. Sawata, R., Ogawa, T., Haseyama, M.: Novel EEG-based audio features using KDLPCCA for favorite music classification, vol. 3045, pp. 1–14 (2016)
30. Zeng, Z., Hu, Y., Roisman, G.I., Wen, Z., Fu, Y., Huang, T.S.: Audio-visual spontaneous emotion recognition. In: Artificial Intelligence for Human Computing. Lecture Notes in Computer Science, vol. 4451 (2007)
31. Moreno, A., Redondo, T.: Text analytics: the convergence of Big Data and Artificial Intelligence. Int. J. Interact. Multimed. Artif. Intell. 3(6), 57 (2016)
32. Wang, W., Chen, L., Thirunarayan, K., Sheth, A.P.: Harnessing Twitter ‘Big Data’ for automatic emotion identification (2012)
33. Tedeschi, A., Benedetto, F.: A cloud-based big data sentiment analysis application for enterprises’ brand monitoring in social media streams. In: 2015 IEEE 1st International Forum on Research and Technologies for Society and Industry, pp. 186–191 (2015)
34. Qazi, A., Tamjidyamcholo, A., Raj, R.G., Hardaker, G., Standing, C.: Assessing consumers’ satisfaction and expectations through online opinions: expectation and disconfirmation approach. Comput. Human Behav. 75, 450–460 (2017)
35. Lo, W., Tang, Y., Li, Y., Yin, J.: Jointly learning sentiment, keyword and opinion leader in social reviews. In: 2015 IEEE International Conference on Collaboration and Internet Computing, pp. 70–79 (2015)
36. Kherwa, P., Sachdeva, A., Mahajan, D., Pande, N., Singh, P.K.: An approach towards comprehensive sentimental data analysis and opinion mining. In: 2014 IEEE International Advance Computing Conference IACC 2014, pp. 606–612 (2014)
37. Ha, H., et al.: CosMovis: semantic network visualization by using sentiment words of movie review data. In: 19th International Conference Information Visualisation, vol. 19, pp. 436–443 (2015)
38. Guzman, E., Maalej, W.: How do users like this feature? A fine grained sentiment analysis of app reviews. In: Proceedings of the 2014 IEEE 22nd International Requirements Engineering Conference RE 2014, vol. 22, pp. 153–162 (2014)
39. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, pp. 79–86 (2002)
40. Socher, R., et al.: Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (2013)
41. Luyckx, L., Vaassen, F., Peersman, C., Daelemans, W.: Fine-grained emotion detection in suicide notes: a thresholding approach to multi-label classification. Biomed. Inform. Insights. 5, 61 (2012)
42. Hasan, M., Rundensteiner, E., Agu, E.: EMOTEX: detecting emotions in Twitter Messages. In: ASE BIGDATA/SOCIALCOM/CYBERSECURITY Conference, pp. 27–31 (2014)
43. Sawakoshi, Y., Okada, M., Hashimoto, K.: An investigation of effectiveness of ‘Opinion’ and ‘Fact’ sentences for sentiment analysis of customer reviews. In: Proceedings of the 2015 International Conference on Computer Application Technologies, CCATS 2015, pp. 98–102 (2015)
44. Taboada, M., Brooke, J., Tofiloski, M., Voll, K., Stede, M.: Lexicon-based methods for sentiment analysis. Comput. Linguist. 37(2), 267–307 (2011)
45. Hu, Z., Hu, J., Ding, W., Zheng, X.: Review sentiment analysis based on deep learning. In: 2015 12th IEEE International Conference on E-Business Engineering, pp. 87–94 (2015)

46. Li, W., Xu, H.: Text-based emotion classification using emotion cause extraction. Expert Syst. Appl. 41(4 PART 2), 1742–1749 (2014)
47. Cohn, J.F.: Foundations of human computing: facial expression and emotion. In: Proceedings of Eighth ACM Int’l Conference on Multimodal Interfaces (ICMI ’06), vol. 8, pp. 233–238 (2006)
48. Kanade, T., Cohn, J.F., Tian, Y.: Comprehensive database for facial expression analysis. In: Proceedings of IEEE Conference on Face Gesture Recognition, p. 46 (2000)
49. Jia, J., Wu, S., Wang, X., Hu, P., Cai, L., Tang, J.: Can we understand Van Gogh’s Mood?: learning to infer affects from images in social networks. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 857–860 (2012)
50. Wang, X., Jia, J., Hu, P., Wu, S., Tang, J., Cai, L.: Understanding the emotional impact of images. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 1369–1370 (2012)
51. Vonikakis, S., Winkler, V.: Emotion-based sequence of family photos. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 1371–1372 (2012)
52. Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.-F.: Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM International Conference on Multimedia – MM ’13, pp. 223–232 (2013)
53. You, Q., Luo, J., Jin, H., Yang, J.: Robust image sentiment analysis using progressively trained and domain transferred deep networks. In: Twenty-Ninth AAAI, pp. 381–388 (2015)
54. Zhao, S., Yao, H., Gao, Y., Ding, G., Chua, T.-S.: Predicting personalized image emotion perceptions in social networks. IEEE Trans. Affect. Comput. X(X), 1–1 (2016)
55. Guntuku, S.C., Zhou, J.T., Roy, S., LIN, W., Tsang, I.W.: Who likes what, and why? Insights into personality modeling based on image ‘Likes’. IEEE Trans. Affect. Comput. 3045(c), 1–1 (2016)
56. Setchi, R., Asikhia, O.K.: Exploring user experience with image schemas, sentiments, and semantics. IEEE Trans. Affect. Comput. 1–1 (2017)
57. Frome, A., et al.: DeViSE: a deep visual-semantic embedding model. In: Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 2121–2129. Curran Associates, Nevada (2013)
58. Wang, S., Yan, W., Li, X., Zhao, G., Fu, X.: Micro-expression recognition using dynamic textures on tensor independent color space. In: 2014 22nd International Conference on Pattern Recognition, pp. 4678–4683 (2014)
59. Van Bezooijen, R.: The Characteristics and Recognizability of Vocal Expression of Emotions. Foris, Dordrecht (1984)
60. Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A database of German emotional speech. Proc. Interspeech. 1517–1520 (2005)
61. Busso, C., et al.: IEMOCAP: interactive emotional dyadic motion capture database. Lang. Resour. Eval. 42(4), 335–359 (2008)
62. Batliner, A., Hacker, C., Steidl, S., Nöth, E.: ‘You Stupid Tin Box’-Children Interacting with the AIBO Robot: A Cross-linguistic Emotional Speech Corpus, Proc. Lr. (2004)
63. Mairesse, F., Polifroni, J., Di Fabbrizio, G.: Can prosody inform sentiment analysis? Experiments on short spoken reviews. In: Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5093–5096 (2012)
64. Caridakis, G., et al.: Multimodal emotion recognition from expressive faces, body gestures and speech. IFIP Int. Fed. Inf. Process. 247, 375–388 (2007)
65. Douglas-Cowie, E., et al.: The HUMAINE database: addressing the collection and annotation of naturalistic and induced emotional data. Affect. Comput. Intell. Interact. 488–500 (2007)
66. Deng, J., Zhang, Z., Marchi, E., Schuller, B.: Sparse autoencoder-based feature transfer learning for speech emotion recognition. In: Proceedings of 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction ACII 2013, pp. 511–516 (2013)
67. Steidl, S.: Automatic Classification of Emotion-Related User States in Spontaneous Children’s Speech (2009)

68. Rangaswamy, S., Ghosh, S., Jha, S.: Metadata extraction and classification of YouTube videos using sentiment analysis, pp. 1–7 (2016)
69. Gupta, O., Raviv, D., Raskar, R.: Multi-velocity neural networks for facial expression recognition in videos. IEEE Trans. Affect. Comput. 3045(c), 1–1 (2017)
70. Soleymani, M., Pantic, M.: Multimodal emotion recognition in response to videos. IEEE Trans. Affect. Comput. 3(2), 211–223 (2012)
71. Xu, B., Fu, Y., Jiang, Y.-G., Li, B., Sigal, L.: Heterogeneous knowledge transfer in video emotion recognition, attribution and summarization. IEEE Trans. Affect. Comput. 3045(c), 1–13 (2015)
72. Kaya, H., Gürpınar, F., Salah, A.A.: Video-based emotion recognition in the wild using deep transfer learning and score fusion. Image Vis. Comput. 65, 66–75 (2017)
73. Cambria, E.: Affective computing and sentiment analysis. IEEE Intell. Syst. 31(2), 102–107 (2016)
74. Datcu, D., Rothkrantz, L.J.M.: Emotion recognition using bimodal data fusion. In: Proceedings of the 12th International Conference on Computer Systems and Technologies, pp. 122–128 (2011)
75. Hall, D.L., Llinas, J.: An introduction to multisensor data fusion. Proc. IEEE. 85(1), 6–23 (1997)
76. Wu, Z., Cai, L., Meng, H.: Multi-level fusion of audio and visual features for speaker identification. In: Advances in Biometrics, pp. 493–499 (2005)
77. Nicolaou, M.A., Gunes, H., Pantic, M.: Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space. IEEE Trans. Affect. Comput. 2(2), 92–105 (2011)
78. Nicolaou, M.A., Gunes, H., Pantic, M.: Audio-visual classification and fusion of spontaneous affective data in likelihood space. In: 2010 20th International Conference on Pattern Recognition, pp. 3695–3699 (2010)
79. Kanluan, I., Grimm, M., Kroschel, K.: Audio-visual emotion recognition using an emotion space concept. In: 16th European Signal Processing Conference, vol. 16, pp. 486–498 (2008)
80. Poria, S., Cambria, E., Gelbukh, A.: Deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, September, pp. 2539–2544 (2015)
81. Poria, S., Chaturvedi, I., Cambria, E., Hussain, A.: Convolutional MKL based multimodal emotion recognition and sentiment analysis. In: Proceedings – IEEE International Conference on Data Mining, ICDM, pp. 439–448 (2017)
82. Poria, S., Cambria, E., Hussain, A., Bin Huang, G.: Towards an intelligent framework for multimodal affective data analysis. Neural Netw. 63, 104–116 (2015)
83. Perez-Rosas, V., Mihalcea, R., Morency, L.: Utterance-level multimodal sentiment analysis. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pp. 973–982 (2013)
84. Paleari, M., Huet, B.: Toward emotion indexing of multimedia excerpts. In: 2008 International Workshop on Content-Based Multimedia Indexing, pp. 425–432 (2008)
85. Mansoorizadeh, M., Charkari, N.M.: Multimodal information fusion application to human emotion recognition from face and speech. Multimed. Tools Appl. 49(2), 277–297 (2010)
86. Domingos, P.: A few useful things to know about machine learning. Commun. ACM. 55(10), 78 (2012)
87. Giri, S., Bergés, M., Rowe, A.: Towards automated appliance recognition using an EMF sensor in NILM platforms. Adv. Eng. Inform. 27(4), 477–485 (2013)
88. Morency, L.-P., Mihalcea, R., Doshi, P.: Towards multimodal sentiment analysis. In: Proceedings of the 13th International Conference on Multimodal Interfaces – ICMI ’11, p. 169 (2011)
89. Wiebe, J.: Annotating Expressions of Opinions and Emotions in Language, pp. 1–50 (2003)
90. Haghighat, M., Abdel-Mottaleb, M., Alhalabi, W.: Discriminant correlation analysis: real-time feature level fusion for multimodal biometric recognition. IEEE Trans. Inf. Forensics Secur. 11(9), 1984–1996 (2016)

Chapter 4

Hybrid Feature-Based Sentiment Strength Detection for Big Data Applications

Yanghui Rao, Haoran Xie, Fu Lee Wang, Leonard K. M. Poon, and Endong Zhu

Abstract In this chapter, we focus on the detection of sentiment strength values for a given document. A convolution-based model is proposed to encode semantic and syntactic information as feature vectors. The model has two characteristics: (1) it incorporates shape and morphological knowledge when generating semantic representations of documents; (2) it divides words according to their part-of-speech (POS) tags and learns POS-level representations for a document by convolving grouped word vectors. Experiments using six human-coded datasets indicate that our model achieves accuracy comparable to that of previous classification systems and outperforms baseline methods on correlation metrics.

4.1 Introduction

The big data era has descended on e-commerce, health organizations, and many other communities [1]. While there are numerous forms of multimedia content, including images, photos, and videos, text is increasingly becoming a major part of enterprise data [2], allowing readers to understand and extract knowledge from it. For instance, users continuously produce large-scale sentiment-embedded documents about products on the Internet [3]. These can be utilized for capturing the opinions of consumers and the general public about product preferences, company strategies, and marketing campaigns [4]. Sentiment analysis, which has raised

Y. Rao · E. Zhu: School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, China
H. Xie (*) · L. K. M. Poon: Department of Mathematics and Information Technology, The Education University of Hong Kong, New Territories, Hong Kong
F. L. Wang: School of Science and Technology, The Open University of Hong Kong, Kowloon, Hong Kong
© Springer Nature Switzerland AG 2019. K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_4


increasing interest from both the scientific community and the business world [5, 6], refers to the inference of users' views, positions, and attitudes in their written or spoken documents [7]. Both lexical and learning-based approaches have been utilized for this task [8, 9]. Lexical-based methods detect sentiments by exploiting a predefined list of words or phrases, where each word or phrase is associated with a specific sentiment [10]. Learning-based methods often use labeled data to train supervised algorithms, which can adapt and create trained models for specific purposes and contexts [11]. To further improve system performance, hybrid approaches that combine supervised, unsupervised, and semi-supervised learning algorithms have been developed to classify sentiments over multi-domain reviews [12]. With the rapid development of neural language models [13] and word embedding learning [14–18], neural networks have become popular in the field of natural language processing [19]. As a common variant of neural networks, the convolutional neural network (CNN) [20] has been widely used in sentiment analysis and text representation. For instance, Severyn and Moschitti [21] proposed a deep CNN architecture for Twitter sentiment analysis. Ren et al. [22] proposed a context-based CNN to classify sentiments using contextual features of the current tweet and its relevant tweets. To obtain representations from nonconsecutive phrases, Lei et al. [23] proposed a nonconsecutive n-gram model by revising the CNN's temporal convolution operation. In their method, a low-rank n-gram tensor was utilized to extract nonlinear information from documents. On the Stanford Sentiment Treebank dataset, this model achieved better performance than CNN and other neural networks in predicting sentiments.

Different from the aforementioned sentiment analysis studies, which identify coarse-grained labels of documents, this chapter is concerned with measuring users' intensity over each sentimental category [24–26]. In particular, we propose a hybrid feature-based model for single-modality to multimodality data analytics in big data environments. Different from previous studies, we obtain the representation of documents from both shape and semantic aspects. Firstly, we extract character-level and word-level features from the text itself. Secondly, part-of-speech (POS) tag information is introduced to learn syntactic-level features, in which words with the same POS category are convolved to learn the representation vector of the POS tag. Finally, we combine character-, word-, and syntactic-level representations and feed the derived feature vector into a prediction module. Extensive experiments using real-world datasets validate the effectiveness of the proposed model. Note that although our model was proposed to perform sentiment strength detection by exploiting hybrid linguistic features, it can also be employed to encode audio features (e.g., pause duration, loudness, pitch, and voice intensity) and visual features (e.g., smile and look-away duration) for multimodal sentiment analysis [27–29]. This is because existing methods for multimodal sentiment analysis rely primarily on mapping multimodal information to parts of text [30]. Comprehensive research and evaluation on audiovisual emotion recognition using a hybrid deep model have been carried out in the most recent literature [31].

4.2 Related Work

In this section, we ﬁrst summarize recent works related to sentiment strength detection. Then, we introduce some literature on social emotion mining, shedding light on research in this area.

4.2.1 Sentiment Strength Detection

With the growth of the social web and crowdsourcing, sentiment can be assessed for the strength or intensity with which a set of sentiment labels is expressed [25]. Unlike other coarse-grained or aspect-based [32] sentiment analysis tasks, sentiment strength detection aims to predict the fine-grained intensity of sentiments for unlabeled documents [33]. For example, dos Santos et al. [26] used two convolutional layers to extract relevant features by exploiting morphological and shape information: the first layer learned character-level embeddings, and the second generated text representations. Unfortunately, this method focused mainly on sentiment strength detection specific to one context. To enhance the representation learning of corpus-specific information, Chen et al. [34] proposed a convolutional neural network that adds the one-hot vector of each word on top of the general real-valued vector (e.g., Word2Vec). The limitation of this study, however, is that the dimension of the one-hot vector would be too high for a large-scale corpus.

4.3 Social Emotion Mining

Social emotion mining is concerned with the annotation of diverse emotional responses (e.g., joy, anger, fear, shame, and sadness) of various contributors from documents. For instance, many social media services provide channels that allow users to express their emotions after browsing certain online news articles. These news articles are then combined with the emotional responses, expressed by various readers as votes against a set of predefined emotion labels. The aggregation of such emotional responses is called social emotion. Social emotion classification aims to predict the aggregation of emotional responses shared by different users; this computational task has been one of the benchmark tasks since the "SemEval" conference was held in 2007 [35]. Prior studies on social emotion classification often adopted word-level classification techniques that failed to effectively distinguish different emotional senses carried by the same word. To address this weakness, the emotion–topic model [36] was developed to classify social emotions with reference to "topics," each of which represents a semantically coherent "concept." However, existing topic-level emotion classification methods suffer from the data sparsity problem (e.g., the sparse word co-occurrence patterns found in an online textual corpus) and the issue of context adaptation. To alleviate these problems, two supervised topic models were proposed to detect social emotions over short documents [37], and a contextual supervised topic model was developed for adaptive social emotion classification [38], as follows.

Firstly, documents that include only a few words are becoming increasingly prevalent. More and more users post short messages to express their feelings and emotions through Twitter, Flickr, YouTube, and other apps. However, short messages typically include only a few words, which results in sparse word co-occurrence patterns. Thus, traditional topic models suffer from a severe data sparsity problem when inferring latent topics. In light of these considerations, two supervised intensive topic models [37] were proposed that jointly model multiple emotion labels, the valence scored by numerous users, and word pairs for short-text emotion detection.

Secondly, adaptive social emotion classification is concerned with readers' emotional responses to unlabeled data in a context that differs from that of the labeled data. Online data is found in many domains and contexts such as economics, entertainment, and sports; adaptive social emotion classification uses a model trained on one source context to build a model for another target context. This is challenging, as topics that evoke a certain emotion in readers are often context-sensitive. To this end, a contextual supervised topic model [38] was developed to classify reader emotions across different contexts.
Although social emotion mining is homologous to sentiment strength detection in terms of assessing the intensity with which a set of emotion or sentiment categories is expressed, the methods of measuring social emotions from the reader’s perspective are inappropriate for detecting writers’ sentiment strengths [38].

4.4 Proposed Model

In this section, we ﬁrstly introduce the prediction of sentiment strengths using the convolutional-pooling operation in our method. Secondly, we illustrate the proposed CNN. Finally, we describe the process of parameter estimation and network training.

4.4.1 Problem Definition

Sentiment strength prediction aims to predict the sentimental intensity of unlabeled documents. In this study, we focus on the prediction of sentimental strength distribution for a given document. A convolution-based model is proposed to encode semantic and syntactic information that holds contextual features as feature vectors, which achieves competitive performance on sentiment strength prediction. The frequently used terms are summarized in Table 4.1.

Table 4.1 Notations of frequently used terms

Notation          Description
$V^x$             Vocabulary of instances on the x-level
$W^x$             Embeddings lookup matrix on the x-level
$\hat{W}^{ch}$    Character-level word embeddings
$dim^x$           Dimension of the input vectors to the x-level convolutional layer
$dim_o^x$         Dimension of the output vectors from the x-level convolutional layer
$W^{k^x}$         Weight matrix of the convolutional kernel
$win^x$           Size of the convolutional kernel $k^x$
$E^x$             Input feature matrix of the convolutional layer
$C^x$             Generated feature maps from the x-level convolutional layer

We employ character-level, word-level, and part-of-speech (POS)-level convolution in our model, where superscripts ch, w, and pos are used to specify character, word, and POS features in the related convolutional layer. Given a vocabulary $V^x$ ($x = ch, w, pos$) and a sequence of input instances $I_1^x, \ldots, I_m^x$, we firstly construct the input embedding matrix $E^x \in \mathbb{R}^{m \times dim^x}$. In the above, $dim^x$ is the dimension of the feature vector. Secondly, the vector-level convolution is performed by convolutional kernel $k^x$ with a fixed-size window $win^x$. Finally, we perform max-over-time pooling as in [39] over the generated feature maps, which produces high-level feature vectors with the same length, i.e., $dim_o^x$.
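The three steps above (build the embedding matrix, convolve over fixed-size windows, pool over time) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy sizes, and the random initialization are assumptions made only for illustration.

```python
import random


def convolve_windows(E, W, b, win):
    """Apply one kernel to every window of `win` consecutive row vectors of E.

    E: m x dim input embedding matrix (list of lists)
    W: dim_o x (win * dim) kernel weight matrix
    b: bias vector of length dim_o
    Returns one feature map (a vector of length dim_o) per window position.
    """
    feature_maps = []
    for j in range(len(E) - win + 1):
        # Concatenate the `win` vectors of the current window into one vector.
        window = [value for row in E[j:j + win] for value in row]
        # Matrix-vector product plus bias yields the j-th feature map.
        feature_maps.append(
            [sum(w * v for w, v in zip(W_row, window)) + b_k
             for W_row, b_k in zip(W, b)]
        )
    return feature_maps


def max_over_time(feature_maps):
    """Keep the largest value of each output dimension across all positions."""
    return [max(column) for column in zip(*feature_maps)]


# Toy sizes, assumed for illustration only.
rng = random.Random(0)
m, dim, dim_o, win = 5, 4, 3, 2
E = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(m)]
W = [[rng.uniform(-1, 1) for _ in range(win * dim)] for _ in range(dim_o)]
b = [0.0] * dim_o

C = convolve_windows(E, W, b, win)   # m - win + 1 feature maps
f = max_over_time(C)                 # fixed-length vector of size dim_o
print(len(C), len(f))
```

Whatever the input length m, the pooled vector always has length dim_o, which is what allows inputs of different lengths to be compared.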

4.4.2 Network Architecture

In this part, we detail the architecture of the proposed Hybrid Convolutional Neural Network (HCNN). Our HCNN encodes character-, word-, and POS-level features by convolution, with the following characteristics: (1) it incorporates shape and morphological knowledge when generating semantic representations of documents, and (2) it divides words via their POS tags and learns POS-level representations by convolving grouped word vectors, thus enriching features for sentiment strength prediction. The architecture of HCNN is shown in Fig. 4.1. Regarding semantic representation learning, we map each document to a vector with fixed length. Although one-hot encoding is a straightforward way to represent documents, it cannot capture the sentimental meaning and relevance between different words [40]. Furthermore, it suffers from the high dimensionality of word vectors if applied to a large-scale dataset [41]. To address these limitations, we leverage real-valued word vectors trained from a large-scale text corpus [17, 18]. Unlike previous works [20, 42] that directly applied convolution over word-level embeddings, we also learn character-level embeddings to enrich the meanings of these vectors. The reason is that some character information, such as "#" and the adverb suffix "ly", helps to enhance the quality of word embeddings [22, 26, 48]. Inspired by [26], we utilize the CNN to explore each


Fig. 4.1 Architecture of the proposed HCNN

word's character-level representation. Given a word consisting of m characters $I_1^{ch}, \ldots, I_m^{ch}$, we firstly index vectors for each character from a global lookup matrix $W^{ch}$ and construct an embedding matrix $E^{ch} \in \mathbb{R}^{m \times dim^{ch}}$ fed into the convolutional layer as follows:

$$r_i^{ch} = W^{ch} I_i^{ch} \quad (i = 1, \ldots, m) \tag{4.1}$$

$$E^{ch} = \left[ r_1^{ch}, \ldots, r_m^{ch} \right] \tag{4.2}$$

Then, the character-level kernel $k^{ch}$ convolves $win^{ch}$ character vectors in the current window and yields a feature map $C^{ch}$ as follows:

$$C^{ch}[j] = W^{k^{ch}} \otimes \left( E^{ch}[j] \oplus \cdots \oplus E^{ch}[j + win^{ch} - 1] \right) + b^{k^{ch}} \tag{4.3}$$

where $C^{ch}[j]$ is the j-th generated feature map, $b^{k^{ch}}$ is the bias item, and $W^{k^{ch}} \in \mathbb{R}^{dim_o^{ch} \times (win^{ch} \times dim^{ch})}$ denotes the kernel weight matrix in the character-level convolutional layer. Additionally, $\otimes$ represents the matrix multiplication operation and $\oplus$ is the vector concatenation operator. In particular, index j specifies the position of the character; i.e., $E^{ch}[j]$ is the vector representation of the j-th input character. From the above equations, we can see that the size of $C^{ch}$ depends on the number of characters in the word. To produce character-level embeddings with the same length, we perform max-over-time pooling over these feature maps, in which the most active feature in each "temporal" dimension is kept, producing a vector of size $dim_o^{ch}$. We denote the character-level embeddings of words that encode shape information as $\hat{W}^{w}$.

After getting character-level representations of words, we begin to learn the semantic representation of the document or text. Assume that the input document contains n word tokens $I_1^w, \ldots, I_n^w$. Because convolution inherently captures contextual information, we again use the convolution mechanism to obtain a better feature vector of the document. We firstly look up the vectors for the words in the document, as follows:

$$r_i^w = W^w I_i^w \oplus \hat{W}^w I_i^w \quad (i = 1, \ldots, n) \tag{4.4}$$

where $r_i^w$ is the representation vector of the i-th input word token. The above equation is a little different to Eq. (4.1) because we integrate morphological knowledge into the word representations. Similar to Eq. (4.2), a document matrix $E^w \in \mathbb{R}^{n \times (dim^w + dim_o^{ch})}$ is constructed vector by vector and fed into the word-level convolutional layer, as follows:

$$C^w[j] = W^{k^w} \otimes \left( E^w[j] \oplus \cdots \oplus E^w[j + win^w - 1] \right) + b^{k^w} \tag{4.5}$$

$$W^{k^w} \in \mathbb{R}^{dim_o^w \times (win^w \times (dim^w + dim_o^{ch}))} \tag{4.6}$$

where a word-level kernel weight matrix $W^{k^w}$ is used to extract contextual features from $win^w$ neighboring word vectors with length $dim^w + dim_o^{ch}$, and feature maps with the same dimensionality, i.e., $dim_o^w$, are produced. We also note that the number of semantic feature maps differs across documents, so we perform max-over-time pooling [39], which only preserves the most important features, as follows:

$$f^{sem} = \max\left(C^w[:, 1]\right) \oplus \cdots \oplus \max\left(C^w[:, dim_o^w]\right) \tag{4.7}$$

where $C^w[:, 1]$ is the set of the first elements from all feature maps. If we regard each feature map as a discrete-time signal, $C^w[:, 1]$ is a set of signal values at the first time unit. The pooling is a process that keeps the highest value and drops the values from the other "signals" at each time unit. The generated $f^{sem}$, which has the same dimensionality as each feature map, records the semantic information of the input document.

As for the POS-level features, previous works [43, 44] observed that adjectives, verbs, and nouns have a high correlation with sentiment. Inspired by this conclusion, we learn POS tag representations by encoding the semantic representations of the words belonging to the tags and send them to the sentiment strength predictor. Our intuition is that convolving words with the same POS annotation can help to extract more sentimental features related to this kind of tag. Also, real-valued word vectors hold more sentiment information than other representations, so we perform the convolution operation on vectors generated by Eq. (4.4). After POS tagging, we divide words into four groups according to the annotation results [34] and transform the words into the input embedding matrix of the POS-level convolutional layer. We learn POS-tag representations of the document via convolution as follows:

$$C_t^{pos}[j] = W^{k^{pos}, t} \otimes \left( E_t^{pos}[j] \oplus \cdots \oplus E_t^{pos}[j + win_t^{pos} - 1] \right) + b^{k^{pos}, t} \tag{4.8}$$

$$f_t^{pos} = \max\left(C_t^{pos}[:, 1]\right) \oplus \cdots \oplus \max\left(C_t^{pos}[:, dim_o^{pos}]\right) \tag{4.9}$$

where $W^{k^{pos}, t}$ is the parameter matrix of the convolutional layer used to learn the representation of POS tag t (t = J, N, V, O). We get four POS-tag feature vectors, $f_J^{pos}$, $f_N^{pos}$, $f_V^{pos}$, and $f_O^{pos}$, with the same size $dim_o^{pos}$ after we conduct the pooling operation over the feature map matrix $C_t^{pos}$ (t = J, N, V, O). The feature vector of document d for strength prediction is generated by concatenating its semantic representation $f^{sem}$ and the POS-tag representations mentioned above, i.e.,

$$f_d = f^{sem} \oplus f_J^{pos} \oplus f_N^{pos} \oplus f_V^{pos} \oplus f_O^{pos} \tag{4.10}$$
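The grouping and concatenation steps can be sketched as follows. The sketch assumes that the groups J, N, V, and O correspond to adjectives, nouns, verbs, and all other tags, and that tags follow Penn Treebank naming; neither assumption is stated explicitly in the text. The tagged tokens and the feature values are hand-written placeholders, not output of a real tagger or network.

```python
def pos_group(tag):
    """Map a Penn Treebank-style tag to one of the four groups (assumed mapping)."""
    if tag.startswith("JJ"):
        return "J"  # adjectives
    if tag.startswith("NN"):
        return "N"  # nouns
    if tag.startswith("VB"):
        return "V"  # verbs
    return "O"      # everything else


def group_words(tagged_tokens):
    """Split (word, tag) pairs into the four POS groups, preserving word order."""
    groups = {"J": [], "N": [], "V": [], "O": []}
    for word, tag in tagged_tokens:
        groups[pos_group(tag)].append(word)
    return groups


def concat_features(f_sem, f_pos):
    """Concatenate the semantic vector with the four POS-tag vectors in J, N, V, O order."""
    f_d = list(f_sem)
    for t in ("J", "N", "V", "O"):
        f_d.extend(f_pos[t])
    return f_d


# Hand-written example input; a real pipeline would obtain tags from a POS tagger.
tagged = [("the", "DT"), ("movie", "NN"), ("was", "VBD"),
          ("really", "RB"), ("great", "JJ")]
groups = group_words(tagged)
print(groups)

# Placeholder pooled vectors standing in for the convolution outputs.
f_d = concat_features([0.1, 0.2], {"J": [1.0], "N": [2.0], "V": [3.0], "O": [4.0]})
print(f_d)
```

Each group would then be embedded and passed through its own convolution and pooling before the concatenation step.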

The last component of our HCNN is the sentiment strength prediction module, which contains a fully connected layer and a softmax layer. For each unlabeled document d, the sentiment strength is calculated as follows:

$$s_d = W^{soft} \otimes h\left( W^{fully} \otimes f_d + b^{fully} \right) + b^{soft} \tag{4.11}$$

where notations with W and b denote the parameter matrix and bias vector in the layer, respectively. System output $s_d$ is a vector and its value on each dimension is the predicted intensity of the corresponding sentiment.
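A minimal sketch of such a prediction module is given below. Two assumptions are made that the text does not specify: the activation h is taken to be tanh, and the output of the softmax layer is normalized with the softmax function so that the strengths sum to one. All weights are toy values.

```python
import math


def affine(W, x, b):
    """Compute W x + b for a weight matrix W, input vector x, and bias b."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]


def softmax(z):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict_strength(f_d, W_fully, b_fully, W_soft, b_soft):
    """Fully connected layer with tanh (assumed), then a softmax-normalized output."""
    hidden = [math.tanh(v) for v in affine(W_fully, f_d, b_fully)]
    return softmax(affine(W_soft, hidden, b_soft))


# Toy document feature vector and parameters, for illustration only.
f_d = [0.5, -1.0, 2.0]
W_fully = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.0]]
b_fully = [0.0, 0.1]
W_soft = [[1.0, -1.0], [-1.0, 1.0]]
b_soft = [0.0, 0.0]

s_d = predict_strength(f_d, W_fully, b_fully, W_soft, b_soft)
print(s_d)
```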

4.4.3 Parameter Estimation

The parameter set in this research is $\theta = [W^{ch}, W^{w}, W^{k^{ch}}, W^{k^{pos}}, W^{fully}, W^{soft}, b^{k^{ch}}, b^{k^{w}}, b^{k^{pos}}, b^{fully}, b^{soft}]$. The training objective is to minimize the dissimilarity between the real sentiment strength distribution and the predicted strength distribution. We use Kullback–Leibler divergence (KL divergence) [45] to measure the dissimilarity of two probability distributions, $s_d^{gold}$ and $s_d$, as defined in the following equations:

$$Loss(d) = \sum_{i=1}^{n} s_d^{gold}[i] \log\left(s_d^{gold}[i]\right) - \sum_{i=1}^{n} s_d^{gold}[i] \log\left(s_d[i]\right) \tag{4.12}$$

$$Loss(\theta) = \sum_{d \in D} Loss(d) \tag{4.13}$$

where $Loss(d)$ is the prediction loss of training document d and $Loss(\theta)$ is the total loss over D, which denotes the set of training documents. Here, n is the number of predefined sentiments and $s_d^{gold}$ is the real sentiment strength of d. Model training is conducted by back-propagation (BP) and stochastic gradient descent (SGD). As the optimizer, Adadelta [46] is adopted to avoid tuning the learning rate manually.
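The KL-divergence loss of Eqs. (4.12) and (4.13) can be written in a few lines of Python. This is a minimal sketch: the function names are illustrative, and the small constant `eps` that guards against taking the logarithm of zero is an added assumption, not part of the chapter.

```python
import math


def kl_loss(s_gold, s_pred, eps=1e-12):
    """KL divergence between the gold and predicted strength distributions of one document."""
    loss = 0.0
    for g, p in zip(s_gold, s_pred):
        if g > 0:
            # Terms with zero gold mass contribute nothing (0 * log 0 is taken as 0).
            loss += g * math.log(g) - g * math.log(max(p, eps))
    return loss


def total_loss(gold_list, pred_list):
    """Sum of the per-document losses over a set of training documents."""
    return sum(kl_loss(g, p) for g, p in zip(gold_list, pred_list))


# Toy distributions over two sentiment labels.
gold = [[0.7, 0.3], [1.0, 0.0]]
pred = [[0.6, 0.4], [0.5, 0.5]]
print(total_loss(gold, pred))
```

The loss is zero when the predicted distribution matches the gold distribution exactly and grows as the two diverge.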

4.5 Experiments

To test the effectiveness and robustness of our model on sentiment strength detection over text data, a real-world corpus was employed in the experiment (footnote 1). The corpus includes BBC Forum posts (BBC), Digg.com posts (Digg), MySpace comments (MySpace), Runners World forum posts (Runners World), Twitter posts (Twitter), and YouTube comments (YouTube). Each document was manually labeled with positive and negative sentiment strengths by annotators, who were allowed to use their own judgment rather than being trained to annotate in a predefined way. The positive sentiment strength value ranges from 1 (not positive) to 5 (extremely positive), and the negative sentiment strength value ranges from 1 (not negative) to 5 (extremely negative). The evaluation process was set according to [34]. Table 4.2 summarizes the statistics of the dataset, where the second column presents the number of documents in each subset and the third specifies the average document length (i.e., the mean number of words). For each subset, we randomly selected 60% of documents as training samples, 20% as validation samples, and the remaining 20% for testing.
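The random 60/20/20 split described above can be sketched as follows. The function name and the fixed seed are illustrative assumptions; the chapter does not say how the random selection was implemented.

```python
import random


def split_dataset(documents, seed=0):
    """Shuffle the documents and split them 60% / 20% / 20% into train, validation, test."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    # Integer arithmetic avoids floating-point rounding in the split sizes.
    n_train = len(docs) * 6 // 10
    n_valid = len(docs) * 2 // 10
    train = docs[:n_train]
    valid = docs[n_train:n_train + n_valid]
    test = docs[n_train + n_valid:]
    return train, valid, test


# Example with 1000 placeholder document ids (the size of the BBC subset).
train, valid, test = split_dataset(range(1000))
print(len(train), len(valid), len(test))  # 600 200 200
```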

Table 4.2 Statistics of the dataset

Dataset         # of documents   Average length
BBC             1000             64.76
Digg            1077             33.63
MySpace         1041             19.76
Runners World   1046             64.25
Twitter         4242             16.81
YouTube         3407             17.38

Footnote 1: http://sentistrength.wlv.ac.uk/documentation/

Table 4.3 Parameter settings of HCNN

Parameter      Value
$dim^{ch}$     10
$dim^{w}$      200
$dim^{pos}$    70
$win^{ch}$     1
$win^{w}$      1
$win^{pos}$    1
$dim_o^{ch}$   20
$dim_o^{w}$    80

4.5.1 Experiment Design

In order to evaluate the performance of the proposed HCNN, we implemented the character to sentence convolutional neural network (CharSCNN) [26], convolutional neural networks (CNNs) [20], and the long short-term memory (LSTM) [47] for comparison. In particular, LSTM takes the whole corpus as a single sequence, and the final state of all words is used as the feature for prediction. To obtain rich semantic features, we use pretrained GloVe word vectors in the proposed HCNN. GloVe is a log-bilinear regression model for unsupervised word representation learning. It aggregates global word-word co-occurrence statistics from a corpus and learns representations that show interesting linear substructures of the word vector space [18]. Hyperparameters are selected by testing on the validation set. We summarize the parameter settings of the proposed HCNN in Table 4.3. Hyperparameters used in Adadelta are kept the same as in [46].

As mentioned earlier, the aim of this chapter is to perform fine-grained sentiment detection, i.e., to predict the sentimental strength of documents. To evaluate the prediction performance, we firstly estimate Micro-averaged F1 values of the output strength. We obtain a predicted label with the highest sentimental strength and several actual top-ranked labels, which are observed from the gold sentimental strength distribution of document d. The prediction is true if the predicted label exists in the actual top-ranked labels. If two or more labels have the same strength value, then their positions are interchangeable. Assume a binary value $pred_{\hat{d}}$ denotes the true or false output, as follows:

$$pred_{\hat{d}} = \begin{cases} 1, & y_{\hat{d}} \in Y_{\hat{d}}^{gold} \\ 0, & \text{otherwise} \end{cases} \tag{4.14}$$

where $y_{\hat{d}}$ is the label having the highest predicted sentimental intensity and $Y_{\hat{d}}^{gold}$ represents a set of labels derived from the gold sentimental strength of $\hat{d}$ according to strength value. Then, we compute the Micro-averaged F1 value based on $pred_{\hat{d}}$, in which only the best match is the acceptable prediction, as follows:

$$\text{Micro-averaged } F_1 = \frac{\sum_{\hat{d} \in D_{test}} pred_{\hat{d}}}{|D_{test}|} \tag{4.15}$$

where $D_{test}$ is the collection of testing documents. A larger Micro-averaged F1 value indicates that the model is more effective in predicting the top-ranked sentiment. We also note that Micro-averaged F1 is not enough to measure system performance since it does not take strength distributions into account. In more detail, Micro-averaged F1 only focuses on the relevance between labels with strong polarities and neglects the relevance between the predicted strength vector and the gold strength vector. To address this problem, we employ a fine-grained metric, AP (the averaged Pearson's correlation coefficient), to evaluate the performance of predictions. For each class e, its AP is calculated as follows:

$$AP(\theta) = \frac{Cov\left(S_e, S_e^{gold}\right)}{\sqrt{Var\left(S_e\right) \, Var\left(S_e^{gold}\right)}} \tag{4.16}$$

where Cov and Var denote the covariance and variance, and $S_e \in \mathbb{R}^{|D_{test}|}$ is a concatenation vector of $s_d[e]$ (in Eq. (4.11), $d = 1, \ldots, |D_{test}|$), i.e., the predicted strength values over e for the testing documents. Similarly, $S_e^{gold} = s_1^{gold}[e] \oplus \cdots \oplus s_{|D_{test}|}^{gold}[e]$. From the definition, we can observe that AP measures the relevance between the real strength and the predicted strength for each sentiment label, which is more reasonable and reliable than Micro-averaged F1. The value of AP ranges from -1 to 1, where 1 indicates that the real distributions and the system-produced distributions are perfectly correlated.
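Both metrics can be computed with a short pure-Python sketch. One interpretation is assumed here: the gold label set of a document is taken to be the labels tied for the highest gold strength, which matches the remark that labels with equal strength are interchangeable. The function names and the toy data are illustrative.

```python
import math


def micro_f1(pred_strengths, gold_strengths):
    """Fraction of documents whose top predicted label is among the top gold labels."""
    hits = 0
    for pred, gold in zip(pred_strengths, gold_strengths):
        y_hat = max(range(len(pred)), key=lambda i: pred[i])
        top = max(gold)
        gold_top = {i for i, g in enumerate(gold) if g == top}  # ties are all accepted
        if y_hat in gold_top:
            hits += 1
    return hits / len(pred_strengths)


def pearson(xs, ys):
    """Pearson correlation coefficient; undefined if either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ap_for_class(pred_strengths, gold_strengths, e):
    """Correlate predicted and gold strengths of class e across all test documents."""
    return pearson([p[e] for p in pred_strengths], [g[e] for g in gold_strengths])


# Toy predictions and gold strengths for three documents and two labels.
pred = [[0.7, 0.3], [0.4, 0.6], [0.5, 0.5]]
gold = [[0.6, 0.4], [0.8, 0.2], [0.5, 0.5]]
print(micro_f1(pred, gold))
print(ap_for_class(pred, gold, 0))
```

In the toy data, the second document counts as a miss for Micro-averaged F1 even though its predicted strengths are not far from the gold ones, which is the kind of information that only the correlation-based metric captures.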

4.5.2 Results and Analysis

In this stage, we firstly conducted internal testing on the six subsets in Table 4.2 and demonstrated the effectiveness of the proposed HCNN using Micro-averaged F1 and AP. Table 4.4 presents the experimental results. According to Table 4.4, although our method performed worse than the CharSCNN baseline in terms of Micro-averaged F1 over the BBC dataset, it obtained a better AP value than all baselines. As for the Digg dataset, our method outperformed all baselines except CNNs, possibly because both the Digg and BBC datasets contain primarily long and formal texts, and word-level features already carry enough sentiment information for the task. With respect to the MySpace dataset, our method improved AP by 0.1822, 0.2139, and 0.1491 (i.e., 18.22, 21.39, and 14.91 percentage points) compared with CharSCNN, CNNs, and LSTM, respectively, and attained the highest Micro-averaged F1 value, the same as that of CNNs. For the remaining subsets, i.e., Runners World, Twitter, and YouTube, our method achieved the best results on both Micro-averaged F1 and AP. Through the comparison between the baselines and HCNN, we can observe that the introduced POS information does enhance the performance of the convolutional architecture for sentimental strength detection.

Table 4.4 Performance of different models using default representations

Dataset         Model      F1       AP(pos)   AP(neg)
BBC             CharSCNN   0.9250   0.3044    0.3044
BBC             CNNs       0.9150   0.3268    0.3268
BBC             LSTM       0.9000   0.0997    0.0997
BBC             HCNN       0.9200   0.3855    0.3855
Digg            CharSCNN   0.8465   0.3119    0.3119
Digg            CNNs       0.8698   0.4773    0.4773
Digg            LSTM       0.7581   0.2166    0.2166
Digg            HCNN       0.8558   0.4613    0.4613
MySpace         CharSCNN   0.9087   0.3443    0.3443
MySpace         CNNs       0.9135   0.3126    0.3126
MySpace         LSTM       0.8990   0.3774    0.3774
MySpace         HCNN       0.9135   0.5265    0.5265
Runners World   CharSCNN   0.8469   0.3011    0.3011
Runners World   CNNs       0.8469   0.3058    0.3058
Runners World   LSTM       0.8134   0.2055    0.2055
Runners World   HCNN       0.8660   0.4023    0.4023
Twitter         CharSCNN   0.8561   0.2779    0.2779
Twitter         CNNs       0.8585   0.3391    0.3391
Twitter         LSTM       0.8373   0.3230    0.3230
Twitter         HCNN       0.8750   0.4235    0.4235
YouTube         CharSCNN   0.8546   0.5708    0.5708
YouTube         CNNs       0.8678   0.5722    0.5722
YouTube         LSTM       0.8341   0.5505    0.5505
YouTube         HCNN       0.8693   0.6461    0.6461

To test the effectiveness of our neural network architecture, we also conducted experiments on the baseline models using the same word-, character-, and POS-level hybrid embeddings. For CNN and LSTM, we added character and POS embeddings to their original word embedding matrix. For CharSCNN, we added POS embeddings to its word embedding matrix and left the character embedding matrix unchanged. The experiment results are shown in Table 4.5. As the results indicate, the performance of the baseline models decreased in most cases, possibly because the baseline models treat features from different levels equally and put them into the same convolutional layers. Thus, the architecture of our HCNN model, which convolves the three levels of features separately, is more suitable for hybrid inputs.

Since documents in the real-world environment may come from different data sources or even different domains, we also conducted external testing over datasets using different models. In more detail, we conducted experiments over HCNN and baseline models to test their adaptiveness and robustness when transferring from one

dataset to another, where the two datasets held different feature distributions. When we trained HCNN on dataset A and tested its prediction performance on dataset B, 20% of the documents in A were selected as the validation set and the rest as training samples. After the model converged, all documents of dataset B were fed into the testing module. Experimental results are shown in Figs. 4.2, 4.3, and 4.4. We observed that HCNN was comparable with or outperformed the baseline methods on the coarse-grained Micro-averaged F1 except for the results in "Runners World versus Others." This indicates that POS-level information can provide meaningful sentimental features for strength prediction even though the training set and the testing set are from different sources. We also observed that CNNs achieved better results than CharSCNN on both metrics in most cases. This indicates that shape and morphological information (i.e., character-level features) may introduce some noise, since the gap between datasets from different sources is quite large. Unlike character-level features, POS-level features are polarity-specific rather than dataset-specific, so models with POS-level features improve in external testing while models with character-level features do not perform well. Overall, HCNN obtained competitive polarity classification results (i.e., Micro-averaged F1) and achieved better results than the baseline methods on the AP metric.

Table 4.5 Performance of baseline models using hybrid embeddings

Dataset         Model      F1       AP(pos)   AP(neg)
BBC             CNNs       0.9000   0.2521    0.2521
BBC             LSTM       0.9100   0.1868    0.1868
BBC             CharSCNN   0.9050   0.0627    0.0627
Digg            CNNs       0.8150   0.2901    0.2901
Digg            LSTM       0.8200   0.1673    0.1673
Digg            CharSCNN   0.8050   0.1038    0.1038
MySpace         CNNs       0.9040   0.1432    0.1432
MySpace         LSTM       0.9042   0.0961    0.0961
MySpace         CharSCNN   0.9038   0.0953    0.0953
Runners World   CNNs       0.8333   0.0598    0.0598
Runners World   LSTM       0.7952   0.0196    0.0196
Runners World   CharSCNN   0.8333   0.0631    0.0631
Twitter         CNNs       0.8303   0.0701    0.0701
Twitter         LSTM       0.8351   0.0243    0.0243
Twitter         CharSCNN   0.8516   0.3658    0.3658
YouTube         CNNs       0.7703   0.0532    0.0532
YouTube         LSTM       0.7474   0.0301    0.0301
YouTube         CharSCNN   0.7709   0.0676    0.0676

Fig. 4.2 External testing of different models on F1. (a) BBC vs. Others. (b) Digg vs. Others. (c) MySpace vs. Others. (d) Runners World vs. Others. (e) Twitter vs. Others. (f) YouTube vs. Others. Each panel plots the F1 value (vertical axis) of HCNN, CharSCNN, CNNs, and LSTM on each of the other datasets (horizontal axis).

4.6 Conclusion

Fig. 4.3 External testing of different models on AP(pos). (a) BBC vs. Others. (b) Digg vs. Others. (c) MySpace vs. Others. (d) Runners World vs. Others. (e) Twitter vs. Others. (f) YouTube vs. Others. Each panel plots the AP(pos) value (vertical axis) of HCNN, CharSCNN, CNNs, and LSTM on each of the other datasets (horizontal axis).

With the development of Web 2.0 technology, many users express their feelings and opinions through reviews, blogs, news articles, and tweets/microblogs. Since the sentimental information embedded in user-generated subjective documents is useful for product recommendation and other applications, sentiment analysis is quite popular in the field of natural language processing. Sentiment classification is a supervised learning task within sentiment analysis that aims to automatically assign sentimental categories to unlabeled documents. Sentiment classification can be regarded as coarse-grained sentiment analysis, since it mainly determines the sentimental polarity of documents and neglects the intensity of each category. Different from sentiment classification tasks, sentiment strength detection predicts sentiment strengths rather than sentiment indicators. As real-world documents can

88

a

Y. Rao et al.

b

0.4 HCNN

CharSCNN

CNNs

LSTM

HCNN

0.35

CharSCNN

CNNs

LSTM

0.5

0.3 0.4 AP(neg)

AP(neg)

0.25 0.2

0.3

0.15 0.2 0.1 0.1 0.05 0

Digg

MySpace

Runners World

Twitter

0

YouTube

c

BBC

MySpace Runners World

Twitter

YouTube

d 0.5 HCNN

CharSCNN

CNNs

LSTM

HCNN

CharSCNN

CNNs

LSTM

0.45

0.5

0.4 0.4 0.35 0.3 AP(neg)

AP(neg)

0.3 0.2

0.25 0.2

0.1 0.15 0

0.1 0.05

−0.1

0 BCC

Digg

Runners World

Twitter

YouTube

e

BCC

Digg

MySpace

Twitter

YouTube

f HCNN

CharSCNN

CNNs

LSTM

HCNN

0.6

CharSCNN

CNNs

LSTM

0.6 0.5 0.5

AP(neg)

AP(neg)

0.4 0.4 0.3

0.2

0.2

0.1

0.1 0

0.3

BCC

Digg

MySpace

Runners World

YouTube

0

BCC

Digg

MySpace Runners World

Twitter

Fig. 4.4 External testing of different models on AP(neg). (a) BBC vs. Others. (b) Digg vs. Others. (c) MySpace vs. Others. (d) Runners World vs. Others. (e) Twitter vs. Others. (f) YouTube vs. Others

express ﬁne-grained sentiments, sentiment strength detection is quite meaningful since it takes both polarity and intensity into consideration. In this chapter, we propose a framework HCNN for sentiment strength prediction which convolves hybrid (character-level, word-level, and POS-level) features to predict intensity over sentimental labels. The main characteristics of HCNN are as follows: (1) it incorporates shape and morphological knowledge when generating semantic representations of documents, and (2) it divides words according to their

4 Hybrid Feature-Based Sentiment Strength Detection for Big Data Applications

89

part-of-speech (POS) tags and learns POS-level representations for a document by convolving grouped word vectors, enriching features for sentiment strength prediction. We conduct experiments on six human-coded datasets in which internal testing and external testing are included. We also compare the performance with other baselines. Experiment results validate the effectiveness of the proposed HCNN for sentimental strength prediction. Particularly, our model can achieve comparable accuracy with that of previous classiﬁcation systems and outperform baseline methods over correlation metrics. Acknowledgment The authors are thankful to the reviewers, Huijun Chen, and Xin Li for their constructive comments and valuable feedback and suggestions on this chapter. This research was supported by the National Natural Science Foundation of China (61502545), the Internal Research Grant (RG 92/2017-2018R) of The Education University of Hong Kong, and a grant from Research Grants Council of Hong Kong Special Administrative Region, China (UGC/FDS11/E03/16).

References 1. Chen, H., Chiang, R.H.L., Storey, V.C.: Business intelligence and analytics: from big data to big impact. MIS Q. 36(4), 1165–1188 (2012) 2. Lim, E.-P., Chen, H., Chen, G.: Business intelligence and analytics: research directions. ACM Trans. Manage. Inf. Syst. 3(4), 17–27 (2013) 3. Cambria, E.: Affective computing and sentiment analysis. IEEE Intell. Syst. 31(2), 102–107 (2016) 4. Ravi, K., Ravi, V.: A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl.-Based Syst. 89, 14–46 (2015) 5. Poria, S., Gelbukh, A., Hussain, A., Howard, N., Das, D., Bandyopadhyay, S.: Enhanced SenticNet with affective labels for concept-based opinion mining. IEEE Intell. Syst. 28(2), 31–38 (2013) 6. Poria, S., Cambria, E., Gelbukh, A.: Aspect extraction for opinion mining with a deep convolutional neural network. Knowl.-Based Syst. 108, 42–49 (2016) 7. Katz, G., Ofek, N., Shapira, B.: ConSent: context-based sentiment analysis. Knowl.-Based Syst. 84, 162–178 (2015) 8. Weichselbraun, A., Gindl, S., Scharl, A.: Enriching semantic knowledge bases for opinion miningin big data applications. Knowl.-Based Syst. 69, 78–85 (2014) 9. Bravo-Marquez, F., Mendoza, M., Poblete, B.: Meta-level sentiment models for big social data analysis. Knowl.-Based Syst. 69, 86–99 (2014) 10. Cambria, E., Schuller, B., Xia, Y., Havasi, C.: New avenues in opinion mining and sentiment analysis. IEEE Intell. Syst. 28(2), 15–21 (2013) 11. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment classiﬁcation using machine learning techniques. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 79–96 (2002) 12. Kim, K., Lee, J.: Sentiment visualization and classiﬁcation via semi-supervised nonlinear dimensionality reduction. Pattern Recognit. 47(2), 758–768 (2014) 13. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003) 14. 
Mitchell, J., Lapata, M.: Composition in distributional models of semantics. Cognit. Sci. 34(8), 1388–1429 (2010) 15. Turney, P.D.: Domain and function: a dual-space model of semantic relations and compositions. J. Artif. Intell. Res. 44(1), 533–585 (2012)

90

Y. Rao et al.

16. Clarke, D.: A context-theoretic framework for compositionality in distributional semantics. Comput. Linguist. 38(1), 41–71 (2012) 17. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems (NIPS), pp. 3111–3119 (2013) 18. Pennington, J., Socher, R., Manning, C.: Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014) 19. Majumder, N., Poria, S., Gelbukh, A., Cambria, E.: Deep learning-based document modeling for personality detection from text. IEEE Intell. Syst. 32(2), 74–79 (2017) 20. Kim, Y.: Convolutional neural networks for sentence classiﬁcation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751 (2014) 21. Severyn, A., Moschitti, A.: Twitter sentiment analysis with deep convolutional neural networks. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 959–962 (2015) 22. Ren, Y., Zhang, Y., Zhang, M., Ji, D.: Context-sensitive twitter sentiment classiﬁcation using neural network. In: Proceedings of the 30th AAAI Conference on Artiﬁcial Intelligence (AAAI), pp. 215–221 (2016) 23. Lei, T., Barzilay, R., Jaakkola, T.: Molding CNNs for text: non-linear, non-consecutive convolutions. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1565–1575 (2015) 24. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentiment strength detection in short informal text. J. Am. Soc. Inf. Sci. Technol. 61(12), 2544–2558 (2010) 25. Thelwall, M., Buckley, K., Paltoglou, G.: Sentiment strength detection for the social web. J. Am. Soc. Inf. Sci. Technol. 63(1), 163–173 (2012) 26. 
dos Santos, C.N., Gatti, M.: Deep convolutional neural networks for sentiment analysis of short texts. In: Proceedings of the 25th International Conference on Computational Linguistics (COLING), pp. 69–78 (2014) 27. Rosas, V.P., Mihalcea, R., Morency, L.-P.: Multimodal sentiment analysis of Spanish online videos. IEEE Intell. Syst. 28(3), 38–45 (2013) 28. Zadeh, A., Zellers, R., Pincus, E., Morency, L.P.: Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages. IEEE Intell. Syst. 31(6), 82–88 (2016) 29. Zadeh, A., Chen, M., Poria, S., Cambria, E., Morency, L.P.: Tensor fusion network for multimodal sentiment analysis. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1114–1125 (2017) 30. Dragoni, M., Poria, S., Cambria, E.: OntoSenticNet: a commonsense ontology for sentiment analysis. IEEE Intell. Syst. 33, 2 (2018) 31. Zhang, S., Zhang, S., Huang, T., Gao, W., Tian, Q.: Learning affective features with a hybrid deep model for audio-visual emotion recognition. IEEE Trans. Circuits Syst. Video Technol. (2017). https://doi.org/10.1109/TCSVT.2017.2719043 32. Ma, Y., Peng, H., Cambria, E.: Targeted aspect-based sentiment analysis via embedding commonsense knowledge into an attentive LSTM. In: Proceedings of the 32nd AAAI Conference on Artiﬁcial Intelligence (AAAI) (2018) 33. Cambria, E., Poria, S., Gelbukh, A., Thelwall, M.: Sentiment analysis is a big suitcase. IEEE Intell. Syst. 32(6), 74–80 (2017) 34. Chen, H., Li, X., Rao, Y., Xie, H., Wang, F.L., Wong, T.-L.: Sentiment strength prediction using auxiliary features. In: Proceedings of the 26th International World Wide Web Conference (WWW), Companion volume, pp. 5–14 (2017) 35. Strapparava, Mihalcea, R.: Semeval-2007 task 14: Affective text. In: Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval), pp. 70–74 (2007) 36. 
Bao, S., Xu, S., Zhang, L., Yan, R., Su, Z., Han, D., Yu, Y.: Mining social emotions from affective text. IEEE Trans. Knowl. Data Eng. 24(9), 1658–1670 (2012)

4 Hybrid Feature-Based Sentiment Strength Detection for Big Data Applications

91

37. Rao, Y., Pang, J., Xie, H., Liu, A., Wong, T.-L., Li, Q., Wang, F.L.: Supervised intensive topic models for emotion detection over short text. In: Proceedings of 22nd International Conference on Database Systems for Advanced Applications (DASFAA), pp. 408–422 (2017) 38. Rao, Y.: Contextual sentiment topic model for adaptive social emotion classiﬁcation. IEEE Intell. Syst. 31(1), 41–47 (2016) 39. Collobert, R., Weston, J.: A uniﬁed architecture for natural language processing: deep neural networks with multitask learning. In: Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 160–167 (2008) 40. Chen, X., Xu, L., Liu, Z., Sun, M., Luan, H.: Joint learning of character and word embeddings. In: Proceedings of the 24th International Joint Conference on Artiﬁcial Intelligence (IJCAI), pp. 1236–1242 (2015) 41. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efﬁcient estimation of word representations in vector space. CoRR, abs/1301.3781 (2013) 42. Kalchbrenner, N., Grefenstette, E., Blunsom, P.: A convolutional neural network for modeling sentences. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 655–665 (2014) 43. Hatzivassiloglou, V., WIebe, J.: Effects of adjective orientation and gradability on sentence subjectivity. In: Proceedings of the 18th International Conference on Computational Linguistics (COLING), pp. 299–305 (2000) 44. Riloff, E., Wiebe, J., Wilson, T.: Learning subjective nouns using extraction pattern bootstrapping. In: Proceedings of the 7th Conference on Natural Language Learning at HLT-NAACL (CoNLL), pp. 25–32 (2003) 45. Murphy, K.P.: Machine Learning: A Probabilistic Perspective, pp. 27–71. MIT Press, Cambridge (2012) 46. Zeiler, M.D.: ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701 (2012) 47. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997) 48. 
Mohammad, S.M., Kiritchenko, S., Zhu, X.: NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. CoRR, abs/1308.6242 (2013)

Part III

Unsupervised Learning Strategies for Big Multimodal Data

Chapter 5

Multimodal Co-clustering Analysis of Big Data Based on Matrix and Tensor Decomposition

Hongya Zhao, Zhenghong Wei, and Hong Yan

H. Zhao (*)
Industrial Central, Shenzhen Polytechnic, Shenzhen, China
Department of Electronic Engineering, City University of Hong Kong, Kowloon Tong, Hong Kong
e-mail: [email protected]

Z. Wei
Department of Statistics, Shenzhen University, Shenzhen, China

H. Yan
Department of Electronic Engineering, City University of Hong Kong, Kowloon Tong, Hong Kong
e-mail: [email protected]

© Springer Nature Switzerland AG 2019
K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_5

Abstract In this chapter, we first give an overview of co-clustering based on matrix/tensor decomposition, with which the effective signals and noise can be separately filtered. A systematic framework is proposed to perform co-clustering for multimodal data. Based on tensor decomposition, the framework can successfully identify co-clusters with hyperplanar patterns in vector spaces of factor matrices. According to the co-clustering framework, we develop an alternative algorithm to perform tensor decomposition with the full rank constraint on slice-wise matrices (SFRF). Instead of the commonly used orthogonal or nonnegative constraint, the relaxed condition makes the resolved profiles stable with respect to model dimensionality in multimodal data. The algorithm keeps a high convergence rate and greatly reduces computational complexity with the factorization technology. The synthetic and experimental results show the favorable performance of the proposed multimodal co-clustering algorithms.

5.1

Introduction

Information about a phenomenon or a system of interest can be obtained from multiple types of sources. With recent technological advances, massive amounts of multimodal data are generated. For instance, in neuroscience, spatiotemporal images of the neuronal activity within the brain can be measured using different techniques, e.g., electroencephalography (EEG), magnetoencephalography (MEG), and functional magnetic resonance imaging (fMRI). The measurements from such platforms are highly complementary in understanding brain functionality, and their joint analysis has the potential to enhance knowledge discovery. However, the new degrees of freedom introduced by multiple sources raise a number of questions in multimodal data analysis beyond those related to exploiting each dataset separately [1–3].

Clustering analysis is a fundamental tool in statistics, machine learning, and exploratory data analysis. As an unsupervised machine learning technique, clustering can discover global patterns of datasets, and it has been used by researchers in many disciplines [4, 5]. However, this process has several limitations, such as the adoption of a global similarity and the selection of a representative for each group. In contrast, more and more applications are interested in “localized” grouping [6–8]. Two representative examples are the identification of genes that are co-expressed under certain conditions in gene expression analysis and the market segmentation of customer groups that are characterized by certain product features [7, 9–11]. Such local patterns may be the key to uncovering many latent structures that are not apparent otherwise. Therefore, it is highly desirable to move beyond the clustering paradigm and to develop robust approaches for the detection and analysis of these local patterns in Big data [12].

Inspired by the concept of direct clustering of a two-dimensional (2D) data matrix [13], bi-clustering was proposed to perform simultaneous clustering on the row and column dimensions. Consequently, a subset of rows exhibiting significant coherence within a subset of columns in the matrix can be extracted.
These coherent rows and columns are accordingly regarded as a bi-cluster, which corresponds to a specific coherent pattern. Practically, bi-clustering is quite challenging, especially for Big datasets [6–9]. Cheng and Church developed an efficient node-deletion algorithm (CC) to find valuable submatrices in high-dimensional gene expression data, based on mean squared residue scores [14]. Since then, many different bi-clustering algorithms have been developed. Currently, there exists a diverse spectrum of bi-clustering tools that follow different strategies and algorithmic concepts [15–20]. Several comprehensive reviews on bi-clustering can be found in [6–10].

From the theoretical viewpoint, Busygin et al. (2008) pointed out that singular value decomposition (SVD) represents a capable tool for identifying bi-clusters [7]. Thus, SVD-based methods play an important role and have been broadly applied to detect significant bi-clusters. Representative methods include sparse SVD (SSVD) [21], regularized SVD (RSVD) [22], robust regularized SVD (RobRSVD) [23], nonnegative matrix factorization (NMF) [24], and non-smooth NMF (nsNMF) [25]. These factorization-based algorithms use matrix decomposition techniques and detect bi-clusters based on linear mappings in factor spaces.

As high-order data, a tensor is a multi-way extension of a matrix. It provides a natural representation of multimodal data with multiple dimensions [26]. For example, many image and video data are naturally tensor objects, such as color images [(row, column, color)]. In microarray analysis, many experiments were designed to identify gene expression patterns under different conditions across time [(gene, condition, time)]. Similarly, DNA microarray data can be integrated from different studies as a multimodal tensor [27]. Consequently, tensor analysis frequently occurs in clustering-related studies, demanding effective techniques that can deal with such datasets and identify useful co-clusters in them [26–30].

The bi-clustering concept is readily generalized to tensors and referred to as multimodal co-clustering. However, there are few papers dealing with multimodal co-clustering in comparison with the large number of bi-clustering analyses [28]. How to generalize the bi-clustering framework to multimodal Big data remains unclear. This is in part because the algebraic properties of multimodal data are very different from those of a two-way data matrix. Therefore, applying pattern recognition or machine-learning methods directly to such data spaces can result in high computation and memory requirements [29–32].

Considering the significance of matrix decomposition in bi-clustering, it is natural to resort to multilinear tensor decomposition in multimodal co-clustering [27–32]. In theory, tensor factorizations, such as high-order singular value decomposition (HOSVD), canonical decomposition (CP), and nonnegative tensor factorization (NTF), can extend the matrix view to multiple modalities and support dimensionality reduction methods with factor matrices to identify co-clusters in multimodal Big data [28–35]. It has been proven that some local patterns of multimodal data are embedded in linear relations of the factor matrices of HOSVD [28, 30]. Thus, the hyperplanar patterns can be extracted and successfully support the identification of co-clusters in multimodal data. A number of synthetic and experimental results show the favorable performance of multimodal co-clustering based on hyperplane detection in factor spaces [23, 28, 30, 36]. Long et al.
[37] proposed the collective factorization of related matrices (CFRM) model for co-clustering on multi-type relational data. The optimal approximation in CFRM is obtained via NMF. Another method to detect co-clusters in multimodal data is based on multilinear decomposition with sparse latent factors [28]. In our previous work [30], the multimodal co-clustering algorithm HDSVS (hyperplane detection in singular vector spaces) combines HOSVD with the linear grouping algorithm (LGA) [38, 39]. Huang et al. have also employed HOSVD [40], together with K-means clustering, in their co-clustering method. However, the K-means algorithm can only form clusters around “object” centers in the singular vector spaces, which is mainly related to constant bi-clusters. Comparatively, LGA can find more linear structures (points, lines, planes, and hyperplanes) in the singular vector spaces. These linear structures correspond to other types of co-clusters (constant-row/column, additive, and multiplicative co-clusters) in addition to constant ones in the original data.

HDSVS has several advantages, but it is hard to interpret the latent factors from the core tensor and factor matrices of HOSVD [30]. Alternatively, CP decomposition is useful in many real-world scenarios, such as chemometrics, psychometrics, and EEG, because of the uniqueness and easy interpretation of its factor matrices [41–46].

In this chapter, a systematic framework of multimodal co-clustering based on CP is developed. Instead of the commonly used alternating least squares (ALS) in CP, we develop a novel and fast tensor decomposition strategy, slice-wise full rank (SFRF), which is based on the fact that multi-way data can be viewed as a collection of matrices by slicing. Thus, the factor matrices are optimized with the full rank constraint on the sliced matrices. This strategy provides a reasonable way to avoid the so-called two-factor degeneracy problem, which is difficult to deal with in ALS [41, 47–49]. In the optimization step, a compression procedure based on SVD is used to reduce the dimension and memory consumption. The convergence rate of our decomposition is improved and the computational complexity is greatly reduced. As such, the SFRF-based algorithm provides an effective tool for multimodal co-clustering.

This chapter is organized as follows: We first give a background on bi-clustering in Sect. 5.2. The algebraic expression of bi-clusters is explored in Sect. 5.2.1, and some bi-clustering algorithms based on matrix decomposition are reviewed in Sect. 5.2.2. In Sect. 5.3, we provide a few necessary definitions and notation of multimodal tensors and then present a selection of the most widely used tensor decompositions. In Sect. 5.4, the systematic framework is formulated for multimodal co-clustering. The definitions of co-clusters are also given. Then two co-clustering algorithms, based on HOSVD and CP, are developed for multimodal data in Sects. 5.4.1 and 5.4.2, respectively. In Sect. 5.6, the experimental and synthetic data are used to compare and validate the performance of the proposed algorithms. Finally, the conclusion and discussion are given in Sect. 5.7.
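The view of multi-way data as a collection of matrices obtained by slicing can be illustrated with a small NumPy sketch. It shows only the slicing and unfolding of a three-way array and is not the SFRF algorithm; the (gene, condition, time) sizes and the rank-one construction are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for a (gene, condition, time) tensor.
n_genes, n_conditions, n_times = 4, 3, 2
a = rng.normal(size=n_genes)
b = rng.normal(size=n_conditions)
c = rng.normal(size=n_times)

# Rank-one three-way tensor: T[i, j, k] = a[i] * b[j] * c[k].
T = np.einsum("i,j,k->ijk", a, b, c)

# Slicing along the third mode gives one (gene x condition) matrix per time point.
frontal_slices = [T[:, :, k] for k in range(n_times)]

# Placing the slices side by side gives an unfolding of shape (n_genes, n_conditions * n_times).
unfolded = np.concatenate(frontal_slices, axis=1)

slice_ranks = [int(np.linalg.matrix_rank(S)) for S in frontal_slices]
```

Every slice of this rank-one tensor is a scaled copy of the same outer product of `a` and `b`, so each slice has matrix rank one. Structure shared across slices in this way is what a decomposition over sliced matrices can exploit.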

5.2

Background

In this section, some background on co-clustering in a 2D matrix (bi-clustering) is provided. We use standard mathematical notation. Scalars, vectors, matrices, and tensors are denoted, e.g., as $a$, $\mathbf{a}$, $\mathbf{A}$, and $\mathcal{H}$, respectively. $(\cdot)^T$ denotes transpose or conjugate transpose, where the exact interpretation should be understood from the context.

5.2.1

Bi-clusters in Matrices

Let a dataset of $M$ samples and $N$ features be organized as a rectangular matrix $\mathbf{A} = (a_{ij})_{M \times N}$, where $a_{ij}$ is the value of the $i$th sample in the $j$th feature. Denoting the row and column indices of $\mathbf{A}$ as $R = \{1, 2, \ldots, M\}$ and $C = \{1, 2, \ldots, N\}$, we have $\mathbf{A} = (R, C) \in \mathbb{R}^{M \times N}$. Bi-clusters are represented in the literature in different ways [6–8]. Generally, a bi-cluster is defined as a submatrix $\mathbf{B} = (X, Y) \in \mathbb{R}^{I \times J}$, where $X = \{i_1, i_2, \ldots, i_I\} \subseteq R$ and $Y = \{j_1, j_2, \ldots, j_J\} \subseteq C$ are separate subsets of $R$ and $C$. At present, most of the existing bi-clustering techniques search for the following patterns or some correlated coherent evolution:

(a) Bi-clusters with constant values, i.e., $\{a_{ij} = \mu \mid i \in X, j \in Y\}$
(b) Bi-clusters with constant rows or columns, i.e., $\{a_{ij} = \mu + \alpha_i \mid i \in X, j \in Y\}$ or $\{a_{ij} = \mu + \beta_j \mid i \in X, j \in Y\}$
(c) Bi-clusters with additive model, i.e., $\{a_{ij} = \mu + \alpha_i + \beta_j \mid i \in X, j \in Y\}$
(d) Bi-clusters with multiplicative model, i.e., $\{a_{ij} = \mu \alpha_i \beta_j \mid i \in X, j \in Y\}$

Most techniques permute the original matrix and optimize a scoring function. A commonly used scoring function to evaluate constant bi-clusters is the sum of squares in Eq. (5.1):

$$\mathrm{SSQ}(\mathbf{B}) = \frac{1}{|X||Y|} \sum_{i \in X,\, j \in Y} \left(a_{ij} - \bar{B}\right)^2 \tag{5.1}$$

where $\bar{B}$ is the mean of the submatrix $\mathbf{B} = (X, Y)$. Another one is the following mean squared residue score used by Cheng and Church [14]:

$$\mathrm{MSR}(\mathbf{B}) = \frac{1}{|X||Y|} \sum_{i \in X,\, j \in Y} \left(a_{ij} - a_{iY} - a_{Xj} + \bar{B}\right)^2 \tag{5.2}$$

where $a_{iY}$ and $a_{Xj}$ are the means of the $i$th row and $j$th column of $\mathbf{B}$, respectively. $\mathbf{B} = (X, Y)$ can be defined as a δ-bi-cluster if $\mathrm{MSR}(\mathbf{B}) \le \delta$, where $\delta$ ($\delta > 0$) is a pre-specified residue score threshold value. MSR can be used to detect the bi-clusters of types (a), (b), and (c), but not type (d). A more general score was proposed in [36]. This merit function, as expressed in Eq. (5.3), is derived from Pearson’s correlation:

$$\mathrm{CS}(\mathbf{B}) = \min_{i \in X,\, j \in Y} \left(\mathrm{CS}_{Xj}, \mathrm{CS}_{iY}\right) \tag{5.3}$$

where the two terms are calculated as

$$\mathrm{CS}_{Xj} = 1 - \frac{1}{|Y| - 1} \sum_{k \in Y,\, k \ne j} \left|\rho\left(\mathbf{a}_{Xj}, \mathbf{a}_{Xk}\right)\right| \quad \text{and} \quad \mathrm{CS}_{iY} = 1 - \frac{1}{|X| - 1} \sum_{k \in X,\, k \ne i} \left|\rho\left(\mathbf{a}_{iY}, \mathbf{a}_{kY}\right)\right|.$$

Here $\mathbf{a}_{Xj}$ and $\mathbf{a}_{iY}$ denote the $j$th column vector and the $i$th row vector of $\mathbf{B}$. Pearson’s correlation of two vectors $\mathbf{x}$ and $\mathbf{y}$ is defined as

$$\rho(\mathbf{x}, \mathbf{y}) = \frac{\sum_{t} (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_{t} (x_t - \bar{x})^2}\, \sqrt{\sum_{t} (y_t - \bar{y})^2}}.$$

Normally, a lower score $\mathrm{CS}(\mathbf{B})$ represents a strong coherence among the involved rows and columns [30]. Similarly, we can define a δ-corbicluster if $\mathrm{CS}(\mathbf{B}) \le \delta$ ($\delta > 0$).

In contrast to the previous optimization-based approaches, a novel geometric perspective for the bi-clustering problem is proposed in [50–52]. Based on the spatial interpretation, bi-clusters become hyperplanes in the high-dimensional data space. The geometric viewpoint provides a unified mathematical formulation for the simultaneous detection of different types of bi-clusters (i.e., constant, additive, multiplicative, and mixed additive and multiplicative bi-clusters) and allows bi-clustering to be done with generic plane detection algorithms.
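The behaviour of the two scores on the bi-cluster types (a) to (d) can be checked numerically. The following NumPy sketch is an illustration under stated assumptions: the function names `msr` and `correlation_score`, the values of `mu`, `alpha`, and `beta`, and the reading of the correlation score as one minus the mean absolute Pearson correlation with the other rows (or columns) are choices made here, not code from the cited works.

```python
import numpy as np


def msr(B):
    """Mean squared residue of Eq. (5.2)."""
    B = np.asarray(B, dtype=float)
    row_means = B.mean(axis=1, keepdims=True)   # a_iY
    col_means = B.mean(axis=0, keepdims=True)   # a_Xj
    residue = B - row_means - col_means + B.mean()
    return float(np.mean(residue ** 2))


def correlation_score(B):
    """Correlation score of Eq. (5.3); assumes no row or column of B is constant."""
    B = np.asarray(B, dtype=float)

    def one_minus_mean_abs_corr(vectors):
        corr = np.abs(np.corrcoef(vectors))     # pairwise |rho| between the rows of `vectors`
        np.fill_diagonal(corr, 0.0)             # exclude each vector's correlation with itself
        return 1.0 - corr.sum(axis=1) / (corr.shape[0] - 1)

    cs_rows = one_minus_mean_abs_corr(B)        # CS_iY for every row i
    cs_cols = one_minus_mean_abs_corr(B.T)      # CS_Xj for every column j
    return float(min(cs_rows.min(), cs_cols.min()))


mu = 2.0
alpha = np.array([0.5, 1.0, 1.5, 2.0])[:, None]   # hypothetical row effects
beta = np.array([1.0, 2.0, 3.0])[None, :]         # hypothetical column effects

constant = np.full((4, 3), mu)                    # type (a)
constant_rows = mu + alpha + 0.0 * beta           # type (b)
additive = mu + alpha + beta                      # type (c)
multiplicative = mu * alpha * beta                # type (d)
```

In this toy setting, `msr` vanishes for the constant, constant-row, and additive patterns but not for the multiplicative one, whereas `correlation_score` is numerically zero for both the additive and the multiplicative pattern. This is consistent with the statement that MSR covers types (a) to (c) only and that the correlation-based score is more general.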

5.2.2

Bi-clustering Analysis Based on Matrix Decomposition

In bi-clustering analysis, matrix decomposition, especially singular value decomposition (SVD), nonnegative matrix factorization (NMF), and binary matrix factorization (BMF), plays a key role and has been widely applied to detect significant bi-clusters. Representative methods include sparse SVD (SSVD), regularized SVD (RSVD), robust regularized SVD (RobRSVD), NMF [24], spectral bi-clustering (SB), the iterative signature algorithm (ISA), and pattern-based bi-clustering (BicPAM) [6–11, 15–25, 50–52]. The matrix decomposition-based bi-clustering algorithms depend on the strong relations between factor matrices and bi-clusters. Let $\mathbf{A} = (a_{ij})_{M \times N}$ be a matrix embedded with $p$ bi-clusters; we can then formulate bi-clustering as sparse matrix factorization. Each bi-cluster is represented as the outer product of two sparse vectors, which gives the two-factor model

$$\mathbf{A} = \sum_{i=1}^{p} \boldsymbol{\gamma}_i \mathbf{z}_i^T + \text{noise} = \boldsymbol{\Upsilon} \mathbf{Z}^T + \boldsymbol{\Gamma} \tag{5.4}$$

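where $\boldsymbol{\Upsilon} = [\boldsymbol{\gamma}_1\ \boldsymbol{\gamma}_2 \ldots \boldsymbol{\gamma}_p] \in \mathbb{R}^{M \times p}$, $\mathbf{Z} = [\mathbf{z}_1\ \mathbf{z}_2 \ldots \mathbf{z}_p] \in \mathbb{R}^{N \times p}$, and $\boldsymbol{\Gamma} \in \mathbb{R}^{M \times N}$ is additive noise. At first, the well-known SVD is used for the matrix factorization in Eq. (5.4). SVD is expressed as the product of three matrices as follows:

$$\mathbf{A} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{V}^T = \sum_{i=1}^{r} \lambda_i \mathbf{u}_i \mathbf{v}_i^T \tag{5.5}$$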

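where $r$ is the rank of $\mathbf{A}$, $\mathbf{U} = [\mathbf{u}_1\ \mathbf{u}_2 \ldots \mathbf{u}_r]$ is an $M \times r$ matrix of orthonormal left singular vectors, $\mathbf{V} = [\mathbf{v}_1\ \mathbf{v}_2 \ldots \mathbf{v}_r]$ is an $N \times r$ matrix of orthonormal right singular vectors, and $\boldsymbol{\Lambda} = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_r)$ is an $r \times r$ diagonal matrix with positive singular values $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r$. SVD decomposes $\mathbf{A}$ into a summation of rank-one matrices $\lambda_i \mathbf{u}_i \mathbf{v}_i^T$, each of which is called an SVD layer. Normally, the SVD layers corresponding to large $\lambda_i$ values are regarded as effective signals, while the rest are considered as noise. In Big data analysis, it is effective to focus on only some of the SVD layers to filter out noise and reduce the computing time [15]. Based on the effective SVD layers, a rank-$l$ ($l < r$) approximation of $\mathbf{A}$ can be derived by minimizing the squared Frobenius norm,

$$\mathbf{A} \approx \mathbf{A}^{(l)} = \sum_{i=1}^{l} \lambda_i \mathbf{u}_i \mathbf{v}_i^T = \operatorname*{argmin}_{\mathrm{rank}(\mathbf{A}^{*}) = l} \left\| \mathbf{A} - \mathbf{A}^{*} \right\|_F^2 \tag{5.6}$$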
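The rank-$l$ approximation of Eq. (5.6) can be sketched in a few lines of NumPy. The synthetic matrix, the noise level, and the helper name `rank_l_approximation` are assumptions made for this illustration only.

```python
import numpy as np


def rank_l_approximation(A, l):
    """Keep the l leading SVD layers of A, as in Eq. (5.6)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_l = (U[:, :l] * s[:l]) @ Vt[:l, :]
    return A_l, s


rng = np.random.default_rng(0)
# Hypothetical data: two rank-one "signal" layers plus small Gaussian noise.
signal = (np.outer(rng.normal(size=30), rng.normal(size=20))
          + np.outer(rng.normal(size=30), rng.normal(size=20)))
A = signal + 0.01 * rng.normal(size=(30, 20))

A_2, singular_values = rank_l_approximation(A, l=2)
squared_error = np.linalg.norm(A - A_2, "fro") ** 2
```

By the Eckart–Young theorem, the squared Frobenius error of this truncation equals the sum of the squared discarded singular values, so `squared_error` agrees with `np.sum(singular_values[2:] ** 2)` up to rounding. Here the discarded layers carry mostly the added noise.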

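In recent years, NMF has been popularized as a low-rank approximation of matrix $\mathbf{A}$ with the constraint that the data matrix and factorizing matrices are nonnegative [18]. NMF is characterized by the following factorization:

$$\mathbf{A} \approx \mathbf{A}^{(l)} = \mathbf{W}\mathbf{H} = \operatorname*{argmin}_{\mathbf{W}, \mathbf{H} \ge 0} \left\| \mathbf{A} - \mathbf{W}\mathbf{H} \right\|_F^2 \tag{5.7}$$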

where $\mathbf{A}, \mathbf{W}, \mathbf{H} \ge 0$, $\mathbf{W} \in \mathbb{R}^{M \times l}$, $\mathbf{H} \in \mathbb{R}^{l \times N}$, and $l$ is the number of components (the rank) with which the data matrix is represented. Optimization algorithms for this factorization have been developed in many applications. These two typical matrix factorizations are widely used in bi-clustering [7, 21, 25]. This technique can be extended to co-clustering in multimodal Big data, which will be discussed in the next section.

The decomposition-based bi-clustering makes full use of linear mappings between bi-clusters and factor vectors. For example, in SVD-based bi-clustering analysis, it is proved that a bi-cluster corresponding to a low-rank matrix can be represented as an intersection of hyperplanes in singular vector spaces [7, 15, 30]. A similar conclusion is employed in nsNMF [25]. The orthogonal or nonnegative constraints on matrix decomposition play an important role in algebraic theory, but these constraints have no effect on bi-clustering based on hyperplane detection in factor spaces. On the other hand, they increase the computational complexity. It is proved in the following part that in some cases the orthogonality or nonnegativity in matrix decomposition is unnecessary to detect significant bi-clusters. The constraint on the factor matrices can be relaxed to full column rank instead.

Proposition 1 Let $\mathbf{A}$ be an $M \times N$ nonzero matrix with $\mathrm{rank}(\mathbf{A}) = r \le \min(M, N)$. Then it can be expressed as a product $\mathbf{A} = \boldsymbol{\Upsilon} \mathbf{Z}^T$, where $\boldsymbol{\Upsilon} \in \mathbb{R}^{M \times r}$ and $\mathbf{Z}^T \in \mathbb{R}^{r \times N}$ have full rank, i.e., $\mathrm{rank}(\boldsymbol{\Upsilon}) = \mathrm{rank}(\mathbf{Z}^T) = r$. This form of decomposition is called full rank factorization.

Proof Denote $\mathbf{A} = [\mathbf{a}_1, \ldots, \mathbf{a}_N]$ as $N$ column vectors and $\mathcal{L}(\mathbf{A})$ as the column space they span. Because $\mathrm{rank}(\mathbf{A}) = r$, a basis for $\mathcal{L}(\mathbf{A})$ can be expressed as $\boldsymbol{\Upsilon} = [\boldsymbol{\gamma}_1\ \boldsymbol{\gamma}_2 \ldots \boldsymbol{\gamma}_r]_{M \times r}$. Because $\mathcal{L}(\mathbf{A}) = \mathcal{L}(\boldsymbol{\Upsilon})$, every vector of $\mathcal{L}(\mathbf{A})$ can be expressed as a linear combination of $\boldsymbol{\gamma}_i$ ($i = 1, \ldots, r$). In particular, there exist $N$ $r$-dimensional vectors $\mathbf{z}_i$ satisfying $\mathbf{a}_i = \boldsymbol{\Upsilon} \mathbf{z}_i$ ($i = 1, \ldots, N$). If $\mathbf{Z}^T = [\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_N]_{r \times N}$, $\mathbf{A}$ can be written as $\mathbf{A} = \boldsymbol{\Upsilon} \mathbf{Z}^T$.
Because $r = \mathrm{rank}(\mathbf{A}) = \mathrm{rank}(\boldsymbol{\Upsilon} \mathbf{Z}^T) \le \mathrm{rank}(\mathbf{Z}^T) \le r$, we have $\mathrm{rank}(\mathbf{Z}^T) = r$. Thus, both $\boldsymbol{\Upsilon}$ and $\mathbf{Z}^T$ have full rank.

The existence of a full rank factorization of any nonzero matrix $\mathbf{A}$ is proven in Proposition 1. Different from the “almost uniqueness” of SVD up to orders, the full rank factorization is not unique. If $\mathbf{A} = \boldsymbol{\Upsilon} \mathbf{Z}^T$ is a full rank factorization, then any other full rank factorization can be written in the form $\mathbf{A} = (\boldsymbol{\Upsilon} \mathbf{W}^{-1})(\mathbf{Z} \mathbf{W}^T)^T$, where $\mathbf{W} \in \mathbb{R}^{r \times r}$ is a nonsingular matrix. Although the full rank decomposition is mathematically very simple, it can be an amazingly appropriate tool to detect bi-clusters, as stated in the following Propositions 2 and 3. The submatrices with coherent patterns are still embedded in the compressed factor matrices of the full rank decomposition.

Proposition 2 Let $\mathbf{A} = \boldsymbol{\Upsilon} \mathbf{Z}^T \in \mathbb{R}^{M \times N}$ be the full rank decomposition with $\mathrm{rank}(\mathbf{A}) = r$. Let $\tilde{\mathbf{B}} = \boldsymbol{\Upsilon}_l \mathbf{Z}_l^T$ be a compressed matrix with $\mathrm{rank}(\tilde{\mathbf{B}}) = l \le r$, where $\boldsymbol{\Upsilon} = [\boldsymbol{\Upsilon}_l\ \boldsymbol{\Upsilon}_{r-l}]_{M \times r}$ and $\mathbf{Z} = [\mathbf{Z}_l\ \mathbf{Z}_{r-l}]_{N \times r}$ are full rank matrices with $\mathrm{rank}(\boldsymbol{\Upsilon}) = \mathrm{rank}(\mathbf{Z}) = r$. If two column vectors of $\mathbf{A}$ satisfy $\mathbf{a}_i = k \mathbf{a}_j$ ($k \ne 0$), then the $i$th and $j$th columns of $\mathbf{Z}_l^T$ are also multiplicative, $\mathbf{z}_i = k \mathbf{z}_j$.

H. Zhao et al.

Proof According to the full rank decompositions of $A$ and $\tilde{B}$, we denote $\Upsilon = [\Upsilon_l\ \Upsilon_{r-l}]$ and $Z = [Z_l\ Z_{r-l}]$ with $Z_l^T = (z_1 \dots z_N)_{l \times N}$ and $Z_{r-l}^T = (\bar{z}_1 \dots \bar{z}_N)_{(r-l) \times N}$. Then we have

$$A = \Upsilon Z^T = [\Upsilon_l\ \Upsilon_{r-l}] \begin{bmatrix} Z_l^T \\ Z_{r-l}^T \end{bmatrix}.$$

Considering that two columns of $A$ satisfy $a_i = k a_j$ $(k \ne 0)$, we have

$$a_i - k a_j = \Upsilon \begin{bmatrix} z_i \\ \bar{z}_i \end{bmatrix} - k \Upsilon \begin{bmatrix} z_j \\ \bar{z}_j \end{bmatrix} = \Upsilon \begin{bmatrix} z_i - k z_j \\ \bar{z}_i - k \bar{z}_j \end{bmatrix}.$$

When $\Upsilon$ has full column rank, $a_i - k a_j$ is the zero vector if and only if both $z_i - k z_j$ and $\bar{z}_i - k \bar{z}_j$ are zero vectors. Similarly, let $\Upsilon_l = (\Upsilon_1 \dots \Upsilon_M)^T_{M \times l}$; the same result can be deduced between the corresponding rows of $A$ and $\Upsilon_l$.

As such, we consider the following property of bi-clusters, which can be easily proved based on Propositions 1 and 2.

Proposition 3 Assume that $B = \Upsilon Z^T$ is the full rank decomposition. If the factor matrix $\Upsilon$ or $Z$ is embedded with a type of bi-clusters, then $B$ is a correlated bi-cluster ($\delta$-corbicluster).

It is known that any type of bi-cluster can be considered as a matrix whose rank is no more than two [24]. According to the propositions, it is enough to employ the full rank factor vectors to extract low-rank patterns embedded in a large matrix. In practical applications, more than two bi-clusters can be embedded in a large-scale matrix with noise. Then the rank of the matrix can be larger than two and we need to retain more factor vectors of the compressed decomposition. It is feasible and efficient in theory to identify bi-clusters with generic plane detection in the vector spaces of the full-rank factor matrices $\Upsilon_l$ or $Z_l$ of $\tilde{B} = \Upsilon_l Z_l^T$.

According to the propositions, bi-clustering is transformed into hyperplane detection in the vector spaces of compressed factor matrices. The constraint on matrix decomposition in bi-clustering algorithms can be relaxed to full rank. Some properties, such as uniqueness, orthogonality, and interpretability, are of no use for detecting the hyperplanes in factor spaces. The relaxation to full rank decomposition can filter out the noise, improve the performance, and reduce the computation time. Further, we discuss the generalization of decomposition-based bi-clustering to multimodal data in the next section.
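Propositions 1 and 2 can be checked numerically. The following is a minimal sketch, assuming NumPy and a full rank factorization built from a truncated SVD (one of many possible factorizations, not necessarily the one used by the authors). The helper name `full_rank_factorization`, the matrix sizes, and the factor 2.5 are illustrative assumptions.

```python
import numpy as np

def full_rank_factorization(A, tol=1e-10):
    """Return (Y, Zt) with A = Y @ Zt, where Y is M x r and Zt is r x N, r = rank(A)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    r = int(np.sum(s > tol * s[0]))       # numerical rank
    Y = U[:, :r] * s[:r]                  # full column rank factor (plays the role of Upsilon)
    Zt = Vt[:r, :]                        # full row rank factor (plays the role of Z^T)
    return Y, Zt

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 5))   # 6 x 5 matrix of rank 3
A[:, 4] = 2.5 * A[:, 1]                   # make column 4 a multiple of column 1

Y, Zt = full_rank_factorization(A)

# Non-uniqueness: any nonsingular W gives another full rank factorization.
W = rng.standard_normal((Y.shape[1], Y.shape[1]))
Y2, Zt2 = Y @ np.linalg.inv(W), W @ Zt

# Proposition 2: proportional columns of A give proportional columns of Z^T,
# also after keeping only the first l = 2 rows (the compressed factor Z_l^T).
proportional_full = np.allclose(Zt[:, 4], 2.5 * Zt[:, 1])
proportional_compressed = np.allclose(Zt[:2, 4], 2.5 * Zt[:2, 1])
```

The proportionality survives the change of basis by `W` as well, which illustrates why orthogonality of the factors is not needed for detecting the linear structure.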

5 Multimodal Co-clustering Analysis of Big Data Based on Matrix and. . .

5.3 Co-clustering Analysis in Tensor Data

An $N$-mode tensor $\mathcal{T} = \{a_{t_1 t_2 \cdots t_N};\ t_k = 1, \dots, T_k\} \in \mathbb{R}^{T_1 \times T_2 \times \cdots \times T_N}$ can be roughly defined as a multidimensional array, where $N$ is the number of dimensions, known as way or order. A fiber of tensor $\mathcal{T}$ is defined by fixing every index but one, and a two-dimensional slice is defined by fixing all but two indices. The symbol $\mathcal{T}(S)$ is used to represent the sub-tensor obtained by fixing every index but the indices in $S$, which is a nonempty index subset of $\{1, 2, \dots, N\}$. For example, a 3-mode tensor $\mathcal{T} \in \mathbb{R}^{T_1 \times T_2 \times T_3}$ has column, row, and tube fibers denoted by $\mathcal{T}(1)$, $\mathcal{T}(2)$, and $\mathcal{T}(3)$, and horizontal, lateral, and frontal slice matrices denoted by $\mathcal{T}(2, 3)$, $\mathcal{T}(1, 3)$, and $\mathcal{T}(1, 2)$. Obviously, $\mathcal{T}(1, 2, 3)$ is the full tensor.

Based on the coherent patterns of bi-clusters in a matrix, we attempt to extend the definition to co-clusters in tensor data [30, 53]. For example, assume that one co-cluster $\mathcal{H} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$ is embedded in a three-mode tensor $\mathcal{T} \in \mathbb{R}^{T_1 \times T_2 \times T_3}$, where $\{I_1 \subseteq T_1, I_2 \subseteq T_2, I_3 \subseteq T_3\}$. As the high-order analogues of matrix rows and columns, the structures of fiber and slice are involved in the co-cluster definition. We define the different types of co-clusters $\mathcal{H}$ along fibers and slices instead of rows and columns in a matrix. For example, in a three-mode tensor, the following co-clusters are defined with the corresponding algebraic expressions:

(a) Full constant: $\mathcal{H}(1, 2, 3) = \{a_{i_1 i_2 i_3} = \mu \mid i_1 \in I_1, i_2 \in I_2, i_3 \in I_3\}$;
(b) Horizontal-slice constant: $\mathcal{H}(2, 3) = \{a_{i_1 i_2 i_3} = \mu + \alpha_{i_1} \mid i_2 \in I_2, i_3 \in I_3\}$;
(c) Mode-1 fiber constant: $\mathcal{H}(1) = \{a_{i_1 i_2 i_3} = \mu + \beta_{i_2} + \gamma_{i_3} \mid i_1 \in I_1\}$;
(d) Full additive: $\mathcal{H}(1, 2, 3) = \{a_{i_1 i_2 i_3} = \mu + \alpha_{i_1} + \beta_{i_2} + \gamma_{i_3} \mid i_1 \in I_1, i_2 \in I_2, i_3 \in I_3\}$;
(e) Full multiplicative: $\mathcal{H}(1, 2, 3) = \{a_{i_1 i_2 i_3} = \mu \times \alpha_{i_1} \times \beta_{i_2} \times \gamma_{i_3} \mid i_1 \in I_1, i_2 \in I_2, i_3 \in I_3\}$.

For types (d) and (e), we can also define the additive and multiplicative co-clusters along a slice or fiber by fixing the corresponding indices of the three-mode tensor. Generally, co-clusters of an $N$-mode tensor can be expressed along different modes besides slice or fiber. The algebraic expression of co-clusters is more complicated and the number of co-cluster types is much larger than that of bi-clusters. Given the subset $S$ of $N$ modes, the corresponding types of co-clusters of a tensor can be represented by sub-tensors as follows:

1. Constant co-cluster along mode-$S$:

$$\mathcal{H}(S) = \Big\{ a_{i_1 i_2 \cdots i_N} = \mu + \sum_{n=1}^{N} \omega^{n}_{i_n} I_{\bar{S}}(n) \ \Big|\ i_n \in I_n;\ n = 1, 2, \dots, N \Big\};$$

2. Additive co-cluster along mode-$S$:

$$\mathcal{H}(S) = \Big\{ a_{i_1 i_2 \cdots i_N} = \mu + \sum_{n=1}^{N} \omega^{n}_{i_n} I_{S}(n) \ \Big|\ i_n \in I_n;\ n = 1, 2, \dots, N \Big\};$$

3. Multiplicative co-cluster along mode-$S$:

$$\mathcal{H}(S) = \Big\{ a_{i_1 i_2 \cdots i_N} = \mu \times \prod_{n=1}^{N} \omega^{n}_{i_n} I_{S}(n) \ \Big|\ i_n \in I_n;\ n = 1, 2, \dots, N \Big\}$$

where $\omega^{n}_{i_n} \in \mathbb{R}$ is any value and the indicator function is defined as

$$I_S(n) = \begin{cases} 1 & \text{if } n \in S \\ 0 & \text{if } n \notin S \end{cases}$$

where $\bar{S}$ is the complement of the set $S$.
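The three-mode co-cluster types (a) to (e) can be generated directly with array broadcasting. The sketch below assumes NumPy; the sizes and the uniform parameter range are illustrative choices, not values from the text. It also checks the structural meaning of two of the types: a horizontal-slice constant co-cluster has constant horizontal slices, and a mode-1 fiber constant co-cluster has constant mode-1 fibers.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, n3 = 4, 5, 6                                # |I1|, |I2|, |I3| (illustrative sizes)
mu = 2.0
alpha = rng.uniform(-1, 1, n1)[:, None, None]       # varies along mode 1 only
beta = rng.uniform(-1, 1, n2)[None, :, None]        # varies along mode 2 only
gamma = rng.uniform(-1, 1, n3)[None, None, :]       # varies along mode 3 only
ones = np.ones((n1, n2, n3))

coclusters = {
    "full_constant": mu * ones,
    "horizontal_slice_constant": (mu + alpha) * ones,
    "mode1_fiber_constant": (mu + beta + gamma) * ones,
    "full_additive": mu + alpha + beta + gamma,
    "full_multiplicative": mu * alpha * beta * gamma,
}

# Each horizontal slice H[i1, :, :] of type (b) holds a single value.
H = coclusters["horizontal_slice_constant"]
slices_constant = all(np.allclose(H[i], H[i, 0, 0]) for i in range(n1))

# Each mode-1 fiber F[:, i2, i3] of type (c) holds a single value.
F = coclusters["mode1_fiber_constant"]
fibers_constant = np.allclose(F, F[0])
```

In a synthetic experiment, such a block would typically be written into a sub-array of a larger background tensor at the chosen index sets.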

5.4 Tensor Decomposition

There is a rich variety of tensor decompositions in the literature. In this section, we only introduce the approaches commonly used in co-clustering of multimodal data. For ease of presentation, we only derive the proposed method for three-mode tensors in some cases.

5.4.1 High-Order Singular Vector Decomposition

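Similar to SVD of a matrix, high-order singular vector decomposition (HOSVD) is introduced to compress and decompose tensor data [54]. For example, a three-mode tensor $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$ can be decomposed into a core tensor multiplied by a matrix along each mode:

$$\mathcal{T} \approx \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)} \tag{5.8}$$

where $U^{(1)} \in \mathbb{R}^{I \times T_1}$, $U^{(2)} \in \mathbb{R}^{J \times T_2}$, $U^{(3)} \in \mathbb{R}^{K \times T_3}$ are the factor matrices and $\mathcal{G} \in \mathbb{R}^{T_1 \times T_2 \times T_3}$ is called the core tensor. The $n$-mode product of a tensor $\mathcal{G} \in \mathbb{R}^{T_1 \times \cdots \times T_N}$ with a matrix $U \in \mathbb{R}^{M \times T_n}$ is denoted by $\mathcal{G} \times_n U$ and is a tensor of size $T_1 \times \cdots \times T_{n-1} \times M \times T_{n+1} \times \cdots \times T_N$ with the entries

$$(\mathcal{G} \times_n U)_{t_1 \cdots t_{n-1}\, m\, t_{n+1} \cdots t_N} = \sum_{t_n = 1}^{T_n} \mathcal{G}_{t_1 \cdots t_{n-1} t_n t_{n+1} \cdots t_N}\, U_{m t_n}.$$

Element-wise, the decomposition of Eq. (5.8) can be written as

$$\mathcal{T}_{ijk} = \sum_{t_1 = 1}^{T_1} \sum_{t_2 = 1}^{T_2} \sum_{t_3 = 1}^{T_3} \mathcal{G}_{t_1 t_2 t_3}\, U^{(1)}_{i t_1} U^{(2)}_{j t_2} U^{(3)}_{k t_3} \tag{5.9}$$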

Fig. 5.1 High-order SVD of a three-mode tensor data

The pictorial example is illustrated in Fig. 5.1. The core tensor $\mathcal{G}$ captures interactions between the columns of $U^{(n)}$. In fact, Eq. (5.8) is called the Tucker decomposition, and its special case in which the core tensor is all-orthogonal is known as HOSVD. HOSVD is a convincing generalization of SVD. Similar to the orthogonality of the factor matrices $U$ and $V$ in SVD, there exist orthogonal transformations $U^{(n)}$ such that the core tensor $\mathcal{G}$ in $\mathcal{T} = \mathcal{G} \times_1 U^{(1)} \times_2 U^{(2)} \times_3 U^{(3)}$ is all-orthogonal and ordered [48].

An $N$-mode tensor $\mathcal{T} \in \mathbb{R}^{T_1 \times \cdots \times T_N}$ can be transformed into a matrix by reordering the elements of the tensor. The process is known as matricization, unfolding, or flattening. The mode-$n$ unfolding matrix $\mathcal{T}_{(n)}$ is defined as a matrix of size $T_n \times \prod_{k \ne n} T_k$ whose columns are the mode-$n$ fibers [26]. HOSVD is associated with the SVD of the matrices $\mathcal{T}_{(n)}$ along each mode. The factor matrix $U^{(n)}$ in HOSVD is calculated as the left singular matrix of $\mathcal{T}_{(n)}$ in SVD as follows:

$$\mathcal{T}_{(n)} \approx U^{(n)} \Sigma^{(n)} V^{(n)T}, \quad n = 1, 2, \dots, N. \tag{5.10}$$
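The construction in Eqs. (5.8) to (5.10) can be sketched in a few lines. This is a minimal illustration assuming NumPy, not the authors' implementation; the helper names `unfold`, `mode_n_product`, `hosvd`, and `reconstruct` are hypothetical. The unfolding uses NumPy's row-major column ordering, which differs from the ordering in [26] only by a permutation of columns and therefore yields the same left singular vectors.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: rows are indexed by `mode`, columns are the mode-n fibers."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_n_product(G, U, mode):
    """n-mode product G x_n U: contracts axis `mode` of G with the columns of U."""
    out = np.tensordot(U, G, axes=(1, mode))   # the new axis comes first
    return np.moveaxis(out, 0, mode)           # move it back to position `mode`

def hosvd(T, ranks=None):
    """Factor matrices from the SVD of each unfolding (Eq. 5.10), then the core tensor."""
    ranks = ranks or T.shape
    factors = []
    for n in range(T.ndim):
        U, _, _ = np.linalg.svd(unfold(T, n), full_matrices=False)
        factors.append(U[:, :ranks[n]])
    G = T
    for n, U in enumerate(factors):
        G = mode_n_product(G, U.T, n)          # project onto the retained singular vectors
    return G, factors

def reconstruct(G, factors):
    """Evaluate G x_1 U1 x_2 U2 ... as in Eq. (5.8)."""
    T = G
    for n, U in enumerate(factors):
        T = mode_n_product(T, U, n)
    return T

rng = np.random.default_rng(0)
T = rng.standard_normal((4, 5, 6))
G, factors = hosvd(T)                          # full HOSVD reproduces T
G_trunc, factors_trunc = hosvd(T, ranks=(2, 2, 2))   # truncated version for noise reduction
```

Passing smaller `ranks` keeps only the leading singular vectors per mode, which is the tensor analogue of the truncated SVD discussed in the text.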

When the tensor $\mathcal{T}$ is unfolded to a series of $\mathcal{T}_{(n)}$, a co-cluster embedded in $\mathcal{T}$ can be unfolded to be a bi-cluster in $\mathcal{T}_{(n)}$ of the same type. According to SVD-based bi-clustering, we can find the row indices of $\mathcal{T}_{(n)}$ that contain a bi-cluster using hyperplane detection in the vector spaces of the factors $U^{(n)}$ and $V^{(n)}$ [30]. These row indices correspond to the locations in the tensor along mode-$n$. By combining the indices of bi-clusters in every unfolded matrix $\mathcal{T}_{(n)}$, the co-cluster is identified in the $N$-dimensional space of tensor $\mathcal{T}$. Now the major task is to detect hyperplanes in each singular factor matrix of $\mathcal{T}_{(n)}$. That is, the problem of detecting co-clusters in a multidimensional space has been effectively converted to detecting the linear structures in singular vector spaces. As discussed in Eq. (5.6), the low-rank-$l$ $(l < r)$ approximation $A^{(l)}$ of $A$ can be used to decrease the noise influence and extract the effective signals. Similarly, the idea of the truncated SVD is introduced in tensor data [26, 30, 54]. Let $\mathcal{T} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$; the $n$-rank of $\mathcal{T}$, denoted $\mathrm{rank}_n(\mathcal{T})$, is the column rank of $\mathcal{T}_{(n)}$. Let $r_n = \mathrm{rank}_n(\mathcal{T})$ $(n = 1, \dots, N)$; then we say that $\mathcal{T}$ is a rank-$(r_1, r_2, \dots, r_N)$ tensor. For a given tensor $\mathcal{T}$, we can easily find an exact HOSVD of rank-$(r_1, r_2, \dots, r_N)$ based on the SVD of $\mathcal{T}_{(n)}$ along each mode [30]. Further, the SVD computational

complexity of $A = (a_{ij})_{M \times N}$ is $O(N^2 M)$ and that of Krylov truncated SVD techniques is $O(lNM)$. Based on that of SVD, the computational complexity of a rank-$(r_1, r_2, r_3)$ HOSVD of a three-way tensor $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$ may be reduced to $O\big(2IJK(I + J + K) + 5(r_1^2 JK + I r_2^2 K + IJ r_3^2) - 2(I^3 + J^3 + K^3)/3 - (r_1^2 + r_2^2 + r_3^2)/3\big)$. If $r_i \le F = \min(I, J, K)$, one can simplify the complexity as $O\big(2IJK(I + J + K) - 2(I^3 + J^3 + K^3)/3 + IJKF\big)$ [47–49].

5.4.2 Canonical Polyadic Decomposition

The canonical polyadic (CP) decomposition decomposes a tensor into a sum of component rank-one tensors. For example, we can represent a three-mode tensor $\mathcal{T} \in \mathbb{R}^{I \times J \times K}$ by the trilinear model as follows:

$$\mathcal{T} \approx \sum_{n=1}^{N} a_n \circ b_n \circ c_n \tag{5.11}$$

where $a_n$, $b_n$, $c_n$ are the profile vectors in the three modes, respectively, of the $n$th component $(n = 1, \dots, N)$ and the outer product is given by

$$(a_n \circ b_n \circ c_n)_{ijk} = a_{in} b_{jn} c_{kn} \quad \text{for } i = 1, \dots, I,\ j = 1, \dots, J,\ k = 1, \dots, K. \tag{5.12}$$

The pictorial representation of CP decomposition is shown in Fig. 5.2. The factor matrices of CP in Eq. (5.11) refer to the combination of the vectors from the rank-one components. They are denoted as

$$A = [a_1\ a_2 \dots a_N]_{I \times N}, \quad B = [b_1\ b_2 \dots b_N]_{J \times N}, \quad C = [c_1\ c_2 \dots c_N]_{K \times N}.$$

Using the trilinear model, the tensor $\mathcal{T} \approx [\![A, B, C]\!]$ is sometimes written in terms of slices along each mode:

$$\begin{aligned}
\mathcal{T}_i(2, 3) &= B D_i^{(A)} C^T, & i &= 1, \dots, I \\
\mathcal{T}_j(1, 3) &= A D_j^{(B)} C^T, & j &= 1, \dots, J \\
\mathcal{T}_k(1, 2) &= A D_k^{(C)} B^T, & k &= 1, \dots, K
\end{aligned} \tag{5.13}$$

Fig. 5.2 Canonical polyadic decomposition of a three-mode tensor

where the diagonal matrices

$$D_i^{(A)} = \mathrm{diag}(a_{i1}, \dots, a_{iN}), \quad D_j^{(B)} = \mathrm{diag}(b_{j1}, \dots, b_{jN}), \quad D_k^{(C)} = \mathrm{diag}(c_{k1}, \dots, c_{kN})$$

for all $i$, $j$, and $k$. In theory, any tensor can be factorized exactly as the sum of a finite number of rank-one tensors for sufficiently large $N$, which is referred to as CP decomposition. However, it is NP-hard to compute the exact decomposition with the smallest number of rank-one tensors [26]. In practice, some effective algorithms are proposed with a fixed number of components to seek $a_n$, $b_n$, $c_n$ that satisfy the equality as well as possible. The "workhorse" algorithm for CP is alternating least squares (ALS) [26, 47]. The motivation of ALS is to find $A$, $B$, $C$ by minimizing the following least squares criterion:

$$\min_{A, B, C} \Big\| \mathcal{T} - \sum_{n=1}^{N} a_n \circ b_n \circ c_n \Big\|_F^2 \tag{5.14}$$

where the Frobenius norm is defined as

$$\|\mathcal{T}\|_F = \sqrt{\sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} t_{ijk}^2}.$$

The ALS approach fixes $B$ and $C$ to solve for $A$, then fixes $A$ and $C$ to solve for $B$, then fixes $A$ and $B$ to solve for $C$, and continues to repeat the entire procedure until some convergence criterion is satisfied [26]. In fact, with fixed $B$ and $C$, the optimization problem in Eq. (5.14) is equivalent to

$$\min_{A} \big\| \mathcal{T}_{(1)} - A (C \odot B)^T \big\|_F^2 \tag{5.15}$$

where $\mathcal{T}_{(1)}$ is the mode-1 unfolding matrix and $\odot$ is the matrix Khatri–Rao product. Here $\mathcal{T}_{(1)}$ is just an $I \times JK$ matrix, and $C \odot B$ is a $JK \times N$ matrix. Then the optimal solution of Eq. (5.15) can be easily calculated as

$$A = \mathcal{T}_{(1)} \big[(C \odot B)^T\big]^{\dagger} = \mathcal{T}_{(1)} (C \odot B) \big(C^T C \ast B^T B\big)^{-1}$$

where $\dagger$ is the Moore–Penrose inverse and $\ast$ is the element-wise (Hadamard) matrix product. Similarly, $B$ and $C$ can be updated until some convergence criterion is satisfied. This gives rise to the ALS algorithm for CP decomposition [20]. The general complexity of ALS is computed as $O\big((JK + KI + IJ)(7F^2 + F) + 3FIJK\big)$ [47].
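The alternating updates can be sketched as follows. This is a minimal illustration of ALS for a three-mode CP model, assuming NumPy; the helper names, the random initialization, and the fixed iteration count are assumptions rather than the procedure of [20, 26]. Because the unfolding below uses NumPy's row-major ordering, the Khatri–Rao factors appear in the order $B \odot C$ for mode 1; this differs from Eq. (5.15) only through the unfolding convention.

```python
import numpy as np

def khatri_rao(X, Y):
    """Column-wise Kronecker product; row x * Y.shape[0] + y holds X[x, n] * Y[y, n]."""
    return np.einsum('in,jn->ijn', X, Y).reshape(X.shape[0] * Y.shape[0], -1)

def unfold(T, mode):
    """Row-major mode-n unfolding (rows are indexed by `mode`)."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cp_reconstruct(A, B, C):
    """Sum of rank-one tensors a_n o b_n o c_n, as in Eq. (5.11)."""
    return np.einsum('in,jn,kn->ijk', A, B, C)

def cp_als(T, n_components, n_iter=100, seed=0):
    """Alternating least squares for a three-mode tensor with a fixed iteration count."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((dim, n_components)) for dim in T.shape]
    for _ in range(n_iter):
        for mode in range(3):
            X, Y = [factors[m] for m in range(3) if m != mode]
            gram = (X.T @ X) * (Y.T @ Y)       # Hadamard product of the Gram matrices
            factors[mode] = unfold(T, mode) @ khatri_rao(X, Y) @ np.linalg.pinv(gram)
    return factors

def relative_error(T, factors):
    return np.linalg.norm(T - cp_reconstruct(*factors)) / np.linalg.norm(T)

rng = np.random.default_rng(42)
A0, B0, C0 = (rng.standard_normal((dim, 2)) for dim in (5, 4, 3))
T = cp_reconstruct(A0, B0, C0)                 # exact two-component tensor
err_1 = relative_error(T, cp_als(T, 2, n_iter=1))
err_50 = relative_error(T, cp_als(T, 2, n_iter=50))
```

Each update solves its least squares subproblem exactly, so the fitting error cannot increase from one sweep to the next; convergence to the global minimum is not guaranteed in general.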

In addition to ALS, there exist many variants of CP decomposition with different constraints. For example, nonnegative tensor factorization (NNTF) is proposed with the nonnegative constraint on the tensor and factor matrices [55]:

$$\min_{A, B, C} \Big\| \mathcal{T} - \sum_{n=1}^{N} a_n \circ b_n \circ c_n \Big\|_F^2 \quad \text{subject to } A, B, C \ge 0.$$

Recently in [28], the sparsity on the latent factors is considered as follows:

$$\min_{A, B, C} \Big\| \mathcal{T} - \sum_{n=1}^{N} a_n \circ b_n \circ c_n \Big\|_F^2 + \lambda_a \sum_{i, r} |a_{ir}| + \lambda_b \sum_{j, r} |b_{jr}| + \lambda_c \sum_{k, r} |c_{kr}|$$

The sparsity on the latent factors of CP results in co-clustering. The factor matrices computed by the algorithm are used to extract co-clusters in tensor data [28].

As a viable alternative to reduce computation complexity and resolve profile meaning in some application fields, alternating slice-wise diagonalization (ASD) in [41] utilizes a compression procedure to minimize the following loss function of sliced matrices along one mode instead of Eq. (5.14):

$$\min_{A, B, C} L(A, B, C) = \min_{A, B, C} \sum_{k=1}^{K} \big\| \mathcal{T}_k(1, 2) - A D_k^{(C)} B^T \big\|_F^2. \tag{5.16}$$

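If there exist two matrices $P_{I \times N}$ and $Q_{J \times N}$ that satisfy $P^T A = 1_N$ and $Q^T B = 1_N$, the loss function of Eq. (5.16) is modified as

$$\min_{A, B, C, P, Q} L(A, B, C, P, Q) = \min_{A, B, C, P, Q} \sum_{k=1}^{K} \big\| P^T \mathcal{T}_k(1, 2) Q - D_k^{(C)} \big\|_F^2 + \lambda \Big( \big\| P^T A - 1_N \big\|_F^2 + \big\| Q^T B - 1_N \big\|_F^2 \Big) \tag{5.17}$$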

where $1_N$ is the $N \times N$ identity matrix. The latter two terms are the penalty terms enforcing the existence of $P$ and $Q$, and $\lambda > 0$ is the strength of the penalties. Alternating slice-wise diagonalization (ASD) focuses on the fact that a three-mode tensor can be viewed as a collection of matrices, by slicing. Then it applies a standard compression procedure to the sliced matrices. By this compression, the matrices appearing in the iterative process are greatly simplified, and in fact, the numerical examples in [41] confirm the efficiency of ASD.

Furthermore, we extend the approach to all slicing matrices along every mode. The loss function in Eq. (5.16) can then be expressed as

$$L(A, B, C) = \sum_{i=1}^{I} \big\| \mathcal{T}_i(2, 3) - B D_i^{(A)} C^T \big\|_F^2 + \sum_{j=1}^{J} \big\| \mathcal{T}_j(1, 3) - A D_j^{(B)} C^T \big\|_F^2 + \sum_{k=1}^{K} \big\| \mathcal{T}_k(1, 2) - A D_k^{(C)} B^T \big\|_F^2 \tag{5.18}$$

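Assuming that $A$, $B$, and $C$ have full rank, it is known from Proposition 1 that there exist matrices $P_{I \times N}$, $Q_{J \times N}$, and $R_{K \times N}$ that satisfy

$$P^T A = 1_N, \quad Q^T B = 1_N, \quad R^T C = 1_N. \tag{5.19}$$

Furthermore, if $P$, $Q$, and $R$ are column-wise normalized, we can multiply Eq. (5.13) with $P$, $Q$, or $R$ at two sides respectively and obtain the following expressions:

$$Q^T \mathcal{T}_i(2, 3) R = D_i^{(A)}, \quad P^T \mathcal{T}_j(1, 3) R = D_j^{(B)}, \quad P^T \mathcal{T}_k(1, 2) Q = D_k^{(C)}.$$

Thus, the least squares criterion of Eq. (5.18) can be formulated as

$$\begin{aligned}
\min L(A, B, C, P, Q, R) = \min\ & \sum_{i=1}^{I} \big\| Q^T \mathcal{T}_i(2, 3) R - D_i^{(A)} \big\|_F^2 + \sum_{j=1}^{J} \big\| P^T \mathcal{T}_j(1, 3) R - D_j^{(B)} \big\|_F^2 + \sum_{k=1}^{K} \big\| P^T \mathcal{T}_k(1, 2) Q - D_k^{(C)} \big\|_F^2 \\
& + \lambda \Big( \big\| P^T A - 1_N \big\|_F^2 + \big\| Q^T B - 1_N \big\|_F^2 + \big\| R^T C - 1_N \big\|_F^2 \Big)
\end{aligned} \tag{5.20}$$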
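The diagonalization identities behind Eq. (5.20) can be verified numerically on an exact trilinear tensor. The sketch below assumes NumPy and uses transposed pseudo-inverses as one convenient choice of $P$, $Q$, $R$ satisfying Eq. (5.19); this choice, the sizes, and the slice indices are illustrative assumptions and not necessarily the matrices produced by the algorithm in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, K, N = 6, 5, 4, 3
A, B, C = (rng.standard_normal((dim, N)) for dim in (I, J, K))
T = np.einsum('in,jn,kn->ijk', A, B, C)        # exact trilinear tensor, Eq. (5.11)

# One possible choice for Eq. (5.19): P^T A = 1_N holds when A has full column rank.
P, Q, R = (np.linalg.pinv(X).T for X in (A, B, C))

i, j, k = 1, 2, 3                              # arbitrary slice indices
D_A = Q.T @ T[i, :, :] @ R                     # horizontal slice T_i(2,3) = B D_i^(A) C^T
D_B = P.T @ T[:, j, :] @ R                     # lateral slice    T_j(1,3) = A D_j^(B) C^T
D_C = P.T @ T[:, :, k] @ Q                     # frontal slice    T_k(1,2) = A D_k^(C) B^T
```

With noisy data the products are only approximately diagonal, which is why Eq. (5.20) measures the deviation in the Frobenius norm.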

Instead of directly minimizing the loss function of Eq. (5.18), we employ the compression technique to transform the original optimization into the reduced one of Eq. (5.20) in our algorithm. Further, denote $U_A$, $U_B$, and $U_C$ as $I \times N$, $J \times N$, and $K \times N$ matrices whose column vectors are the $N$ left singular vectors of $S_A$, $S_B$, and $S_C$, respectively, where

$$\begin{aligned}
S_A &= \sum_{j=1}^{J} \mathcal{T}_j(1, 3)\, \mathcal{T}_j(1, 3)^T = \sum_{k=1}^{K} \mathcal{T}_k(1, 2)\, \mathcal{T}_k(1, 2)^T \\
S_B &= \sum_{k=1}^{K} \mathcal{T}_k(1, 2)^T\, \mathcal{T}_k(1, 2) = \sum_{i=1}^{I} \mathcal{T}_i(2, 3)\, \mathcal{T}_i(2, 3)^T \\
S_C &= \sum_{i=1}^{I} \mathcal{T}_i(2, 3)^T\, \mathcal{T}_i(2, 3) = \sum_{j=1}^{J} \mathcal{T}_j(1, 3)^T\, \mathcal{T}_j(1, 3)
\end{aligned} \tag{5.21}$$

Because $A$, $B$, $C$ belong to the column or row subspaces of $S_A$, $S_B$, and $S_C$, we can estimate

$$A = U_A X_A, \quad B = U_B X_B, \quad C = U_C X_C \tag{5.22}$$

where $X_A$, $X_B$, and $X_C$ are $N \times N$ matrices that define the transformations from $U_A$, $U_B$, and $U_C$ to $A$, $B$, $C$, respectively. Since $P$, $Q$, and $R$ belong to the subspaces of $A$, $B$, $C$ respectively, they can be estimated by

$$P = U_A Y_A, \quad Q = U_B Y_B, \quad R = U_C Y_C \tag{5.23}$$

where $Y_A$, $Y_B$, and $Y_C$ are $N \times N$ matrices that define the transformations from $U_A$, $U_B$, and $U_C$ to $P$, $Q$, and $R$, respectively. Substitution of Eqs. (5.22) and (5.23) into Eq. (5.19) yields

$$X_A^T Y_A = 1_N, \quad X_B^T Y_B = 1_N, \quad X_C^T Y_C = 1_N. \tag{5.24}$$

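Thus, the loss function of Eq. (5.20) is compressed with Eqs. (5.22) and (5.23) as follows:

$$\begin{aligned}
\min L(X_A, X_B, X_C, Y_A, Y_B, Y_C) = \min\ & \sum_{i=1}^{I} \big\| Y_B^T \tilde{\mathcal{T}}_i(2, 3) Y_C - D_i^{(A)} \big\|_F^2 + \sum_{j=1}^{J} \big\| Y_A^T \tilde{\mathcal{T}}_j(1, 3) Y_C - D_j^{(B)} \big\|_F^2 + \sum_{k=1}^{K} \big\| Y_A^T \tilde{\mathcal{T}}_k(1, 2) Y_B - D_k^{(C)} \big\|_F^2 \\
& + \lambda \Big( \big\| Y_A^T X_A - 1_N \big\|_F^2 + \big\| Y_B^T X_B - 1_N \big\|_F^2 + \big\| Y_C^T X_C - 1_N \big\|_F^2 \Big)
\end{aligned} \tag{5.25}$$

where we denote

$$\tilde{\mathcal{T}}_i(2, 3) = U_B^T \mathcal{T}_i(2, 3) U_C, \quad \tilde{\mathcal{T}}_j(1, 3) = U_A^T \mathcal{T}_j(1, 3) U_C, \quad \tilde{\mathcal{T}}_k(1, 2) = U_A^T \mathcal{T}_k(1, 2) U_B. \tag{5.26}$$

This loss function is the dimensionality-reduced version of $L(A, B, C, P, Q, R)$ in Eq. (5.20). The minimization of this loss function over $X_A$, $X_B$, $X_C$, $Y_A$, $Y_B$, $Y_C$ yields the estimates of $X_A$, $X_B$, $X_C$. Subsequently, we can resolve the factor matrices $A$, $B$, $C$.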

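Similar to the optimization procedure in the ALS and ASD algorithms, the necessary condition for $Y_A$ in Eq. (5.25), obtained by fixing the other parameters, is

$$\frac{\partial L}{\partial Y_A} = 2 \sum_{j=1}^{J} \tilde{\mathcal{T}}_j(1, 3) Y_C \Big[ Y_C^T \tilde{\mathcal{T}}_j(1, 3)^T Y_A - D_j^{(B)} \Big] + 2 \sum_{k=1}^{K} \tilde{\mathcal{T}}_k(1, 2) Y_B \Big[ Y_B^T \tilde{\mathcal{T}}_k(1, 2)^T Y_A - D_k^{(C)} \Big] + 2 \lambda X_A \big( X_A^T Y_A - 1_N \big) = 0 \tag{5.27}$$

The solution of Eq. (5.27) is derived as

$$Y_A = \Big( \sum_{j=1}^{J} \tilde{\mathcal{T}}_j(1, 3) Y_C Y_C^T \tilde{\mathcal{T}}_j(1, 3)^T + \sum_{k=1}^{K} \tilde{\mathcal{T}}_k(1, 2) Y_B Y_B^T \tilde{\mathcal{T}}_k(1, 2)^T + \lambda X_A X_A^T \Big)^{-1} \Big( \sum_{j=1}^{J} \tilde{\mathcal{T}}_j(1, 3) Y_C D_j^{(B)} + \sum_{k=1}^{K} \tilde{\mathcal{T}}_k(1, 2) Y_B D_k^{(C)} + \lambda X_A \Big) \tag{5.28}$$

Thus, Eq. (5.28) gives the updating equation of $Y_A$ by fixing $X_A$, $X_B$, $X_C$, $Y_B$, and $Y_C$. As such, the computing equations of $Y_B$ and $Y_C$ by fixing the other parameters can be obtained as

$$Y_B = \Big( \sum_{i=1}^{I} \tilde{\mathcal{T}}_i(2, 3) Y_C Y_C^T \tilde{\mathcal{T}}_i(2, 3)^T + \sum_{k=1}^{K} \tilde{\mathcal{T}}_k(1, 2)^T Y_A Y_A^T \tilde{\mathcal{T}}_k(1, 2) + \lambda X_B X_B^T \Big)^{-1} \Big( \sum_{i=1}^{I} \tilde{\mathcal{T}}_i(2, 3) Y_C D_i^{(A)} + \sum_{k=1}^{K} \tilde{\mathcal{T}}_k(1, 2)^T Y_A D_k^{(C)} + \lambda X_B \Big) \tag{5.29}$$

and

$$Y_C = \Big( \sum_{i=1}^{I} \tilde{\mathcal{T}}_i(2, 3)^T Y_B Y_B^T \tilde{\mathcal{T}}_i(2, 3) + \sum_{j=1}^{J} \tilde{\mathcal{T}}_j(1, 3)^T Y_A Y_A^T \tilde{\mathcal{T}}_j(1, 3) + \lambda X_C X_C^T \Big)^{-1} \Big( \sum_{i=1}^{I} \tilde{\mathcal{T}}_i(2, 3)^T Y_B D_i^{(A)} + \sum_{j=1}^{J} \tilde{\mathcal{T}}_j(1, 3)^T Y_A D_j^{(B)} + \lambda X_C \Big) \tag{5.30}$$

Further, the necessary condition for $X_A$ in Eq. (5.25) is

$$\frac{\partial L}{\partial X_A} = 2 \lambda Y_A \big( Y_A^T X_A - 1_N \big) = 0 \tag{5.31}$$

Considering that $Y_A$ is a square matrix, the updating equation of $X_A$ with Eqs. (5.24) and (5.31) is

$$X_A = \big( Y_A^T \big)^{-1}. \tag{5.32}$$

The similar equations of $X_B$ and $X_C$ by fixing the other parameters can be obtained as

$$X_B = \big( Y_B^T \big)^{-1} \quad \text{and} \quad X_C = \big( Y_C^T \big)^{-1}. \tag{5.33}$$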

With the updating Eqs. (5.28)–(5.31) for the parameter matrices in Eq. (5.25), we can give the general algorithm of slice-wise full rank decomposition (SFRF) as follows:

5.5 Co-clustering in Tensors

As discussed in Sect. 5.4, when a tensor $\mathcal{T}$ is decomposed into the product of factors derived from the unfolded $\mathcal{T}_{(n)}$, co-clusters in $\mathcal{T}$ will be unfolded to bi-clusters in $\mathcal{T}_{(n)}$. For example, $U^{(n)}$ and $V^{(n)}$ in HOSVD are the SVD factor matrices of the unfolded matrices along the $N$ modes. According to Propositions 1–3, we can detect the bi-clusters of $\mathcal{T}_{(n)}$ with their factor matrices. By combining the row indices of bi-clusters in $\mathcal{T}_{(n)}$, co-clusters of the $N$-mode tensor $\mathcal{T}$ can be identified. The major task is to detect hyperplanes in the factor spaces of $\mathcal{T}_{(n)}$. That is, the problem of detecting co-clusters in a multidimensional space has been effectively converted to the detection of bi-clusters in factor spaces. In this chapter, we select the linear grouping algorithm (LGA) to detect hyperplane patterns in factor spaces and introduce it in Sect. 5.5.1. Then the co-clustering framework based on tensor decomposition and factor matrices is proposed in Sect. 5.5.2.

5.5.1 Linear Grouping in Factor Matrices

There are many methods for hyperplane detection such as K-plane, LGA, and Hough transform [4, 5, 32, 33]. We employ LGA in our algorithms because of its robustness, consistency, and convergence. It was first proposed in [38] to detect the linear patterns of data points by fitting a mixture of two simple linear regression models. Van Aelst et al. (2006) in [38] addressed the problem of linear grouping by using an orthogonal regression approach and obtained very good performance in several problems where no outlying observations were present; the method was later improved to robust linear clustering in [39]. The typical LGA can be performed on the column vectors as follows. Algorithm: Procedure of Linear Grouping

5.5.2 Co-clustering Framework Based on Matrix/Tensor Decomposition

By combining the indices of linear groups in factor spaces, many co-clusters are identified. We filter the merged co-clusters $\mathcal{H} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with the following score:

$$S(I_1, I_2, \dots, I_N) = \min_{i_1 \in I_1, \dots, i_N \in I_N} \big( S_{i_1, I_2, \dots, I_N}, \dots, S_{I_1, I_2, \dots, i_N} \big) \tag{5.34}$$

which is also based on Pearson's correlation as in Eq. (5.3), and

$$S_{I_1, \dots, i_n, \dots, I_N} = 1 - \frac{1}{|I_n| - 1} \sum_{i_j \ne i_k \in I_n} \rho\big( a_{I_1, \dots, i_j, \dots, I_N},\ a_{I_1, \dots, i_k, \dots, I_N} \big).$$

A small score represents better coherence in $\mathcal{H}$. Given a threshold value $\delta$, $\mathcal{H}_\delta$ is defined as a $\delta$-co-cluster if $S(I_1, I_2, \dots, I_N) \le \delta$, which plays an important role in our algorithm to filter out the significant $\delta$-co-clusters with the larger size.

In summary, matrix and tensor decomposition provides an alternative way to detect co-clusters in multimodal data. For example, in [30], SVD-based bi-clustering is extended to HOSVD-based co-clustering. The linear dependence among the singular vectors $U^{(n)}$ of HOSVD is of interest. Classifying the linear groups in $U^{(n)}$ can extract co-clusters along each dimension instead of directly optimizing some merit function in the tensor [30]. Similarly, factor matrices in other tensor decomposition algorithms have the same effect on co-clustering. The linearly dependent points in factor spaces are grouped with LGA. The combined co-clusters are filtered with the score in Eq. (5.34). The co-clustering framework based on tensor decomposition is summarized in Fig. 5.3, with HOSVD- and CP-based decomposition as separate examples.

5.6 Experiment Results

The matrix decomposition-based bi-clustering has been successfully applied to analyze Big data matrix [6–10]. In this section, we focus on co-clustering framework based on tensor decomposition. To verify the performance of these algorithms, several big datasets, including multiple synthetic and biological tensors, are used in the experiments. First, a series of synthetic tensors are constructed to evaluate the effects of noise and overlapping complexity on co-clustering algorithms. Then we compare the performance of the two proposed algorithms. Finally, the biological tensor from gene expression data from 12 multiple sclerosis patients under an IFN-β therapy [56] is used to test the performance of our methods in practical problems. Some results were published in our paper [30].

Fig. 5.3 Co-clustering framework based on tensor decomposition

5.6.1 Noise and Overlapping Effects in Co-cluster Identification Using Synthetic Tensors

Two sets of synthetic data are constructed to evaluate the effect of noise and overlapping complexity on TD-based co-clustering algorithms. In our synthetic experiment, a matching score, generated by the Jaccard coefficient [12], is defined to evaluate the agreement between a detected co-cluster and the true one. Let $\mathcal{H}_1 = (I_1^1, I_2^1, \dots, I_N^1)$ and $\mathcal{H}_2 = (I_1^2, I_2^2, \dots, I_N^2)$ be two co-clusters of tensor data $\mathcal{T} \in \mathbb{R}^{T_1 \times T_2 \times \cdots \times T_N}$, where $I_i^j \subseteq T_i$ $(i = 1, \dots, N;\ j = 1, 2)$ is the subset of the $i$th dimension of $\mathcal{T}$. The matching score is defined as

$$MS(\mathcal{H}_1, \mathcal{H}_2) = \max_{\mathcal{H}_1} \max_{\mathcal{H}_2} \frac{\sum_{i=1}^{N} |I_i^1 \cap I_i^2|}{\sum_{i=1}^{N} |I_i^1 \cup I_i^2|} \tag{5.35}$$

Furthermore, we denote a true co-cluster in $\mathcal{T}$ as $\mathcal{H}_{true}$ and an identified one as $\mathcal{H}_\delta$. Obviously, a larger value of $MS(\mathcal{H}_{true}, \mathcal{H}_\delta)$ represents a better identification. Based on the matching scores in Eq. (5.35), the effects of noise and overlapping complexity on co-cluster identification will be discussed with synthetic tensors.
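For a single pair of co-clusters, the ratio inside Eq. (5.35) is straightforward to compute from the index sets. The following is a minimal sketch in plain Python; the function name `matching_score` and the representation of a co-cluster as a tuple of index collections (one per mode) are illustrative assumptions.

```python
def matching_score(H1, H2):
    """Jaccard-style agreement of two co-clusters given as per-mode index collections."""
    intersection = sum(len(set(a) & set(b)) for a, b in zip(H1, H2))
    union = sum(len(set(a) | set(b)) for a, b in zip(H1, H2))
    return intersection / union if union else 0.0

# A true 10 x 10 x 10 co-cluster and a detection that is shifted by 5 along mode 1.
H_true = (range(10), range(10), range(10))
H_detected = (range(5, 15), range(10), range(10))

score = matching_score(H_true, H_detected)   # (5 + 10 + 10) / (15 + 10 + 10)
```

When several co-clusters are detected, the outer maxima in Eq. (5.35) correspond to keeping the best score over the candidate pairs.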

In the first case, a mode-1 fiber additive co-cluster $\mathcal{H}_{true} \in \mathbb{R}^{10 \times 10 \times 10}$ is embedded in a three-mode tensor $\mathcal{T} \in \mathbb{R}^{100 \times 100 \times 100}$, whose background is generated from the standard normal distribution. The additive parameters of every fiber are randomly drawn from $U(-1, 1)$. The performance of the algorithms is validated with noisy tensors. Gaussian white noise is added to the synthetic tensors with different signal-to-noise ratios (SNRs). First the proposed HOSVD-based algorithm is applied to the noisy tensors and then $MS(\mathcal{H}_{true}, \mathcal{H}_\delta)$ is calculated. Every experiment is performed 100 times and the matching scores from all experiments are averaged to obtain the final scores for comparison, as shown in Fig. 5.4a, where the x-axis is the SNR used to generate the noisy tensors and the y-axis is the corresponding matching score of the co-clusters. The error bars in Fig. 5.4a show the standard deviation over the 100 repetitions. The small standard deviation shows the reliability of the algorithm in identifying co-clusters. In Fig. 5.4a, the mean matching score increases as the noise becomes smaller, and $\mathcal{H}_{true}$ can be perfectly detected when SNR $\ge$ 25.

Similar to the previous case of noise, we implant two overlapped fiber additive co-clusters $\mathcal{H}_{true} \in \mathbb{R}^{10 \times 10 \times 10}$ into $\mathcal{T} \in \mathbb{R}^{100 \times 100 \times 100}$. Only overlapping cubic patterns are considered in the synthetic tensors, so the overlap degree $\upsilon$ $(0 < \upsilon < 10)$ is defined as the size of the overlapped cube in every dimension. The result is shown in Fig. 5.4b. The matching scores of two overlapped co-clusters are V-shaped with respect to the overlap degree. We found that one large co-cluster merged from two overlapped co-clusters is always falsely detected, especially with a high overlap degree. As mentioned in bi-clustering analysis, the low matching scores in Fig. 5.4b show that it is still difficult to separate overlapped co-clusters in tensor data.

Comparatively, the proposed algorithm shows better performance in the presence of some noise. HOSVD can reduce some noise by selecting only part of the singular vectors, and the regression in LGA may tolerate some noise and outliers in part [30].

Fig. 5.4 Results of the simulation study in HOSVD-based co-clustering to detect ﬁber additive co-clusters Hδ : (a) with different SNRs (left) and (b) with different overlapped degrees υ (right)

Next, the CP-based co-clustering algorithms are discussed in the following examples. The component number $N$ in Eq. (5.11) plays an important role in CP decomposition. If the component number is wrongly determined, the algorithms generally collapse. So the effect of this parameter is considered in the first case. To evaluate the effect of $N$ on CP-based co-clustering, a series of three-mode tensors $\mathcal{T} \in \mathbb{R}^{80 \times 80 \times 40}$ is generated from the standard normal distribution. Further, a constant co-cluster $\mathcal{H}_{true} \in \mathbb{R}^{20 \times 20 \times 5}$ is embedded into these tensors. We denote $\mathcal{H}_\delta$ as the detected co-cluster. Instead of the criterion based on matching scores, we define the following relative error to evaluate the decomposition accuracy:

$$\varepsilon = \frac{\| \mathcal{H}_{true} - \mathcal{H}_\delta \|_F^2}{\| \mathcal{H}_{true} \|_F^2}$$
A small relative error shows a better representation of the tensor with CP decomposition [30]. ALS and SFRF are selected in the tensor decomposition step of the co-clustering framework. The synthetic tensors are first decomposed with the two methods separately. According to the framework, co-clusters are identified based on hyperplane detection in the corresponding factor matrices. The procedure is repeated 20 times and the averages of the relative errors are shown in Fig. 5.5a. The relative errors of the embedded co-clusters increase with the addition of component rank-one tensors in ALS and SFRF. That is, the performance of CP-based co-clustering is not greatly improved with a larger $N$. Generally, the profiles in Fig. 5.5a are stable with respect to the component number. It seems that it is unnecessary to determine the component number $N$ accurately. The computation complexity is a big challenge in bi- and co-clustering. The result makes it possible to reduce the co-clustering complexity with a small $N$.

In Fig. 5.5a, the curve of the ALS-based algorithm is slightly higher than the SFRF-based one. This is normal if we compare their constraints in minimizing the loss functions in Eqs. (5.14) and (5.25). The ALS complexity of estimating $A$, $B$, $C$ in Eq. (5.14) is greatly reduced to estimating the $N \times N$ mapping matrices $Y_A$, $Y_B$, $Y_C$ in Eq. (5.25).

[Fig. 5.5 panels: relative error of SFRF and ALS versus (a) the number of component rank-one tensors and (b) the signal-to-noise ratio (dB)]

Fig. 5.5 The result of simulation study in ALS- and SFRF-based co-clustering to detect co-clusters: (a) the effect of component number on (left) and (b) the effect of different SNRs (right)

Thus, some information is actually lost in SFRF and so the relative error is a little high. However, the difference between the errors in Fig. 5.5a was not significant ( … 1.

In multi-objective optimization, there is no unique answer; instead, there exists a set of solutions that constitute the Pareto front solutions. We can formally define the Pareto front by the following:

Definition 1 Objective space ($S_s$): space of objective vectors.

Definition 2 Pareto optimal: $x^* \in S_s$ is Pareto optimal if $Ff_i(x^*) \le Ff_i(x)$ for all $i \in \{1, \dots, n\}$ ($n$ is the number of objectives) and $Ff_i(x^*) < Ff_i(x)$ for at least one $i$.

Definition 3 Domination: $x^* \in S_s$ dominates $x \in S_s$ ($x^* \prec x$) if $x^*$ is Pareto optimal; $x^*$ is called a non-dominated individual and $x$ is called a dominated individual.

Definition 4 Pareto front solutions: $x^* \in S_s$ belongs to the Pareto front solutions if $x^* \prec x$, where $x \in S_s$.

For solutions in the Pareto front, we cannot improve any objective without degrading some of the other objectives.

M. Golchin and A. W.-C. Liew

Fig. 6.4 The concept of multi-objective solution in a minimization problem

Mathematically, the different solutions in a Pareto front represent different tradeoffs on the conﬂicting objectives and are therefore equally good. Figure 6.4 shows a two-dimensional objective space. Hollow dot solutions dominate solid dot solutions and form the Pareto front of a multi-objective minimization problem. In case a unique solution is required, one can select a single solution from the Pareto front solutions based on some subjective preferences as deﬁned by the problem or the human decision maker.
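The notion of domination and the extraction of the non-dominated set can be sketched in plain Python for a minimization problem. The function names and the sample objective vectors below are illustrative assumptions; the quadratic scan is meant for clarity, not efficiency.

```python
def dominates(f_a, f_b):
    """True if objective vector f_a dominates f_b (all objectives are minimized)."""
    no_worse = all(a <= b for a, b in zip(f_a, f_b))
    strictly_better = any(a < b for a, b in zip(f_a, f_b))
    return no_worse and strictly_better

def pareto_front(points):
    """Keep the objective vectors that are not dominated by any other vector."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Two objectives, e.g. (mean square residue, negative bi-cluster size).
candidates = [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)]
front = pareto_front(candidates)
```

Here `(3, 4)` is dominated by `(2, 3)` and `(5, 5)` is dominated by `(1, 5)`, while the remaining three vectors are mutually non-dominated trade-offs.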

6.2.3 Bi-cluster Validation

The quality of bi-clustering results can be validated by using statistics or by using domain knowledge [5]. We can compute the statistics of a bi-clustering result to evaluate the accuracy of the detected bi-clusters when the ground truth is known, i.e., for synthetic datasets. In this case, the Jaccard index (matching score) counts the number of rows and columns that are common between the detected bi-cluster and the ground truth as in Eq. (6.1). The value of the Jaccard index ranges from 0 to 1, where 0 indicates no similarity and 1 indicates 100% similarity. The bigger the value of the Jaccard index, the better the performance of the algorithm. The Jaccard index $J$ is defined by

$$J(B, G) = \frac{1}{|B|} \sum_{(R_B, C_B) \in B} \max_{(R_G, C_G) \in G} \frac{|R_B \cap R_G| + |C_B \cap C_G|}{|R_B \cup R_G| + |C_B \cup C_G|} \tag{6.1}$$

where $B$ is the detected bi-cluster and $G$ is the ground truth. $R_B$ and $C_B$ are the rows and columns of the detected bi-cluster, $R_G$ and $C_G$ are the rows and columns of the ground truth, and $|\cdot|$ represents the number of elements. In multi-objective, evolutionary-based bi-clustering techniques, the objectives of the algorithms define the accuracy of the detected bi-clusters. For example,

6 Bi-clustering by Multi-objective Evolutionary Algorithm for. . .

bi-clusters with big size and small mean square residue (MSR) error are usually preferred. For gene expression analysis, the domain knowledge about the gene expression dataset helps to assess the biological relationship of the detected genes. A common way to check the enrichment in the bi-clusters is by using p-value statistics. The p-value shows the significance of the results and the probability of obtaining genes in a bi-cluster by chance. For example, the p-value can be used to measure the probability of finding the number of genes with a specific GO term in a bi-cluster by chance. A smaller p-value indicates strong evidence that the selected genes in a bi-cluster are highly correlated. Equation (6.2) calculates the p-value, where $M$ is the total number of genes in the background set, $A$ is the number of annotated genes in the background set, $R$ is the total number of detected genes in the bi-cluster, and $k$ is the number of annotated genes within the detected genes. The smaller the p-value, the better is the result. Tools such as GeneCodis [20, 21] can be used to study the biological relationship of the extracted bi-clusters by analysing their modular and singular enrichment.

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{A}{i} \binom{M - A}{R - i}}{\binom{M}{R}} \tag{6.2}$$

6.3 Evolutionary-Based Bi-clustering Techniques

Many bi-clustering algorithms have been proposed in the literature and they differ based on the bi-cluster model, search strategy and the algorithmic framework [5]. Here, we focus on the study of evolutionary-based bi-clustering methods. Evolutionary algorithm (EA) is a popular metaheuristic technique for global optimization because of its excellent ability to explore a search space and to solve complex problems [22]. Hence, many researchers proposed to apply EA as a search strategy to the bi-clustering problem [17–19, 23–35]. Two popular EA search strategies include artiﬁcial immune system (AIS) and genetic algorithms (GA).

6.3.1 AIS-Based Bi-clustering

Artificial immune system (AIS) is a subfield of EA inspired by the biological immune system. The techniques in artificial immune systems can be classified into the clonal selection algorithm, negative selection algorithm, immune network algorithm, and dendritic cell algorithm [36].

In [23], the authors use an artificial immune network to search for multiple bi-clusters concurrently. In their proposed method, called multi-objective multi-population artificial immune network (MOM-aiNet), they minimize the MSR and maximize the bi-cluster volume through multi-objective search. To detect multiple bi-clusters concurrently, the authors generate a subpopulation for each bi-cluster by randomly choosing one row and one column of the dataset and running the multi-objective search on each subpopulation separately. In each iteration, each subpopulation undergoes cloning and mutation. Then, all non-dominated bi-clusters are used to generate the new population for the next iteration. The algorithm aims to converge to distinct regions of the search space. To do this, MOM-aiNet compares the degree of overlap of the largest bi-clusters of each population. If the overlap value is greater than a threshold, the two subpopulations are merged into a single subpopulation. The algorithm also generates new random subpopulations from time to time.

Liu et al. [25] propose a dynamic multi-objective immune optimization bi-clustering technique (DMOIOB). They detect maximized bi-clusters with minimized MSR and maximized row variance. A binary string encodes the antibodies with a fixed number of rows and a fixed number of columns. Their algorithm starts by generating an antibody population and an antigen population. Then the size of the antibody population is increased to ensure a sufficient number of individuals and to explore unvisited areas of the search space. For each antibody, the best local guide is selected using the Sigma method and the best antibodies are used to produce the next generation. Non-dominated individuals are used to update the antigen population. Moreover, the size of the antibody population is decreased to prevent excessive growth in the population. In order to find the local best solution, they apply the basic idea of the Sigma method and immune clonal selection among the archive individuals. The quality of the objective values and a biological analysis of the bi-clusters are used to validate their method.
DMOIOB achieves the diversity of solutions by using the concept of crowding distance and ε-distance. However, using only these distances does not guarantee diversity among archive individuals. For example, if there are two bi-clusters in a dataset with the same volume and the same values, then these bi-clusters will have the same objective values in the objective space and the algorithm will ignore one of them.

6.3.2 GA-Based Bi-clustering

Genetic algorithm (GA) is a subfield of evolutionary algorithms inspired by natural evolution. It generates solutions to optimization and search problems by relying on the processes of mutation, crossover and selection. In a genetic algorithm, a population of solutions evolves towards better solutions over several generations. Each solution has a set of properties encoded as a binary string or with another encoding scheme. In the bi-clustering problem, one tries to find the hidden bi-clusters inside a dataset. Hence, random bi-clusters are generated as the initial solutions for each individual. Each solution consists of two parts: the first part is a bit string for the rows and the second part is a bit string for the columns, where a '1' indicates that the corresponding row or column is in the solution and a '0' otherwise. In this case, the encoding length of a solution equals the number of rows plus the number of columns in the dataset. At each generation, crossover and mutation generate new offspring and a predefined number of solutions survive based on their fitness values.

Cheng and Church (CC) [17] were the first to apply a bi-clustering algorithm to analyse gene expression data. They use a heuristic greedy search technique to detect δ-bi-clusters one at a time. For a predefined number of bi-clusters, they iteratively remove and insert rows and columns of a detected bi-cluster while the mean square residue (MSR) error remains below δ. Many recent evolutionary-based methods use this method as a local search strategy [23, 25, 28–33]. MSR measures the degree of coherence of a bi-cluster. Equation (6.3) calculates the MSR value, where $e_{ij}$ is an element of the data matrix, and R and C are the sets of samples and features in a bi-cluster. $e_{iC}$, $e_{Rj}$ and $e_{RC}$ are the mean of the ith row, the mean of the jth column, and the mean of the bi-cluster B = (R, C), respectively, and are calculated by Eqs. (6.4)–(6.6). If MSR(R, C) ≤ δ, then a bi-cluster is called a δ-bi-cluster. The smaller the δ, the better the coherence of the rows and columns.

$$\mathrm{MSR}(R, C) = \frac{1}{|R||C|} \sum_{i \in R,\, j \in C} \left(e_{ij} - e_{iC} - e_{Rj} + e_{RC}\right)^2 \tag{6.3}$$

$$e_{iC} = \frac{1}{|C|} \sum_{j \in C} e_{ij} \tag{6.4}$$

$$e_{Rj} = \frac{1}{|R|} \sum_{i \in R} e_{ij} \tag{6.5}$$

$$e_{RC} = \frac{1}{|R||C|} \sum_{i \in R,\, j \in C} e_{ij} \tag{6.6}$$
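The MSR of Eqs. (6.3)–(6.6) can be sketched in a few lines of plain Python. This is only an illustration: the function name and the representation of the data matrix as a list of lists with integer row and column indices are assumptions, not part of the original method description.

```python
def mean_square_residue(data, rows, cols):
    """MSR of the bi-cluster (rows, cols) of a 2-D list `data`, per Eqs. (6.3)-(6.6)."""
    n_r, n_c = len(rows), len(cols)
    # e_iC: mean of each selected row over the selected columns (Eq. 6.4).
    row_mean = {i: sum(data[i][j] for j in cols) / n_c for i in rows}
    # e_Rj: mean of each selected column over the selected rows (Eq. 6.5).
    col_mean = {j: sum(data[i][j] for i in rows) / n_r for j in cols}
    # e_RC: mean of the whole bi-cluster (Eq. 6.6).
    all_mean = sum(row_mean.values()) / n_r
    # Sum of squared residues, normalised by the bi-cluster size (Eq. 6.3).
    residue = sum(
        (data[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
        for i in rows
        for j in cols
    )
    return residue / (n_r * n_c)
```

For a perfectly additive pattern such as `[[1, 2, 3], [2, 3, 4], [4, 5, 6]]` the residue of every element vanishes, so the MSR is 0 (up to floating-point rounding) and the bi-cluster is a δ-bi-cluster for any δ ≥ 0.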

Following the work of CC, Divina and Aguilar-Ruiz [37] propose an evolutionary computation technique that combines the bi-cluster size, MSR and row variance in a single-objective cost function. In their algorithm, the authors find bi-clusters with bigger size, higher row variance, smaller MSR, and a low level of overlapping among bi-clusters. They use Eq. (6.3) to calculate the MSR value and Eq. (6.7) to calculate the row variance of a bi-cluster B = (R, C).

$$\mathrm{var}_{RC} = \frac{1}{|R||C|} \sum_{i \in R,\, j \in C} \left(e_{ij} - e_{iC}\right)^2 \tag{6.7}$$
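A short sketch of the row variance of Eq. (6.7) follows, under the same assumptions as before (a list-of-lists data matrix and a hypothetical function name). Each element is compared with the mean of its own row over the selected columns.

```python
def row_variance(data, rows, cols):
    """Row variance of the bi-cluster (rows, cols), following Eq. (6.7)."""
    n_r, n_c = len(rows), len(cols)
    total = 0.0
    for i in rows:
        # e_iC: mean of row i restricted to the selected columns.
        row_mean = sum(data[i][j] for j in cols) / n_c
        total += sum((data[i][j] - row_mean) ** 2 for j in cols)
    return total / (n_r * n_c)
```

A bi-cluster whose rows are constant has a row variance of 0, which is why a high row variance is used to favour bi-clusters with non-trivial fluctuation across columns.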

In order to avoid overlapping bi-clusters, the authors use a penalty value defined as the sum of the weight matrix associated with the expression matrix (penalty = $\sum w_p(e_{ij})$). The weight of an element depends on the number of bi-clusters containing that element, and Eq. (6.8) is used to update the weight matrix. In this equation, the covering of $e_{ij}$, denoted by $|\mathrm{Cov}(e_{ij})|$, is the number of bi-clusters containing $e_{ij}$.

$$w_p(e_{ij}) = \begin{cases} \dfrac{\sum_{n \in N,\, m \in M} e^{-|\mathrm{Cov}(e_{nm})|}}{e^{-|\mathrm{Cov}(e_{ij})|}} & \text{if } |\mathrm{Cov}(e_{ij})| > 0 \\ 0 & \text{if } |\mathrm{Cov}(e_{ij})| = 0 \end{cases} \tag{6.8}$$

The overall fitness function is $F_f(B) = \mathrm{MSR}(B)/\delta + 1/\mathrm{var}_B + w_d + \text{penalty}$, where $w_d = w_v\,(w_r(\delta/R) + w_c(\delta/C))$, and $w_v$, $w_r$ and $w_c$ are the different weights that are assigned to the bi-cluster. However, due to the conflicting nature of the fitness criteria, their algorithm based on a single-objective function does not produce optimal bi-clusters.

In [38], the authors propose ECOPSM, which is based on an evolutionary computation algorithm using the order preserving sub-matrix (OPSM) constraint. According to them, a bi-cluster is a group of rows with strictly increasing values across a set of columns. Using an evolutionary computation algorithm, they search for bi-clusters with a certain column length. The algorithm evaluates the probability of a bi-cluster participating in the best OPSM by considering permutations of OPSM length from two to c, such that this score is maximized. Here, each individual is encoded as an $L_C$-length permutation of column indices. This encoding reduces the search space to $\sum_{L=2}^{c} {}^{c}P_{L}$, where P stands for permutation. The fitness function ($F_f(L) = \mathrm{count}(RCM, L)$) is the number of rows stored in RCM, each of which has a subsequence equal to L. Single point crossover and single bit mutation are used to generate the next population. ECOPSM evaluates the results based on the size of the detected bi-clusters and their biological relation.

6.3.3 Multi-objective Bi-clustering

Some of the objectives in a bi-clustering problem are conflicting and cannot be combined into a single function. In fact, in many real-world problems optimization of two or more conflicting objectives is often required. This leads to a multi-objective bi-clustering problem.

In [29], Seridi et al. use a combination of minimizing the MSR (Eq. 6.3), maximizing the size (number of elements |R| × |C|) and maximizing the row variance (Eq. 6.7) as three objectives in a multi-objective bi-clustering algorithm. Maximizing row variance requires significant fluctuation among the set of columns, which is a property of additive pattern bi-clusters. They use the index of rows and columns as the encoding of the individuals, a single point crossover operator, a heuristic search based on the CC algorithm as the mutation operator, and NSGA-II/IBEA as the multi-objective algorithms. Their algorithm returns a set of solutions in the approximate Pareto front of one bi-cluster. Only statistical validation is performed on their results.

In [28], the authors propose a multi-objective non-dominated sorting genetic algorithm (NSGA) with a local search strategy based on the CC algorithm. NSGA-II [39] is based on the use of a non-dominated crowding distance to retain the diversity among the Pareto front, and a crowding selection operator. Here, a binary vector encodes the bi-clusters with a fixed size equal to the number of rows plus the number of columns in the dataset. A value of 1 indicates that the corresponding row or column is present in the bi-cluster and 0 otherwise. Homogeneity [mean square residue as in Eq. (6.3)] and size (|R| × |C|) are the two objectives. Single point crossover and single bit mutation are the genetic operators. As each individual only encodes one bi-cluster, this method only searches for one bi-cluster at a time, and it is not clear how the authors detect multiple bi-clusters in a dataset.

In [27], Maulik et al. propose a multi-objective genetic bi-clustering technique. In their algorithm, they use a variable length string that encodes the centres of M row clusters and the centres of N column clusters, thereby representing M × N bi-clusters in one individual. Their algorithm optimizes two conflicting objectives by minimizing the mean square residue (Eq. 6.3) and maximizing the row variance of the bi-cluster (Eq. 6.7). The search strategy conducts NSGA-II with two-point crossover and single bit string mutation. Rows and columns undergo crossover and mutation separately. The average of the fitness of the bi-clusters encoded in the string gives the fitness of a string. The final bi-clusters include all bi-clusters encoded in the individuals that constitute the Pareto front. They validated their results both biologically and statistically. It is not clear how the algorithm handles similar bi-clusters or suboptimal bi-clusters in the final solution, since Pareto optimality is only with respect to the individual and not with respect to the bi-clusters encoded in an individual.
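The fixed-length binary encoding mentioned above (one bit per row followed by one bit per column) can be illustrated with a small decoding helper. This is a hedged sketch: the function name and the zero-based indexing are assumptions chosen for the example and are not taken from [28].

```python
def decode_binary(bits, n_rows, n_cols):
    """Decode a binary individual of length n_rows + n_cols into index lists.

    The first n_rows bits select rows and the remaining n_cols bits select
    columns; a 1 means the row or column belongs to the bi-cluster.
    """
    if len(bits) != n_rows + n_cols:
        raise ValueError("encoding length must equal n_rows + n_cols")
    rows = [i for i in range(n_rows) if bits[i] == 1]
    cols = [j for j in range(n_cols) if bits[n_rows + j] == 1]
    return rows, cols


# Example: a dataset with 4 rows and 3 columns.
rows, cols = decode_binary([1, 0, 1, 0, 0, 1, 1], n_rows=4, n_cols=3)
size = len(rows) * len(cols)  # the size objective |R| x |C|
```

Because one individual of this form describes exactly one bi-cluster, a search built on it yields one bi-cluster per run, which is the limitation noted above.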

6.4 Multi-objective SPEA-Based Algorithm

Following these works, and in contrast to most existing multi-objective methods that use NSGA-II [39] as the multi-objective optimization algorithm, in our previous works [31–33] we used the strength Pareto evolutionary algorithm 2 (SPEA2) [40]. One advantage of SPEA2 is that the distribution of Pareto front solutions is wider and more homogeneous than in NSGA-II, especially with a larger number of objectives [41]. In addition, SPEA2 has a faster convergence rate in higher-dimensional objective spaces [40].

In [31], we use a binary bit string to encode each individual. The algorithm starts with generating a random number of individuals as an initial population and a fixed-size empty archive. In each iteration, the algorithm copies all non-dominated individuals into the archive. If the size of the archive exceeds the maximum archive size, a truncation operator removes any dominated individuals or duplicated individuals, or individuals with higher fitness value, until the archive satisfies the maximum archive size. On the other hand, if the number of non-dominated individuals is smaller than the size of the archive, the truncation operator adds dominated individuals with low fitness value from the population and the archive of the previous iteration to the current archive. Binary tournament selection with replacement selects parents from the archive. Single point crossover and one-bit mutation generate the next population, with rows and columns undergoing these processes separately. A heuristic search based on CC refines the generated bi-clusters as in Fig. 6.5, where α determines the rates of deleting rows and columns.

Fig. 6.5 Pseudo code of heuristic search

To evaluate the quality of a bi-cluster, the mean square residue error (Eq. 6.3) and the size of the bi-cluster (Eq. 6.10) are calculated. In order to determine the Pareto front individuals, a strength value is calculated for each individual, representing the number of individuals it dominates. The raw fitness value of an individual is determined by summing the strength values of the individuals that dominate it. A non-dominated individual has a raw fitness value equal to 0. In addition, to increase the diversity of results we calculate a density value by taking the inverse of the Euclidean distance of the ith individual to its kth nearest neighbour point ($\sigma_i^k$):

$$\mathrm{Density}(i) = \frac{1}{\sigma_i^k + 2} \tag{6.9}$$

The constant 2 in Eq. (6.9) ensures that the denominator is greater than 0 and the density is smaller than 1. We calculate k as $k = \sqrt{\text{population size} + \text{archive size}}$. The density information is used to differentiate between individuals having similar raw fitness values. The final fitness value of an individual i is given by $F_f(i) = \text{Raw Fitness}(i) + \text{Density}(i)$. The raw fitness value of the Pareto front individuals equals 0. In Fig. 6.6, nodes 1, 2, 7 are Pareto front individuals.

The optimization results include a set of Pareto front individuals. In order to find a single solution among the Pareto front, we apply the k-means algorithm to cluster the Pareto front individuals [32]. Silhouette width is used to determine the optimal number of clusters k by measuring how similar a solution is to its own cluster. The silhouette width is calculated for different values of k, and the highest average silhouette width indicates the number of clusters. After defining the number of clusters k, we apply the k-means algorithm to the Pareto front bi-clusters and the centroid of each cluster is calculated. We select the knee cluster of Pareto front solutions (region 3) as shown in Fig. 6.7. Knee solutions show a good trade-off between the two objectives. The best bi-cluster is the bi-cluster closest to the centre of the knee cluster. The value of k ranges between 3 and 10. Note that, based on the definition of individuals, we only detect one bi-cluster at a time in this algorithm.
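The fitness assignment described above (strength, raw fitness and the density of Eq. (6.9)) can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the function names are hypothetical, all objectives are assumed to be minimized, the list `objs` is assumed to hold the objective vectors of the population and archive together, and k is rounded down to an integer.

```python
import math


def dominates(a, b):
    """True if objective vector a Pareto-dominates b (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def spea2_fitness(objs):
    """Return raw fitness + density for each objective vector in `objs`."""
    n = len(objs)
    # Strength: number of individuals that each individual dominates.
    strength = [sum(dominates(objs[i], objs[j]) for j in range(n)) for i in range(n)]
    # Raw fitness: sum of the strengths of all dominators (0 if non-dominated).
    raw = [
        sum(strength[j] for j in range(n) if dominates(objs[j], objs[i]))
        for i in range(n)
    ]
    # k = sqrt(population size + archive size), rounded down (assumption).
    k = max(1, int(math.sqrt(n)))
    fitness = []
    for i in range(n):
        dists = sorted(math.dist(objs[i], objs[j]) for j in range(n) if j != i)
        sigma_k = dists[min(k, len(dists)) - 1] if dists else 0.0
        density = 1.0 / (sigma_k + 2.0)  # Eq. (6.9), always below 1
        fitness.append(raw[i] + density)
    return fitness
```

Since the density term is always below 1, every non-dominated individual ends up with a fitness below 1, while dominated individuals have a fitness of at least 1; the density only breaks ties between individuals with the same raw fitness.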

Fig. 6.6 The fitness assignment scheme (the strength value and the raw fitness value) for a minimization problem with two objectives f1 and f2

Fig. 6.7 The regional division of Pareto front after applying k-means algorithm when k = 3 (axes: MSR and size; regions 1, 2 and 3)

In order to detect multiple bi-clusters concurrently, in [33] we propose an evolutionary-based bi-clustering algorithm called parallel bi-cluster detection based on strength Pareto front evolutionary algorithm (PBD-SPEA). Figure 6.8 shows the proposed method. Here, each individual encodes all bi-clusters. To generate the next population, heuristic mutation as shown in Fig. 6.5 and crossover operations are performed on the parents. After the algorithm terminates, a post-processing step selects the final bi-clusters from the set of Pareto front individuals. In our algorithm three objective functions are used, namely the mean square residue (MSR) score, the bi-cluster volume score and a variance score. The MSR of a bi-cluster is computed using Eq. (6.3). The size of the bi-cluster, i.e., the second objective, is computed using Eq. (6.10) [37].

$$w_d = w_r \frac{\delta}{R_x} + w_c \frac{\delta}{C_x} \tag{6.10}$$

where $R_x$ and $C_x$ are the numbers of detected rows and columns in a bi-cluster, respectively, and $w_r$ and $w_c$ are weights used to balance the number of rows and columns. δ is the predefined MSR threshold, and the MSR value of the bi-clusters is to stay below this value. The value of $w_r$ is set to one as in [37]. $w_c$ is the ratio of the number of rows to the number of columns in the dataset, and it varies from 1 to 10. $w_d$ is inversely related to the number of elements in the detected bi-cluster.

Fig. 6.8 The flow diagram of the PBD-SPEA algorithm (flowchart steps: start; generate initial population and empty archive; calculate the cost function for each bi-cluster in an individual; calculate the cost function of individuals; determine Pareto front individuals; copy non-dominated individuals into the archive and do the truncation operation; if the stop criterion is not met, perform crossover and mutation for the next generation and repeat; otherwise, post-processing to select final bi-clusters from the Pareto front; end)

The variance score is calculated using Eq. (6.11), which is based on the relevance index of [42]. The variance score is the sum of the variances of each column in the bi-cluster over the variance of that column in the dataset. The smaller the score, the more similar the elements of a bi-cluster are in comparison to the dataset. In this equation, C is the number of columns in the detected bi-cluster. $\sigma_{ij}$ is the local variance of bi-cluster i under column j, and it is calculated based on the variance of the columns in the bi-cluster. $\sigma_j$ is the global variance of column j, which is calculated based on the variance of the columns in the dataset.

$$\text{Variance Score} = \frac{1}{C} \sum_{j=1}^{C} \frac{\sigma_{ij}}{\sigma_j} \tag{6.11}$$
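The size score of Eq. (6.10) and the variance score of Eq. (6.11) can be sketched as below. The function names are hypothetical, the population variance is used for both the local and the global variance (an assumption, since the text does not say which estimator is used), and a column that is constant over the whole dataset is assumed to contribute 0 to avoid a division by zero.

```python
from statistics import pvariance


def size_score(n_rows, n_cols, delta, w_r=1.0, w_c=1.0):
    """w_d of Eq. (6.10): small for large bi-clusters, so it can be minimized."""
    return w_r * delta / n_rows + w_c * delta / n_cols


def variance_score(data, rows, cols):
    """Mean ratio of local to global column variance, following Eq. (6.11)."""
    all_rows = range(len(data))
    total = 0.0
    for j in cols:
        local_var = pvariance([data[i][j] for i in rows])       # sigma_ij
        global_var = pvariance([data[i][j] for i in all_rows])  # sigma_j
        if global_var > 0:  # assumption: constant columns contribute 0
            total += local_var / global_var
    return total / len(cols)
```

For example, selecting only rows whose values are identical in a column gives a local variance of 0 for that column, while selecting all rows of the dataset gives a ratio of 1 for every column.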

The variance score shows the closeness between the expression values of a column among the selected rows. The score is small when the local variance is low compared to the global variance. Equation (6.12) is used to calculate the multi-objective cost function. The SPEA2 algorithm tries to minimize all the objectives, so the smaller the cost function value, the better the detected bi-cluster.

$$\text{Cost function} = \begin{cases} f_1 = \mathrm{MSR}(R, C) \\ f_2 = w_d \\ f_3 = \text{Variance Score} \end{cases} \tag{6.12}$$

In our algorithm [33], each individual represents a number of bi-clusters using a numeric coding scheme as shown in Fig. 6.9. The first number in the coding scheme refers to the number of rows in a bi-cluster; the second number refers to the number of columns in a bi-cluster; the first set of numbers indicates the row indices of the bi-cluster; the second set of numbers indicates the column indices of the bi-cluster; and a zero separates the bi-clusters. For example, the individual shown in Fig. 6.9 contains two bi-clusters of size 4 × 3 and 2 × 3, where the first bi-cluster has row indices of {1, 3, 6, 8} and column indices of {2, 4, 9}, and the second bi-cluster has row indices of {3, 6} and column indices of {7, 2, 3}.

Fig. 6.9 The representation of individuals (example individual: 4 3 1 3 6 8 2 4 9 0 2 3 3 6 7 2 3 0 ...; fields: number of rows, number of cols, row indices, col indices, separator)

The algorithm starts by generating random individuals with the user-defined number of bi-clusters as the initial population. Equation (6.12) is then used to calculate the cost value of each bi-cluster in an individual, and Eq. (6.13) is used to calculate the overall cost function of an individual by combining the cost functions of all bi-clusters in the individual. In this equation, $F_j^i$ is the cost function of the ith bi-cluster of the jth individual and $n_{BC}$ is the number of bi-clusters.

$$\mathrm{Cost}_j = \sqrt{\frac{1}{n_{BC}} \sum_{i=1}^{n_{BC}} F_j^i} \tag{6.13}$$
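To make the numeric coding scheme concrete, the sketch below decodes the example individual of Fig. 6.9 into its bi-clusters and combines per-bi-cluster costs as in Eq. (6.13). The function names are hypothetical, and applying Eq. (6.13) separately to each objective is an assumption, since the text does not spell this out.

```python
import math


def decode_individual(genes):
    """Split a flat gene list into (rows, cols) bi-clusters; 0 is the separator."""
    biclusters, pos = [], 0
    while pos + 1 < len(genes):
        n_rows, n_cols = genes[pos], genes[pos + 1]
        start = pos + 2
        rows = genes[start:start + n_rows]
        cols = genes[start + n_rows:start + n_rows + n_cols]
        biclusters.append((rows, cols))
        pos = start + n_rows + n_cols + 1  # skip the zero separator
    return biclusters


def individual_cost(bicluster_costs):
    """Eq. (6.13): square root of the mean cost over the encoded bi-clusters."""
    return math.sqrt(sum(bicluster_costs) / len(bicluster_costs))


example = [4, 3, 1, 3, 6, 8, 2, 4, 9, 0, 2, 3, 3, 6, 7, 2, 3, 0]
decoded = decode_individual(example)
```

Decoding the example recovers the two bi-clusters described in the text: rows {1, 3, 6, 8} with columns {2, 4, 9}, and rows {3, 6} with columns {7, 2, 3}.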

In PBD-SPEA, crossover is between pairs of similar bi-clusters from the two parents. Equation (6.14) is used to calculate the similarity value between two bi-clusters of two parents.

$$\text{Similarity value} = \frac{|x_{P1} \cap x_{P2}|}{\min(|x_{P1}|, |x_{P2}|)} \tag{6.14}$$

where $x_{P1}$ is the row or column indices of the first parent and $x_{P2}$ is the row or column indices of the second parent. |•| denotes the size of a set. The similarity value measures the degree of overlap of two sets. Figure 6.10 shows an illustration of the similarity search and the corresponding similarity table. For each pair of bi-clusters from the parents, the similarity value is calculated and stored in a similarity table. We compare the largest similarity value of the table to a user-defined threshold. If the value is larger than the threshold, PBD-SPEA copies the best bi-cluster in the pair into the offspring. If the value is less than the threshold, we apply a single point crossover to the pair to generate the bi-cluster in the offspring. Once the pair of bi-clusters is processed, we update the table by removing the corresponding row and column from the table. We repeat this process until the table has only a single item.
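The similarity value of Eq. (6.14) and the greedy use of the similarity table can be sketched as follows. The function names are hypothetical, and averaging the row similarity and the column similarity into one table entry is an assumption made for this example, since the text only defines the measure for a single set of row or column indices. The sketch also processes the last remaining table entry instead of stopping at it.

```python
def similarity(x_p1, x_p2):
    """Eq. (6.14): overlap of two non-empty index sets over the smaller set size."""
    a, b = set(x_p1), set(x_p2)
    return len(a & b) / min(len(a), len(b))


def pair_biclusters(parent1, parent2):
    """Greedily pair bi-clusters of two parents by descending similarity.

    Each parent is a list of (rows, cols) bi-clusters. Returns a list of
    (index_in_parent1, index_in_parent2, similarity) triples.
    """
    table = {
        (i, j): (similarity(b1[0], b2[0]) + similarity(b1[1], b2[1])) / 2
        for i, b1 in enumerate(parent1)
        for j, b2 in enumerate(parent2)
    }
    pairs = []
    while table:
        (i, j), value = max(table.items(), key=lambda kv: kv[1])
        pairs.append((i, j, value))
        # Remove the processed row and column from the similarity table.
        table = {k: v for k, v in table.items() if k[0] != i and k[1] != j}
    return pairs
```

Each returned pair would then be compared with the user-defined threshold to decide between copying the better bi-cluster and applying a single point crossover.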


Fig. 6.10 An example of similarity search in two parents and the resulting similarity table: (a) search procedure, (b) similarity table where S4 has the largest similarity value, (c) the updated similarity table after removing S4

Fig. 6.11 Single point crossover
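One possible reading of the single point crossover of Fig. 6.11 is a value-based cut: indices up to and including the cut value come from parent 1, and indices larger than the cut value come from parent 2. The sketch below follows that reading. The function name is hypothetical, and the parent index lists in the usage example are invented so that the result matches the offspring {1, 3, 6, 7, 9} quoted in the text; the actual parents of the figure are not recoverable here.

```python
import random


def single_point_crossover(indices_p1, indices_p2, cut=None, rng=random):
    """Combine two index lists around a cut value drawn from parent 1."""
    if cut is None:
        cut = rng.choice(sorted(indices_p1))
    head = [i for i in sorted(indices_p1) if i <= cut]  # kept from parent 1
    tail = [i for i in sorted(indices_p2) if i > cut]   # taken from parent 2
    return head + tail


# Hypothetical parents; the cut at row index 6 mirrors the example in the text.
offspring_rows = single_point_crossover([1, 3, 6, 8], [2, 4, 7, 9], cut=6)
```

The same helper can be applied to the column indices of the paired bi-clusters.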

Figure 6.11 illustrates the single point crossover operation in our algorithm. In Fig. 6.11, we select row index 6 randomly. Then in parent 2, all the row indices bigger than 6 are selected, and the row indices of the first bi-cluster in the offspring consist of {1, 3, 6} from parent 1 and {7, 9} from parent 2. The same operation is applied to the columns of the bi-cluster.

The Pareto front consists of a set of individuals, and not all bi-clusters in an individual on the Pareto front are optimal. Sequential selection of a set of best bi-clusters from individuals in the Pareto front is used to obtain the final set of bi-clusters as follows. First, two individuals are randomly selected from the Pareto front and pairs of similar bi-clusters from the two individuals are identified based on Eq. (6.14). Then, in each pair, we retain only the best bi-cluster. We then compare the list of best bi-clusters obtained with the next individual randomly selected from the Pareto front to obtain a new set of best bi-clusters. We repeat this procedure until we have examined all individuals in the Pareto front.

To validate PBD-SPEA, we test it on two simulated datasets, SD1 and SD2, where ground truth is available. SD1 has 200 rows and 40 columns with 4 embedded bi-clusters, which consist of a 40 × 7 constant row pattern bi-cluster, a 25 × 10 constant column pattern bi-cluster, a 35 × 8 constant column pattern bi-cluster with three rows and three columns in common with the previous bi-cluster, and a 40 × 8 constant value pattern bi-cluster. Gaussian noise with a standard deviation of 0.3 is added to degrade the dataset. SD2 has 20 rows and 15 columns, and consists of two constant bi-clusters of sizes 10 × 6 and 11 × 6, respectively, with no noise and six rows in common. Due to the overlapping rows between the two bi-clusters, SD2 actually has three bi-clusters.

Fig. 6.12 Bi-clustering accuracy of PBD-SPEA in detecting different bi-clusters for the SD1 dataset (bars for the constant value pattern, the two constant column pattern and the constant row pattern bi-clusters)

Figure 6.12 shows the bi-clustering accuracy in detecting different bi-clusters for the SD1 dataset. PBD-SPEA is able to detect over 70% of the different pattern bi-clusters in the dataset despite the existence of noise and overlap. This performance suggests that PBD-SPEA is robust to external factors such as noise. As a comparison, we also run existing approaches such as LAS [43], xMotif [44] and FABIA [45] on the SD1 dataset. LAS and PBD-SPEA are the two methods that are able to detect all four bi-clusters. However, PBD-SPEA has higher accuracy than LAS in detecting all four bi-clusters. The xMotif algorithm is not able to discover any bi-cluster because of the noise added to the dataset. Interestingly, regardless of its high accuracy in detecting constant row and constant column pattern bi-clusters, FABIA is not able to detect the constant value pattern bi-cluster, as its bi-cluster model is the result of the outer product of two vectors. For the SD2 dataset, our method is able to detect the two bi-clusters plus the overlapped part as a separate bi-cluster. Some of the methods, such as xMotif [44], are also able to detect the two bi-clusters but fail to detect the overlapped part. Note that technically the overlapped part can be considered as a separate, third bi-cluster.

6.5 Bi-clustering Experiments

We apply PBD-SPEA to three different real-world datasets, which consist of a gene expression dataset, a multimodal image dataset and a Facebook dataset. Multimodal data are data that include different modalities such as audio, image and text. The different modalities in the data can potentially reveal different aspects of the underlying concepts in the data, but the feature vectors derived from multimodal data are usually of high dimension and therefore suffer from the curse of dimensionality [46]. For the image dataset, a set of keywords (annotation words) is extracted and is used to form the feature vector that describes the image. PBD-SPEA is then used to cluster the images so that images that share the same concepts are grouped together.

In recent years, social networks such as Facebook and Twitter have been attracting millions of users. This has led to the creation of web-based applications offered to social network users. Big data analysis in social networks is concerned with studying the users' behaviour and usage patterns in order to design new tools and applications [47]. We run PBD-SPEA on the Facebook dataset to discover how users are grouped together based on different subsets of features.

In all these experiments, the parameters that control the EA are set to: maximum iteration = 150, population size = 100, archive size = 40, mutation probability = 0.2, crossover probability = 0.8.
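For reference, the parameter settings listed above can be collected in a small configuration object. This is only an illustrative sketch: the class and field names are hypothetical, and rounding the neighbour index k of the density estimate down to an integer is an assumption.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class EAParams:
    """Evolutionary algorithm settings used in the experiments."""
    max_iterations: int = 150
    population_size: int = 100
    archive_size: int = 40
    mutation_prob: float = 0.2
    crossover_prob: float = 0.8

    @property
    def k_nearest(self) -> int:
        # k = sqrt(population size + archive size), rounded down (assumption).
        return int(math.sqrt(self.population_size + self.archive_size))


params = EAParams()
```

With these settings, the square root of 100 + 40 is about 11.8, so the density of Eq. (6.9) would be based on roughly the 11th nearest neighbour.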

6.5.1 Gene Expression Dataset

We apply our algorithm on the yeast Saccharomyces cerevisiae gene expression dataset [48], which consists of 2884 genes and 17 conditions. In order to verify the functional enrichment of the detected genes in the bi-clusters for the Yeast dataset, we use GENECODIS [20, 21] (http://genecodis.cnb.csic.es/) to verify the results based on gene ontology (GO) [49, 50] and KEGG pathway [51]. Table 6.1 summarizes the evaluation based on singular and modular biological enrichment analysis of one of the detected bi-clusters with the smallest p-value. The results show the enrichment of the detected bi-cluster in comparison to the background occurrence frequency.

Table 6.1 Singular and modular biological enrichment analysis

| Biological process | Annotation | p-value |
|---|---|---|
| Modular Enrichment Analysis | GO:0002181: cytoplasmic translation (BP); GO:0003735: structural constituent of ribosome (MF); GO:0005737: cytoplasm (CC); (KEGG) 03010: Ribosome | 8.2541e-05 |
| Cellular Component Ontology | GO:0005737: cytoplasm (CC) | 4.4078e-04 |
| Biological Process Ontology | GO:0002181: cytoplasmic translation (BP) | 2.1681e-02 |
| Molecular Function Ontology | GO:0003735: structural constituent of ribosome (MF) | 7.5023e-04 |
| Singular Enrichment Analysis of KEGG Pathway | (KEGG) 03010: Ribosome | 2.3461e-04 |
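The enrichment p-value of Eq. (6.2), which underlies this kind of analysis, is the upper tail of a hypergeometric distribution and can be sketched with the standard library alone. The function name is hypothetical; tools such as GeneCodis may apply further corrections (for example for multiple testing) that are not shown here.

```python
from math import comb


def enrichment_p_value(M, A, R, k):
    """Probability of seeing at least k annotated genes among R detected genes.

    M: genes in the background set, A: annotated genes in the background set,
    R: detected genes in the bi-cluster, k: annotated genes among the detected.
    """
    total = comb(M, R)
    # Probability mass of observing fewer than k annotated genes by chance.
    below = sum(comb(A, i) * comb(M - A, R - i) for i in range(k))
    return 1 - below / total
```

As a small worked example, with M = 10, A = 4, R = 3 and k = 3 the only way to reach k is to draw three of the four annotated genes, which gives 4/120, i.e. about 0.033.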

6.5.2 Image Dataset

We apply PBD-SPEA to cluster the 15-scene categories dataset [52, 53, 54]. In order to apply PBD-SPEA, we generate a feature vector that describes each image using the bag-of-words (BoW) model. Feature detection, feature description and codebook generation are the three steps in the BoW model [52]. The features in the BoW model are keywords that characterize the image for each category. For example, for the kitchen category, appliances, cup, cupboard, drawer, plate set, cutlery, pots and pans, kitchen bench, dining table, and chair are the set of keywords in this category. For the living room category, sofa, armchair, chair, cushion, lamp, coffee table, side table, rug, ceiling fan, ﬁreplace, photo frame and curtain are the keywords. We use ten images from the ﬁrst ten categories and ﬁve individual images for each keyword from Google search images. Then, a sparse binary vector is used as feature descriptor to represent each image. We then apply PBD-SPEA to group the images into meaningful clusters. The number of bi-clusters is set to two, and Fig. 6.13 shows the two detected bi-clusters. The ﬁrst detected bi-cluster (Bic1) includes images with features such as cupboard, appliances, pots and pans, dining table and chair, plate set and cutlery, and cup. The second bi-cluster (Bic2) includes images with features such as sofa, coffee table, cup, plate set and cutlery, dining table and chair. Interestingly, there is an overlap between the detected images in the two bi-clusters (cup, plate set and cutlery, dining table and chair). From visual inspection, we can see that Bic1 images correspond to the kitchen category, while Bic2 points to the living room category. This experiment shows that bi-clustering can be used to uncover higher-level semantic information within images. Here, we can see that PBD-SPEA is able to uncover higher-level concepts (i.e. kitchen, living room) by recognizing a group of features that clusters a group of images together.
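The sparse binary feature vectors described above can be illustrated with a minimal sketch that turns per-image keyword annotations into a binary image-by-keyword matrix, which is the kind of input a bi-clustering algorithm operates on. The function name, the tiny vocabulary and the example annotations are invented for illustration and are not taken from the dataset.

```python
def binary_feature_matrix(annotations, vocabulary):
    """Build one binary row per image: 1 if the keyword annotates the image."""
    index = {word: j for j, word in enumerate(vocabulary)}
    matrix = []
    for words in annotations:
        row = [0] * len(vocabulary)
        for word in words:
            if word in index:  # keywords outside the vocabulary are ignored
                row[index[word]] = 1
        matrix.append(row)
    return matrix


# Hypothetical vocabulary and annotations for two images.
vocabulary = ["cup", "sofa", "cupboard"]
features = binary_feature_matrix([["cup", "cupboard"], ["sofa", "cup"]], vocabulary)
```

In such a matrix, a bi-cluster is a subset of images (rows) that share a subset of keywords (columns), which is how a keyword such as "cup" can appear in both a kitchen-like and a living-room-like bi-cluster.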

6.5.3 Facebook Dataset

We apply PBD-SPEA to the Social Circles Facebook dataset [55], which consists of circles (friends' lists) from Facebook. This dataset includes 4039 nodes and 88,234 edges and is publicly available at https://snap.stanford.edu/data/egonets-Facebook.html. In this dataset, there are ten networks where each user is represented by a set of features including birthday, education, first name, last name, gender, hometown, languages, location, work and locale. We run PBD-SPEA for each network separately. Table 6.2 reports, for each network, the number of IDs and features in the network; the number of IDs and features in the detected bi-cluster; the mean value of the pairwise cosine distance of the detected bi-cluster μcd; the mean value of the pairwise cosine distance of the samples μ; the standard deviation of the pairwise cosine distance of the samples σ; and the common detected features. From the results in Table 6.2, we can conclude that the friend circles are mostly formed by common educational activities, workplace and/or biological relationships.

Fig. 6.13 Detected bi-clusters group images with similar concepts

Table 6.2 Bi-clustering results on ten different Facebook networks

| Net. no. | Network # IDs | Network # features | Detected bi-cluster # IDs | Detected bi-cluster # features | μcd (detected bi-cluster) | μ | σ | Common detected features |
|---|---|---|---|---|---|---|---|---|
| 1 | 348 | 224 | 180 | 30 | 0.2267 | 0.45 | 0.11 | Education, last name, work |
| 2 | 1046 | 576 | 496 | 200 | 0.3577 | 0.66 | 0.05 | Education, work |
| 3 | 228 | 161 | 91 | 20 | 0.0986 | 0.51 | 0.05 | Education, work, birthday |
| 4 | 160 | 105 | 78 | 10 | 0.0986 | 0.64 | 0.05 | Education, work, birthday |
| 5 | 171 | 63 | 63 | 29 | 0.2574 | 0.63 | 0.03 | Education, last name |
| 6 | 67 | 48 | 24 | 8 | 0.1622 | 0.55 | 0.04 | Education, work |
| 7 | 793 | 319 | 416 | 97 | 0.1984 | 0.49 | 0.04 | Education, hometown, last name, work |
| 8 | 756 | 480 | 449 | 106 | 0.2043 | 0.42 | 0.08 | Education, work, birthday |
| 9 | 548 | 262 | 221 | 26 | 0.0893 | 0.38 | 0.03 | Education, last name, work |
| 10 | 60 | 42 | 20 | 10 | 0.1165 | 0.58 | 0.06 | Education, work |

These bi-clusters are smaller

and more coherent than the original networks (in network #2, the detected bi-cluster of 496 × 200 entries is almost 84% smaller than the original 1046 × 576 network). Furthermore, the bi-clusters can be used for tagging groups with similar interests and as input data for recommendation systems, search relevance evaluation, user profiling [56] and targeted marketing [57]. For example, in network #2, the detected bi-cluster groups users with the same educational background and work history. Most probably, the users in this group go through educational events, such as a graduation ceremony, at the same time, and they are much more likely to have similar needs. It is easier to create a promotional post or an advertisement that targets these core users than to target the whole network with its various needs.

To study how well our results correspond to the users' interests, we generate 10,000 submatrices from each network by randomly sampling rows and columns according to the size of the detected bi-cluster. For each sampled submatrix, we calculate the mean value of the pairwise cosine distance μcd and plot the histogram of these values in Fig. 6.14. In Fig. 6.14, the x-axis refers to the mean value of the pairwise cosine distance, divided into ten bins: 1 ∈ [0, 0.1); 2 ∈ [0.1, 0.2); 3 ∈ [0.2, 0.3); 4 ∈ [0.3, 0.4); 5 ∈ [0.4, 0.5); 6 ∈ [0.5, 0.6); 7 ∈ [0.6, 0.7); 8 ∈ [0.7, 0.8); 9 ∈ [0.8, 0.9); 10 ∈ [0.9, 1). In these figures, the asterisk on the x-axis shows where the μcd of the detected bi-cluster is located. These empirical distributions provide the baseline statistical distributions against which we assess the significance of the detected bi-clusters. We calculated the mean value μ and standard deviation σ of these distributions. The μcd of the detected bi-clusters are mostly smaller than μ − 3σ. Since Chebyshev's inequality for a general probability distribution places at most 1/9 (about 11%) of the values more than 3σ away from the mean, and hence about 5.5% below μ − 3σ for a symmetric distribution (note that


Fig. 6.14 The histogram of the mean values of pairwise cosine distance for randomly generated bi-clusters: (a) network ID 1, (b) network ID 2, (c) network ID 3, (d) network ID 4, (e) network ID 5, (f) network ID 6, (g) network ID 7, (h) network ID 8, (i) network ID 9, (j) network ID 10


this is a much weaker bound than the bound obtained under a normality assumption), a randomly selected submatrix is unlikely to have a μcd similar to that of the detected bi-cluster, and the probability of obtaining the detected bi-cluster by chance is very low. Therefore, PBD-SPEA is able to detect bi-clusters that show significant semantic enrichment.
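The random-sampling baseline described above can be sketched as follows. The data here are synthetic (a random sparse binary matrix with a planted coherent block), the function names are hypothetical, and 1000 samples are drawn instead of the 10,000 used in the chapter to keep the example fast; it is meant as an illustration of the procedure, not a reproduction of the reported numbers.

```python
import numpy as np

def mean_pairwise_cosine_distance(M):
    """Mean cosine distance over all distinct pairs of rows of M."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / np.where(norms == 0, 1.0, norms)   # all-zero rows stay zero
    sim = unit @ unit.T
    upper = np.triu_indices(M.shape[0], k=1)      # each pair counted once
    return float(np.mean(1.0 - sim[upper]))

def random_baseline(data, n_rows, n_cols, n_samples, rng):
    """Mean distances of random submatrices with the size of the bi-cluster."""
    out = np.empty(n_samples)
    for t in range(n_samples):
        r = rng.choice(data.shape[0], size=n_rows, replace=False)
        c = rng.choice(data.shape[1], size=n_cols, replace=False)
        out[t] = mean_pairwise_cosine_distance(data[np.ix_(r, c)])
    return out

rng = np.random.default_rng(0)

# Synthetic stand-in for one network: a sparse binary user-by-feature matrix
# with a planted block of users who share the same features.
data = (rng.random((200, 60)) < 0.2).astype(float)
rows, cols = np.arange(40), np.arange(12)
data[np.ix_(rows, cols)] = 1.0

mu_cd = mean_pairwise_cosine_distance(data[np.ix_(rows, cols)])
baseline = random_baseline(data, len(rows), len(cols), 1000, rng)
mu, sigma = baseline.mean(), baseline.std()

# Chebyshev: at most 1/9 of any distribution lies more than 3 sigma from its mean.
significant = mu_cd < mu - 3 * sigma
```

Comparing `mu_cd` with `mu - 3 * sigma` is the same test applied to each row of Table 6.2; for instance, network #2 has μcd = 0.3577 against μ − 3σ = 0.66 − 0.15 = 0.51.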

6.6

Conclusion

Bi-clustering is a powerful tool for unsupervised pattern recognition in many different applications involving Big data. In this chapter, a review of bi-clustering algorithms and multi-objective evolutionary optimization and their application to multimodal and Big data is given. Evolutionary algorithms are efficient optimization and search algorithms that are able to find near-optimal bi-clusters in Big datasets. A multi-objective search strategy handles multiple conflicting objectives, which are often encountered in bi-clustering problems. We first describe common types of bi-cluster patterns that are widely used. We then describe several evolutionary-based bi-clustering algorithms in terms of their objective functions, search strategies and how they validate their results. Finally, we review our recent work on a bi-clustering algorithm based on the strength Pareto front evolutionary algorithm, called PBD-SPEA. We illustrate the application of PBD-SPEA to gene expression data, multimodal data such as an image dataset and Big data such as the Facebook dataset. For the gene expression dataset, our method is able to detect highly enriched bi-clusters. For the image dataset, we are able to discover higher-level semantic information within groups of images. For the Facebook dataset, PBD-SPEA is able to detect coherent submatrices of users and features that are useful for subsequent analysis.

Acknowledgement Maryam Golchin is supported by the Australian Government Research Training Program Scholarship.

References
1. Frost, S.: Drowning in Big Data? Reducing Information Technology Complexities and Costs for Healthcare Organizations (2015)
2. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, New York (2011)
3. Fan, J., Han, F., Liu, H.: Challenges of big data analysis. Natl. Sci. Rev. 1, 293–314 (2014)
4. Bailey, K.D.: Numerical Taxonomy and Cluster Analysis. Typologies and Taxonomies, pp. 35–65. Sage, Thousand Oaks (1994)
5. Zhao, H., Liew, A.W.C., Wang, D.Z., Yan, H.: Biclustering analysis for pattern discovery: current techniques, comparative studies and applications. Curr. Bioinf. 7, 43–55 (2012)
6. Liew, A.W.C., Gan, X., Law, N.F., Yan, H.: Bicluster Analysis for Coherent Pattern Discovery. In: Encyclopedia of Information Science and Technology, IGI Global, pp. 1665–1674 (2015)


7. Hartigan, J.A.: Direct clustering of a data matrix. J. Am. Stat. Assoc. 67, 123–129 (1972)
8. Mirkin, B.G.E.: Mathematical classification and clustering. Kluwer Academic, Dordrecht (1996)
9. Liew, A.W.C.: Biclustering analysis of gene expression data using evolutionary algorithms. In: Iba, H., Noman, N. (eds.) Evolutionary Computation in Gene Regulatory Network Research, pp. 67–95. Wiley, Hoboken (2016)
10. MacDonald, T.J., Brown, K.M., LaFleur, B., Peterson, K., Lawlor, C., Chen, Y., Packer, R.J., Cogen, P., Stephan, D.A.: Expression profiling of medulloblastoma: PDGFRA and the RAS/MAPK pathway as therapeutic targets for metastatic disease. Nat. Genet. 29, 143–152 (2001)
11. Cha, K., Oh, K., Hwang, T., Yi, G.-S.: Identification of coexpressed gene modules across multiple brain diseases by a biclustering analysis on integrated gene expression data. In: Proceedings of the ACM 8th International Workshop on Data and Text Mining in Bioinformatics, ACM, pp. 17–17 (2014)
12. Banerjee, A., Dhillon, I., Ghosh, J., Merugu, S., Modha, D.S.: A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. J. Mach. Learn. Res. 8, 1919–1986 (2007)
13. Goyal, A., Ren, R., Jose, J.M.: Feature subspace selection for efficient video retrieval. In: Boll, S., Tian, Q., Zhang, L., Zhang, Z., Chen, Y.P. (eds.) Advances in Multimedia Modeling. MMM 2010, pp. 725–730. Springer, Berlin (2010)
14. Wang, H., Wang, W., Yang, J., Yu, P.S.: Clustering by pattern similarity in large data sets. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, pp. 394–405 (2002)
15. Han, L., Yan, H.: A fuzzy biclustering algorithm for social annotations. J. Inf. Sci. 35, 426–438 (2009)
16. Li, H., Yan, H.: Bicluster analysis of currency exchange rates. In: Prasad, B. (ed.) Soft Computing Applications in Business, pp. 19–34. Springer, Berlin (2008)
17. Cheng, Y., Church, G.M.: Biclustering of expression data. In: Proceeding of Intelligent Systems for Molecular Biology (ISMB), American Association for Artificial Intelligence (AAAI), pp. 93–103 (2000)
18. Mukhopadhyay, A., Maulik, U., Bandyopadhyay, S., Coello, C.A.C.: A survey of multiobjective evolutionary algorithms for data mining: Part I. IEEE Trans. Evol. Comput. 18, 4–19 (2014)
19. Mukhopadhyay, A., Maulik, U., Bandyopadhyay, S., Coello, C.A.C.: Survey of multiobjective evolutionary algorithms for data mining: Part II. IEEE Trans. Evol. Comput. 18, 20–35 (2014)
20. Carmona Saez, P., Chagoyen, M., Tirado, F., Carazo, J.M., Pascual Montano, A.: GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists. Genome Biol. 8, R3 (2007)
21. Nogales Cadenas, R., Carmona Saez, P., Vazquez, M., Vicente, C., Yang, X., Tirado, F., Carazo, J.M., Pascual Montano, A.: GeneCodis: interpreting gene lists through enrichment analysis and integration of diverse biological information. Nucleic Acids Res. 37, W317–W322 (2009)
22. De Jong, K.A.: Evolutionary Computation: A Unified Approach. MIT Press, Cambridge (2006)
23. Coelho, G.P., de França, F.O., Von Zuben, F.J.: A multi-objective multipopulation approach for biclustering. In: de Castro, L.N., Timmis, J. (eds.) Artificial Immune Systems, pp. 71–82. Springer, Heidelberg (2008)
24. Liu, J., Li, Z., Hu, X., Chen, Y., Liu, F.: Multi-objective dynamic population shuffled frog-leaping biclustering of microarray data. BMC Genomics. 13, S6 (2012)
25. Liu, J., Li, Z., Hu, X., Chen, Y., Park, E.K.: Dynamic biclustering of microarray data by multiobjective immune optimization. BMC Genomics. 12, S11 (2011)
26. Liu, J., Li, Z., Liu, F., Chen, Y.: Multi-objective particle swarm optimization biclustering of microarray data. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM), IEEE, pp. 363–366 (2008)


27. Maulik, U., Mukhopadhyay, A., Bandyopadhyay, S.: Finding multiple coherent biclusters in microarray data using variable string length multiobjective genetic algorithm. IEEE Trans. Inf. Technol. Biomed. 13, 969–975 (2009)
28. Mitra, S., Banka, H.: Multi-objective evolutionary biclustering of gene expression data. Pattern Recognit. 39, 2464–2477 (2006)
29. Seridi, K., Jourdan, L., Talbi, E.G.: Multi-objective evolutionary algorithm for biclustering in microarrays data. In: IEEE Congress on Evolutionary Computation (CEC), IEEE, pp. 2593–2599 (2011)
30. Seridi, K., Jourdan, L., Talbi, E.G.: Using multiobjective optimization for biclustering microarray data. Appl. Soft Comput. 33, 239–249 (2015)
31. Golchin, M., Davarpanah, S.H., Liew, A.W.C.: Biclustering analysis of gene expression data using multi-objective evolutionary algorithms. In: Proceeding of the 2015 International Conference on Machine Learning and Cybernetics, IEEE, Guangzhou, pp. 505–510 (2015)
32. Golchin, M., Liew, A.W.C.: Bicluster detection using strength pareto front evolutionary algorithm. In: Proceedings of the Australasian Computer Science Week Multiconference, ACM, Canberra, pp. 1–6 (2016)
33. Golchin, M., Liew, A.W.C.: Parallel biclustering detection using strength pareto front evolutionary algorithm. Inf. Sci. 415–416, 283–297 (2017)
34. Dhillon, I.S.: Co-clustering documents and words using bipartite spectral graph partitioning. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, San Francisco, pp. 269–274 (2001)
35. Dhillon, I.S., Mallela, S., Modha, D.S.: Information-theoretic co-clustering. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, Washington, DC, pp. 89–98 (2003)
36. De Castro, L.N., Timmis, J.: Artificial Immune Systems: A New Computational Intelligence Approach. Springer, Heidelberg (2002)
37. Divina, F., Aguilar Ruiz, J.S.: Biclustering of expression data with evolutionary computation. IEEE Trans. Knowl. Data Eng. 18, 590–602 (2006)
38. Roh, H., Park, S.: A novel evolutionary algorithm for bi-clustering of gene expression data based on the order preserving sub-matrix (OPSM) constraint. In: 8th IEEE International Conference on BioInformatics and BioEngineering (BIBE), IEEE, pp. 1–14 (2008)
39. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans. Evol. Comput. 6, 182–197 (2002)
40. Zitzler, E., Laumanns, M., Thiele, L.: SPEA2: improving the strength pareto evolutionary algorithm. In: Proceedings of the Evolutionary Methods for Design, Optimization and Control with Applications to Industrial Problems (EUROGEN), Eidgenössische Technische Hochschule Zürich (ETH), Institut für Technische Informatik und Kommunikationsnetze (TIK), Athens (2001)
41. Konak, A., Coit, D.W., Smith, A.E.: Multi-objective optimization using genetic algorithms: a tutorial. Reliab. Eng. Syst. Saf. 91, 992–1007 (2006)
42. Yip, K.Y., Cheung, D.W., Ng, M.K.: Harp: a practical projected clustering algorithm. IEEE Trans. Knowl. Data Eng. 16, 1387–1397 (2004)
43. Shabalin, A.A., Weigman, V.J., Perou, C.M., Nobel, A.B.: Finding large average submatrices in high dimensional data. Ann. Appl. Stat. 985–1012 (2009)
44. Murali, T., Kasif, S.: Extracting conserved gene expression motifs from gene expression data. In: Proceedings of the Pacific Symposium on Biocomputing, pp. 77–88 (2003)
45. Hochreiter, S., Bodenhofer, U., Heusel, M., Mayr, A., Mitterecker, A., Kasim, A., Khamiakova, T., Van Sanden, S., Lin, D., Talloen, W.: FABIA: factor analysis for bicluster acquisition. Bioinformatics. 26, 1520–1527 (2010)
46. Zhu, X., Luo, X., Xu, C.: Editorial learning for multimodal data. Neurocomputing. 253, 1–5 (2017)


47. Bozkır, A.S., Mazman, S.G., Sezer, E.A.: Identification of user patterns in social networks by data mining techniques: Facebook case. In: Second International Symposium on Information Management in a Changing World (IMCW 2010), Ankara, Turkey, pp. 145–153 (2010)
48. Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J.: A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell. 2, 65–73 (1998)
49. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, M.J., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T.: Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000)
50. Boyle, E.I., Weng, S., Gollub, J., Jin, H., Botstein, D., Cherry, J.M., Sherlock, G.: GO::TermFinder: open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics. 20, 3710–3715 (2004)
51. Kanehisa, M., Goto, S.: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000)
52. Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, pp. 524–531 (2005)
53. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, New York, pp. 2169–2178 (2006)
54. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42, 145–175 (2001)
55. Leskovec, J., Mcauley, J.J.: Learning to discover social circles in ego networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS), Lake Tahoe, Nevada, pp. 539–547 (2012)
56. Mislove, A., Viswanath, B., Gummadi, K.P., Druschel, P.: You are who you know: inferring user profiles in online social networks. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, ACM, pp. 251–260 (2010)
57. Bolotaeva, V., Cata, T.: Marketing opportunities with social networks. J. Internet Soc. Netw. Virtual Commun. 2011, 1–8 (2011)

Chapter 7

Unsupervised Learning on Grassmann Manifolds for Big Data

Boyue Wang and Junbin Gao

Abstract With the wide use of inexpensive cameras in many domains, such as human action recognition, safety production detection, and traffic jam detection, a huge amount of video data needs to be processed efficiently. However, it is impractical to handle so many videos when only very limited labels are available. Thus, unsupervised learning algorithms for videos, i.e., clustering and dimension reduction, have attracted increasing interest recently, and good performance on real-world videos is urgently desired. To achieve this goal, it is critical to find a proper representation for high-dimensional data and to build proper clustering or dimension reduction models based on the new representation. The purpose of this chapter is to review the most recent developments in Grassmann manifold representations of video and image-set data in computer vision.

7.1

Introduction

In recent years, subspace clustering, one of the important unsupervised learning methods, has attracted great interest in image analysis, computer vision, pattern recognition, and signal processing [1, 2]. The basic idea of subspace clustering rests on the fact that most data have intrinsic subspace structures and can be regarded as samples from a union of multiple subspaces. Thus, the main goal of subspace clustering is to group data into different clusters such that the data points in each cluster come from exactly one subspace. To investigate and represent the underlying subspace structure, many subspace methods have been proposed, such as the conventional

B. Wang (*) Municipal Key Laboratory of Multimedia and Intelligent Software Technology, Beijing University of Technology, Beijing, China e-mail: [email protected] J. Gao The University of Sydney Business School, University of Sydney, Sydney, NSW, Australia e-mail: [email protected] © Springer Nature Switzerland AG 2019 K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_7


B. Wang and J. Gao

iterative methods [3], the statistical methods [4, 5], the factorization-based algebraic approaches [6, 7], and the spectral clustering-based methods [2, 8–11]. Among all these subspace clustering methods, the spectral clustering methods based on an affinity matrix are considered to have good prospects [2]: an affinity matrix is first learned from the given data, and the final clustering results are then obtained by a spectral clustering algorithm such as Normalized Cuts (NCut) [12] or simply K-means.

The key ingredient in a spectral clustering method is the construction of a proper affinity matrix for the data. In the typical method, Sparse Subspace Clustering (SSC) [2], one assumes that the data of the subspaces are independent and are sparsely represented under the so-called $\ell_1$ Subspace Detection Property [13], in which the within-class affinities are sparse and the between-class affinities are all zeros. It has been proved that, under certain conditions, the multiple subspace structures can be exactly recovered via $\ell_p$ ($p \le 1$) minimization [14].

Most current sparse subspace methods focus mainly on independent sparse representations of the data objects. However, the relations among data objects, or the underlying global structure of the subspaces that generate the subsets of data to be grouped, are usually not well considered, although these intrinsic properties are very important for clustering applications. Some researchers explore these intrinsic properties and relations among data objects and revise the sparse representation model to capture them by introducing extra constraints, such as the low rank constraint [8], the data Laplace consistence regularization [15], and the data sequential property [16]. Among these constraints, holistic ones such as the low rank or nuclear norm $\|\cdot\|_*$ are proposed in favor of structural sparsity. The Low Rank Representation (LRR) model [8] is one of the representatives. The LRR model tries to reveal the latent sparse property embedded in a dataset in high-dimensional space. It has been proved that, when the high-dimensional dataset actually comes from a union of several low-dimensional subspaces, the LRR model can reveal this structure through subspace clustering [8].

Although most current subspace clustering methods show good performance in various applications, the similarity among data objects is measured in the original data domain. For example, the current LRR method is based on the principle of data self-representation, and the representation error is measured in terms of a Euclidean-like distance. However, this hypothesis may not always hold for many high-dimensional data in practice, where corrupted data may not reside nicely in a linear space. In fact, it has been proved that many high-dimensional data are embedded in low-dimensional manifolds. For example, human facial images are considered to be samples from a nonlinear submanifold [17]. Generally, manifolds can be considered as low-dimensional smooth "surfaces" embedded in a higher-dimensional Euclidean space. At each of its points, a manifold is locally similar to Euclidean space. To cluster such high-dimensional data effectively, it is desirable to reveal the nonlinear manifold structure underlying the data and to obtain a proper representation for the data objects.

There are two types of manifold-related learning tasks. In so-called manifold learning, one has to respect the local geometry existing in the data, but the manifold itself is unknown to the learner. In the other type of learning tasks, we clearly know


the manifolds from which the data come. For example, in image analysis, covariance matrices of features are commonly used as region descriptors. In this case, such a descriptor is a point on the manifold of symmetric positive definite matrices. More generally, in computer vision, it is common to collect data on a known manifold. For example, it is common practice to use a subspace to represent a set of images, and such a subspace is actually a point on a Grassmann manifold. Thus, an image set is regarded as a point on the known Grassmann manifold. This type of task, which incorporates manifold properties into learning, is called learning on manifolds. There are three major strategies for dealing with learning tasks on manifolds:

1. Intrinsic Strategy: The ideal but hardest strategy is to perform learning tasks on manifolds based on their intrinsic geometry. Very few existing approaches adopt this strategy.
2. Extrinsic Strategy: The second strategy is to implement a learning algorithm within the tangent spaces of manifolds, where all the linear relations can be exploited. In fact, this is a first-order approximation to the Intrinsic Strategy, and most approaches fall into this category.
3. Embedding Strategy: The third strategy is to embed a manifold into a "larger" Euclidean space by an appropriate mapping, as in kernel methods, so that any learning algorithm can be implemented in this "flattened" embedding space. For a practical learning task, however, how to incorporate the properties of the known manifold into the design of the kernel mapping is still challenging.

In this chapter, we are concerned with the points on a particular known manifold, the Grassmann manifold. We explore the LRR model for clustering a set of data points on Grassmann manifolds by adopting the third strategy above. In fact, Grassmann manifolds have the nice property that they can be embedded into the linear space of symmetric matrices [18, 19]. In this way, all the abstract points (subspaces) on Grassmann manifolds can be embedded into a Euclidean space where the classic LRR model can be applied. An LRR model can then be constructed in the embedding space, where the error measure is simply taken as the Euclidean metric in the embedding space.

With the rapid development of security technology, numerous cameras and sensors are widely deployed in public spaces, and even a single site is usually covered by several cameras. Thus, a huge amount of multi-source data is generated. For example, multiple cameras simultaneously capture the same human action from different viewpoints. Another case is that a depth camera can provide both color pictures and depth maps of the same scene. Compared with single-view data, the abundant and complementary information in multi-view data can overcome the drawbacks of view limitation and object occlusion, which brings impressive improvements in recognition and clustering tasks. Nevertheless, how to represent and fuse these multi-view data is becoming the key problem. Although there has been progress in multi-view clustering, two major deficiencies exist in these multi-view clustering algorithms:


1. These multi-view clustering methods are designed for vectorial data, which are generated from linear spaces and whose similarity is measured by the Euclidean distance. This limits the application of these classic multi-view clustering algorithms to data with manifold structure.
2. The simple and empirical fusion approaches in the current multi-view clustering methods cannot fully exploit the complementary information of multi-view data or features, so they are difficult to apply in practical scenarios.

To further fuse multi-view data of videos or images, motivated by [20], we use a product space to integrate a set of points on Grassmann manifolds, namely, Product Grassmann Manifolds (PGM). Additionally, instead of simply combining the views with equal importance as PGM does in [20], we set a self-tuning weight for each view of PGM to implement data fusion adaptively.

Benefiting from their strong capability to extract discriminative information from videos, methods for learning on Grassmann manifolds have been applied in many computer vision tasks. However, such learning algorithms, particularly on high-dimensional Grassmann manifolds, always involve a significantly high computational cost, which seriously limits the applicability of learning on Grassmann manifolds in wider areas. A dimensionality reduction method for Grassmann manifolds is therefore desirable. Locality Preserving Projections (LPP) is a commonly used dimensionality reduction method for vector-valued data that aims to preserve the local structure of the data in the dimension-reduced space. The strategy is to construct a mapping from a higher-dimensional Grassmann manifold into a relatively low-dimensional one with greater discriminative capability. Wang et al. [21] propose an unsupervised dimensionality reduction algorithm on Grassmann manifolds based on LPP.

In this chapter, we describe how to perform unsupervised learning, i.e., clustering and dimensionality reduction, on Grassmann manifolds and provide some interesting video applications in computer vision. We begin with the definition and some necessary properties of Grassmann manifolds in Sect. 7.2. Then, in Sect. 7.3, we extend classic subspace clustering models, low rank representation with embedding distance and tangent space distance, onto Grassmann manifolds. To fuse multi-camera video data, we provide the definition of Product Grassmann manifolds and also apply it to subspace clustering tasks in Sect. 7.4. Finally, we generalize the classic dimensionality reduction model, Locality Preserving Projection, to data on Grassmann manifolds in Sect. 7.5.

7.2

Low Rank Representation on Grassmann Manifolds

Low rank representation (LRR) has recently attracted great interest due to its pleasing efﬁcacy in exploring low-dimensional subspace structures embedded in data. One of its successful applications is subspace clustering, by which data are clustered according to the subspaces they belong to. In this chapter, at a higher level, we intend to cluster subspaces into classes of subspaces. This is naturally described


Fig. 7.1 (a) All Grassmannian points are mapped into the mapping space. (b) The LRR model is formulated in the mapping space, and the coefficient matrix is constrained to maintain the inner structure of the original data. (c) Clustering by NCut

as a clustering problem on Grassmann manifolds. The novelty of this idea is to generalize LRR on Euclidean space onto an LRR model on Grassmann manifolds. We consider the Gaussian noise and outlier conditions, respectively. Several clustering experiments of our proposed methods are conducted on Human Action datasets and Trafﬁc video datasets. The whole clustering procedure is illustrated in Fig. 7.1. We review some concepts about the low rank representation (LRR) model and Grassmann manifolds, which pave the way for introducing our proposed methods.

7.2.1

Low Rank Representation

Given a set of data drawn from an unknown union of subspaces, $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$, where $D$ is the data dimension, the objective of subspace clustering is to assign each data sample to its underlying subspace. The basic assumption is that the data in $X$ are drawn from a collection of $K$ subspaces $\{S_k\}_{k=1}^{K}$ of dimensionality $\{d_k\}_{k=1}^{K}$. According to the principle of self-representation of data, each data point from a dataset can be written as a linear combination of the remaining data points, i.e., $X = XZ$, where $Z \in \mathbb{R}^{N \times N}$ is the coefficient matrix of similarity. The general LRR model [8] can be formulated as the following optimization problem:


$$\min_{Z, E} \; \|E\|_F^2 + \lambda \|Z\|_*, \quad \text{s.t. } X = XZ + E \tag{7.1}$$

where $E$ is the error resulting from the self-representation. The Frobenius norm $\|\cdot\|_F^2$ can be replaced by the $\ell_{2,1}$-norm (the sum of the Euclidean norms of the columns), as done in the original LRR model, or by other meaningful norms. LRR takes a holistic view in favor of a coefficient matrix of the lowest rank, measured by the nuclear norm $\|\cdot\|_*$. The LRR model tries to reveal the latent sparse property embedded in a dataset in high-dimensional space. It has been proved that, when the high-dimensional dataset actually comes from a union of several low-dimensional subspaces, the LRR model can reveal this structure through subspace clustering [8].
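For the Frobenius-norm variant in Eq. (7.1), substituting $E = X - XZ$ gives an unconstrained problem whose minimizer can be written down from the SVD of $X$: if $X = U \Sigma V^T$, each right singular direction with $\sigma_i^2 > \lambda/2$ is kept with weight $1 - \lambda/(2\sigma_i^2)$ and the others are dropped. The sketch below illustrates this closed form on synthetic data; it is not presented in the chapter, the function names are hypothetical, and it does not cover the $\ell_{2,1}$ variant, which needs an iterative solver.

```python
import numpy as np

def lrr_frobenius(X, lam):
    """Closed-form minimizer of ||X - XZ||_F^2 + lam * ||Z||_*."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    keep = s**2 > lam / 2.0                    # directions that survive shrinkage
    V1 = Vt[keep].T                            # N x r right singular vectors
    shrink = 1.0 - lam / (2.0 * s[keep]**2)    # per-direction weight in (0, 1)
    return (V1 * shrink) @ V1.T

def lrr_objective(X, Z, lam):
    """Value of the LRR objective with Frobenius error term."""
    return np.linalg.norm(X - X @ Z, "fro")**2 + lam * np.linalg.norm(Z, "nuc")

rng = np.random.default_rng(0)
D, n_per, dim = 20, 15, 3

# Synthetic data: 15 columns from each of two random 3-dimensional subspaces.
X = np.hstack([rng.standard_normal((D, dim)) @ rng.standard_normal((dim, n_per))
               for _ in range(2)])

lam = 0.01
Z = lrr_frobenius(X, lam)

# Symmetric affinity matrix that would be passed to NCut or K-means.
affinity = np.abs(Z) + np.abs(Z.T)
cross = affinity[:n_per, n_per:].max()    # largest between-subspace affinity
within = affinity[:n_per, :n_per].max()   # largest within-subspace affinity
```

On this noise-free example the affinity matrix is close to block-diagonal: `cross` is much smaller than `within`, which is the structure that spectral clustering exploits.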

7.2.2

Grassmann Manifolds

Definition 1 (Grassmann Manifolds) [22] The Grassmann manifold, denoted by $\mathcal{G}(p, d)$, consists of all the $p$-dimensional subspaces embedded in the $d$-dimensional Euclidean space $\mathbb{R}^d$ ($0 \le p \le d$).

For example, when $p = d$, the only $p$-dimensional subspace is the Euclidean space $\mathbb{R}^d$ itself, so the Grassmann manifold reduces to a single point. When $p = 1$, the Grassmann manifold consists of all the lines passing through the origin in $\mathbb{R}^d$. As the Grassmann manifold is abstract, there are a number of ways to realize it. One convenient way is to represent the manifold by the equivalence classes of all the thin-tall orthogonal matrices under the orthogonal group $\mathcal{O}(p)$ of order $p$. Hence, we have the following matrix representation:

$$\mathcal{G}(p, d) = \{ X \in \mathbb{R}^{d \times p} : X^T X = I_p \} / \mathcal{O}(p) \tag{7.2}$$

We refer to a point on Grassmann manifolds as an equivalence class of thin-tall orthogonal matrices in $\mathbb{R}^{d \times p}$, any one of which can be converted to another by a $p \times p$ orthogonal matrix. There are two popular methods to measure distance on Grassmann manifolds. One is to define consistent metrics within the tangent space of Grassmann manifolds, where all the linear relations can be exploited.

Definition 2 (Log Mapping) [23] Given Grassmannian points $[X]$ and $[Y]$, the Log mapping on the Grassmann manifold is $\mathrm{Log}_{[X]}([Y]) = U \arctan(\Sigma) V^T$, where $U$, $\Sigma$ and $V$ are obtained by conducting the singular value decomposition (SVD)

$$U \Sigma V^T = (Y - X X^T Y)(X^T Y)^{-1} \tag{7.3}$$
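The Log mapping of Definition 2 can be sketched numerically as follows. The helper names are hypothetical, the points are random, and the code assumes that $X^T Y$ is invertible, which holds for generic pairs of subspaces. The sketch also checks a standard fact: the values $\arctan(\Sigma)$ are the principal angles between the two subspaces, which can be computed independently from the singular values of $X^T Y$.

```python
import numpy as np

def random_grassmann_point(d, p, rng):
    """Orthonormal basis of a random p-dimensional subspace of R^d."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, p)))
    return Q

def grassmann_log(X, Y):
    """Tangent vector at [X] pointing towards [Y], following Eq. (7.3)."""
    M = (Y - X @ (X.T @ Y)) @ np.linalg.inv(X.T @ Y)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    theta = np.arctan(s)
    return (U * theta) @ Vt, theta

rng = np.random.default_rng(0)
X = random_grassmann_point(10, 3, rng)
Y = random_grassmann_point(10, 3, rng)
H, theta = grassmann_log(X, Y)

# Independent computation of the principal angles between span(X) and span(Y).
cosines = np.clip(np.linalg.svd(X.T @ Y, compute_uv=False), -1.0, 1.0)
principal_angles = np.sort(np.arccos(cosines))
```

The tangent vector `H` is orthogonal to the columns of `X`, and its Frobenius norm equals the root-sum-square of the principal angles.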

In fact, this is a first-order approximation to the geodesic distance. The other is to embed the Grassmann manifold into the space of symmetric matrices, where the Euclidean metric is available. The latter is easier and more effective in practice; therefore, we use the embedding distance in this chapter.

Definition 3 (Embedding Distance) [18] Given Grassmannian points $X_1$ and $X_2$, the Grassmann manifold can be embedded into the space of symmetric matrices as follows:

$$\Pi : \mathcal{G}(p, d) \to \mathrm{sym}(d), \quad \Pi(X) = X X^T \tag{7.4}$$

and the corresponding distance on the Grassmann manifold can be defined as

$$\mathrm{dist}_g^2(X_1, X_2) = \frac{1}{2} \| \Pi(X_1) - \Pi(X_2) \|_F^2 \tag{7.5}$$
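As an illustration of Definition 3, the following minimal NumPy sketch evaluates the squared embedding distance of Eq. (7.5). The helper names `orthonormal_basis` and `embedding_dist_sq` and the random test data are assumptions made for this example only; they are not part of the original work.

```python
import numpy as np

def orthonormal_basis(A):
    """Orthonormal basis (same shape as A) for the column space of A, via reduced QR."""
    Q, _ = np.linalg.qr(A)
    return Q

def embedding_dist_sq(X1, X2):
    """Squared embedding distance, Eq. (7.5): 0.5 * ||X1 X1^T - X2 X2^T||_F^2."""
    diff = X1 @ X1.T - X2 @ X2.T
    return 0.5 * np.sum(diff ** 2)

# Hypothetical data: two random points on G(2, 6)
rng = np.random.default_rng(0)
X1 = orthonormal_basis(rng.standard_normal((6, 2)))
X2 = orthonormal_basis(rng.standard_normal((6, 2)))
R = orthonormal_basis(rng.standard_normal((2, 2)))   # a 2 x 2 orthogonal matrix

d12 = embedding_dist_sq(X1, X2)          # distance between two different subspaces
d_same = embedding_dist_sq(X1, X1 @ R)   # same subspace, different basis
```

Because $\Pi(X) = XX^T$ does not change when $X$ is multiplied on the right by a $p \times p$ orthogonal matrix, `d_same` vanishes up to rounding error, which is what makes the distance well defined on equivalence classes.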

This property has been used in clustering applications: the mean shift method was discussed on Stiefel and Grassmann manifolds in [24], and a new version of the K-means method, constructed by a statistical modeling method, was proposed to cluster Grassmannian points [25]. These works try to extend clustering methods within Euclidean space to more practical situations on nonlinear spaces. Along this direction, we further explore the subspace clustering problems on Grassmann manifolds and try to establish a novel and feasible LRR model on Grassmann manifolds.

7.2.3 LRR on Grassmann Manifolds

In the current LRR model (Eq. (7.1)), the data reconstruction error is generally computed in the original data domain. For example, a common form of the reconstruction error is the Frobenius norm, i.e., the error term can be chosen as follows:

$$\|E\|_F^2 = \|X - XZ\|_F^2 = \sum_{i=1}^{N} \Big\| x_i - \sum_{j=1}^{N} z_{ji} x_j \Big\|_F^2 \tag{7.6}$$

where the data matrix $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{D \times N}$. To extend the LRR model to manifold-valued data, two issues have to be resolved: (1) the model error should be measured in terms of manifold geometry, and (2) the linear relationship has to be reinterpreted. This is because the linear relation defined by $X = XZ + E$ in Eq. (7.1) is no longer valid on a manifold. We extend the LRR model onto Grassmann manifolds by using the embedding distance. Given a set of Grassmannian points $\{X_1, X_2, \ldots, X_N\}$ on the Grassmann manifold $\mathcal{G}(p, d)$, we mimic the classical LRR defined in Eqs. (7.1) and (7.6) as follows:

$$\min_{Z} \sum_{i=1}^{N} \Big\| X_i \ominus \Big( \biguplus_{j=1}^{N} X_j \odot z_{ji} \Big) \Big\|_{\mathcal{G}} + \lambda \|Z\|_* \tag{7.7}$$

where $\ominus$, $\biguplus$, and $\odot$ are only dummy operators to be specified soon, and $\big\| X_i \ominus \big( \biguplus_{j=1}^{N} X_j \odot z_{ji} \big) \big\|_{\mathcal{G}}$ measures the error between the point $X_i$ and its "reconstruction" $\biguplus_{j=1}^{N} X_j \odot z_{ji}$. Thus, to get an LRR model on Grassmann manifolds, we should define a proper distance and proper operators for the manifolds.

7.2.4 LRR on Grassmann Manifolds with Gaussian Noise (GLRR-F)

We include our prior work reported in the conference paper [26]. This LRR model on Grassmann manifolds, based on the error measurement defined in Eq. (7.5), is defined as follows:

$$\min_{\mathcal{E}, Z} \|\mathcal{E}\|_F^2 + \lambda \|Z\|_* \quad \text{s.t.} \quad \mathcal{X} = \mathcal{X} \times_3 Z + \mathcal{E} \tag{7.8}$$

The Frobenius norm is adopted here because of the assumption that the model fits Gaussian noise. We call this model the Frobenius norm constrained GLRR (GLRR-F). In this case, the error term in Eq. (7.8) is

$$\|\mathcal{E}\|_F^2 = \sum_{i=1}^{N} \|E(:, :, i)\|_F^2 \tag{7.9}$$

where $E(:, :, i) = X_i X_i^T - \sum_{j=1}^{N} z_{ij} (X_j X_j^T)$ is the $i$th slice of $\mathcal{E}$, which is the error between the symmetric matrix $X_i X_i^T$ and its reconstruction by the linear combination $\sum_{j=1}^{N} z_{ij} (X_j X_j^T)$.

We follow the notation used in [26]. By using variable elimination, we can convert problem Eq. (7.8) into the following problem:

$$\min_{Z} \|\mathcal{X} - \mathcal{X} \times_3 Z\|_F^2 + \lambda \|Z\|_* \tag{7.10}$$

We note that $X_j^T X_i$ has a small dimension $p \times p$, which is easy to handle. Denote

$$\Delta_{ij} = \mathrm{tr}\big[ (X_j^T X_i)(X_i^T X_j) \big] \tag{7.11}$$


and the $N \times N$ symmetric matrix

$$\Delta = [\Delta_{ij}]. \tag{7.12}$$

Then we have the following Lemma 1.

Lemma 1 Given a set of matrices $\{X_1, X_2, \ldots, X_N\}$ s.t. $X_i \in \mathbb{R}^{d \times p}$ and $X_i^T X_i = I$, if $\Delta = [\Delta_{ij}]_{i,j} \in \mathbb{R}^{N \times N}$ with element $\Delta_{ij} = \mathrm{tr}\big[ (X_j^T X_i)(X_i^T X_j) \big]$, then the matrix $\Delta$ is positive semidefinite.

Proof Please refer to [26].

From Lemma 1, we have the eigenvector decomposition for $\Delta$ defined by $\Delta = U D U^T$, where $U^T U = I$ and $D = \mathrm{diag}(\sigma_i)$ with nonnegative eigenvalues $\sigma_i$. Denote the square root of $\Delta$ by $\Delta^{1/2} = U D^{1/2} U^T$; then it is not hard to prove that problem (Eq. 7.10) is equivalent to the following problem:

$$\min_{Z} \big\| Z \Delta^{1/2} - \Delta^{1/2} \big\|_F^2 + \lambda \|Z\|_*. \tag{7.13}$$

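Finally, we have Theorem 1.

Theorem 1 Given that $\Delta = U D U^T$ as defined above, the solution to (Eq. 7.13) is given by

$$Z^* = U D_\lambda U^T$$

where $D_\lambda$ is a diagonal matrix with its $i$th element defined by

$$D_\lambda(i, i) = \begin{cases} 1 - \dfrac{\lambda}{\sigma_i}, & \text{if } \sigma_i > \lambda; \\ 0, & \text{otherwise.} \end{cases}$$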

Proof Please refer to the proof of Lemma 1 in [27].

According to Theorem 1, the main cost for solving the LRR on Grassmann manifolds problem (Eq. 7.8) is (i) the computation of the symmetric matrix $\Delta$ and (ii) an SVD of $\Delta$. This is a significant improvement over the algorithm presented in [26].
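The two cost components named above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names `delta_matrix` and `glrr_f`, the random Grassmannian points, and the value of `lam` are hypothetical, and the thresholding rule is applied exactly as stated in Theorem 1.

```python
import numpy as np

def delta_matrix(points):
    """Delta[i, j] = tr[(X_j^T X_i)(X_i^T X_j)] = ||X_i^T X_j||_F^2, Eqs. (7.11)-(7.12)."""
    N = len(points)
    Delta = np.empty((N, N))
    for i in range(N):
        for j in range(i, N):
            Delta[i, j] = Delta[j, i] = np.sum((points[i].T @ points[j]) ** 2)
    return Delta

def glrr_f(Delta, lam):
    """Z* = U D_lam U^T, with D_lam given by the thresholding rule stated in Theorem 1."""
    sigma, U = np.linalg.eigh(Delta)      # Delta is symmetric positive semidefinite (Lemma 1)
    d = np.zeros_like(sigma)
    keep = sigma > lam
    d[keep] = 1.0 - lam / sigma[keep]
    return (U * d) @ U.T                  # scales the columns of U by d, then multiplies by U^T

# Hypothetical data: five random points on G(2, 8)
rng = np.random.default_rng(0)
points = [np.linalg.qr(rng.standard_normal((8, 2)))[0] for _ in range(5)]
Delta = delta_matrix(points)
Z = glrr_f(Delta, lam=0.1)
```

Only $p \times p$ products $X_i^T X_j$ and one $N \times N$ eigendecomposition are needed, so the $d \times d$ embedded matrices $X_i X_i^T$ never have to be formed explicitly.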

7.2.5 LRR on Grassmann Manifolds with ℓ2/ℓ1 Noise (GLRR-21)

When there exist outliers in the dataset, the Gaussian noise model is no longer a favored choice. Therefore, we propose using the so-called $\|\cdot\|_{2,1}$ noise model, which is used to cope with signal-oriented gross errors in LRR clustering applications [8]. Similar to the above GLRR-F model, we formulate the $\|\cdot\|_{2,1}$ norm constrained GLRR model (GLRR-21) as follows:

$$\min_{\mathcal{E}, Z} \|\mathcal{E}\|_{2,1} + \lambda \|Z\|_* \quad \text{s.t.} \quad \mathcal{X} = \mathcal{X} \times_3 Z + \mathcal{E} \tag{7.14}$$

where the $\|\cdot\|_{2,1}$ norm of a tensor is defined as the sum of the Frobenius norms of its 3-mode slices as follows:

$$\|\mathcal{E}\|_{2,1} = \sum_{i=1}^{N} \|E(:, :, i)\|_F \tag{7.15}$$

Note that Eq. (7.15), which has no squares, is different from Eq. (7.9). Because of the $\ell_{2,1}$ norm in the error term, the objective function is convex but not differentiable. We propose using the alternating direction method (ADM) to solve this problem. Firstly, we construct the following augmented Lagrangian function:

$$L(\mathcal{E}, Z, \xi) = \|\mathcal{E}\|_{2,1} + \lambda \|Z\|_* + \langle \xi, \mathcal{X} - \mathcal{X} \times_3 Z - \mathcal{E} \rangle + \frac{\mu}{2} \|\mathcal{X} - \mathcal{X} \times_3 Z - \mathcal{E}\|_F^2 \tag{7.16}$$

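where $\langle \cdot, \cdot \rangle$ is the standard inner product of two tensors of the same order, $\xi$ is the Lagrange multiplier, and $\mu$ is the penalty parameter. Specifically, the iteration of ADM for minimizing Eq. (7.16) goes as follows:

$$\begin{aligned} \mathcal{E}^{k+1} &= \arg\min_{\mathcal{E}} L(\mathcal{E}, Z^k, \xi^k) \\ &= \arg\min_{\mathcal{E}} \|\mathcal{E}\|_{2,1} + \langle \xi^k, \mathcal{X} - \mathcal{X} \times_3 Z^k - \mathcal{E} \rangle + \frac{\mu^k}{2} \|\mathcal{X} - \mathcal{X} \times_3 Z^k - \mathcal{E}\|_F^2 \end{aligned} \tag{7.17}$$

$$\begin{aligned} Z^{k+1} &= \arg\min_{Z} L(\mathcal{E}^{k+1}, Z, \xi^k) \\ &= \arg\min_{Z} \lambda \|Z\|_* + \langle \xi^k, \mathcal{X} - \mathcal{X} \times_3 Z - \mathcal{E}^{k+1} \rangle + \frac{\mu^k}{2} \|\mathcal{X} - \mathcal{X} \times_3 Z - \mathcal{E}^{k+1}\|_F^2 \end{aligned} \tag{7.18}$$

$$\xi^{k+1} = \xi^k + \mu^k \big[ \mathcal{X} - \mathcal{X} \times_3 Z^{k+1} - \mathcal{E}^{k+1} \big] \tag{7.19}$$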

Now, we first optimize the error term $\mathcal{E}$ in formula (Eq. 7.17). Denote $\mathcal{C}^k = \mathcal{X} - \mathcal{X} \times_3 Z^k$, and for any 3-order tensor $\mathcal{A}$, we use $A(i)$ to denote the $i$th frontal slice $\mathcal{A}(:, :, i)$ along the 3-mode as a shorthand notation. Then we observe that Eq. (7.17) is separable in terms of the matrix variable $E(i)$ as follows:

$$\begin{aligned} E^{k+1}(i) &= \arg\min_{E(i)} \|E(i)\|_F + \langle \xi^k(i), C^k(i) - E(i) \rangle + \frac{\mu^k}{2} \|C^k(i) - E(i)\|_F^2 \\ &= \arg\min_{E(i)} \|E(i)\|_F + \frac{\mu^k}{2} \Big\| C^k(i) - E(i) + \frac{1}{\mu^k} \xi^k(i) \Big\|_F^2 \end{aligned} \tag{7.20}$$

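From [8] we know that the problem in Eq. (7.20) has a closed-form solution, given by

$$E^{k+1}(i) = \begin{cases} 0, & \text{if } M < \dfrac{1}{\mu^k}; \\ \Big( 1 - \dfrac{1}{M \mu^k} \Big) \Big( C^k(i) + \dfrac{1}{\mu^k} \xi^k(i) \Big), & \text{otherwise} \end{cases} \tag{7.21}$$

where $M = \big\| C^k(i) + \frac{1}{\mu^k} \xi^k(i) \big\|_F$.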
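The closed-form update of Eq. (7.21) acts on each frontal slice independently and can be sketched as follows. The function name `shrink_slices`, the array layout `(d, d, N)`, and the toy input are assumptions chosen for illustration.

```python
import numpy as np

def shrink_slices(C, xi, mu):
    """Slice-wise solution (7.21) of the E-subproblem (7.20).

    C, xi: arrays of shape (d, d, N) holding C^k and xi^k; mu: positive penalty parameter.
    """
    B = C + xi / mu
    E = np.zeros_like(B)
    for i in range(B.shape[2]):
        M = np.linalg.norm(B[:, :, i])           # Frobenius norm of the i-th slice
        if M >= 1.0 / mu:                        # otherwise the slice stays zero
            E[:, :, i] = (1.0 - 1.0 / (M * mu)) * B[:, :, i]
    return E

# Hypothetical input with one "small" and one "large" slice
C = np.zeros((2, 2, 2))
C[:, :, 0] = 0.1 * np.eye(2)     # Frobenius norm below 1/mu, so the slice is set to zero
C[:, :, 1] = 3.0 * np.eye(2)     # Frobenius norm above 1/mu, so the slice is shrunk
E = shrink_slices(C, np.zeros_like(C), mu=1.0)
```

Slices with a small norm are removed entirely while large slices are only scaled down, which is how the $\ell_{2,1}$ model isolates a few grossly corrupted samples.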

As for the other two variables $Z$ and $\xi$, they can be obtained after several algebraic calculations [28].

7.2.6 Representing Image Sets and Videos on the Grassmann Manifolds

Our strategy is to represent image sets or videos as subspace objects, i.e., points on the Grassmann manifolds. For each image set or video clip, we formulate the subspace by finding a representative basis from the matrix of the raw features of the image set or video with SVD (Singular Value Decomposition), as done in [18]. Concretely, let $\{Y_i\}_{i=1}^{M}$ be an image set, where $Y_i$ is a gray-scale image with dimension $m \times n$ and $M$ is the number of all the images. For example, each $Y_i$ in this set can be a handwritten digit 9 from the same person. We construct a matrix $\Gamma = [\mathrm{vec}(Y_1), \mathrm{vec}(Y_2), \ldots, \mathrm{vec}(Y_M)]$ of size $(m \times n) \times M$ by vectorizing the raw image data $Y_i$. Then $\Gamma$ is decomposed by SVD as $\Gamma = U \Sigma V^T$. We pick the first $p$ columns of $U$ as the Grassmannian point $[X] = [U(:, 1:p)] \in \mathcal{G}(p, m \times n)$ to represent the image set $\{Y_i\}_{i=1}^{M}$.
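The construction above can be sketched as follows. The function name `image_set_to_grassmann`, the array layout `(M, m, n)`, and the synthetic images are assumptions made for illustration only.

```python
import numpy as np

def image_set_to_grassmann(images, p):
    """Represent an image set as a point on G(p, m*n).

    images: array of shape (M, m, n); returns X of shape (m*n, p) with X^T X = I_p.
    Requires p <= min(m*n, M).
    """
    M = images.shape[0]
    # Columns of Gamma are the vectorized images. Row-major flattening is used here;
    # any fixed pixel ordering works as long as it is the same for every image set.
    Gamma = images.reshape(M, -1).T
    U, _, _ = np.linalg.svd(Gamma, full_matrices=False)
    return U[:, :p]                       # first p left singular vectors

# Hypothetical data: M = 10 synthetic 4 x 5 "images"
rng = np.random.default_rng(0)
images = rng.random((10, 4, 5))
X = image_set_to_grassmann(images, p=3)
```

The returned matrix has orthonormal columns, so it can be used directly as a representative in the matrix representation of Eq. (7.2).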

7.2.7 Examples on Video Datasets

The Grassmann manifold is a good tool for representing image sets, so it is appropriate to use it to represent video sequence data, which can be regarded as an image set. In this experiment, we first select two challenging action video datasets, the Ballet action video dataset and the SKIG action video dataset (shown in Fig. 7.2), to test the proposed methods' clustering performance. The Ballet dataset, which contains simple backgrounds, can verify the capacity of the proposed method for action recognition in an ideal condition, while SKIG, which has more variations in background and illumination, examines the robustness of the proposed method to noise. Table 7.1 presents the clustering performance of all the methods on the two action video datasets, Ballet and SKIG.

Fig. 7.2 Some samples from the Ballet dataset and the SKIG dataset. The top two rows are from the Ballet dataset and the bottom two rows are from the SKIG dataset

Table 7.1 The clustering accuracy (%) of different methods on the Ballet and SKIG datasets

Method        Ballet   SKIG
FGLRR         57.27    51.85
GLRR-21       62.53    53.33
LRR [8]       28.95    –
SSC [2]       30.89    –
SCGSM [25]    54.29    46.67
SMCE [29]     56.16    41.30
LS3C [30]     22.78    37.22

As the ballet images do not have a very complex background or other obvious disturbances, they can be regarded as clean data without noise. Additionally, the images in each image set have time-sequential relations and each action consists of several simple actions. These properties help to improve the performance of the evaluated methods, as shown in Table 7.1. Our proposed methods are clearly superior to the other methods. The SKIG dataset is more challenging than the Ballet dataset, due to the smaller scale of the objects and the varying backgrounds and illumination. From the experimental results in Table 7.1, we can conclude that the proposed methods are still superior to the other methods.

The above two human action datasets are considered as having relatively simple scenes with limited backgrounds. To demonstrate the robustness of our algorithms to complex backgrounds, we further test our methods on two practical applications with more complex conditions: the Highway Traffic dataset and a Road Traffic dataset collected by us from a road detector (shown in Fig. 7.3).

Fig. 7.3 Some samples from the Highway Traffic dataset (first three rows) and the Road Traffic dataset (last three rows)

Table 7.2 The clustering accuracy (%) of different methods on the Highway Traffic and Road Traffic datasets

Method        Highway Traffic   Road Traffic
FGLRR         80.63             67.78
GLRR-21       82.21             67.78
LRR [8]       68.38             49.11
SSC [2]       62.85             66.78
SCGSM [25]    60.87             47.67
SMCE [29]     51.38             66.56
LS3C [30]     65.61             44.33

Table 7.2 presents the clustering performance of all the methods on the Highway Traffic dataset. Because the number of traffic levels is only 3, all experimental results seem meaningful, and in the worst case the accuracy is 0.5138. GLRR-F's accuracy is 0.8063, which is at least 15% higher than that of the other methods. Although the environment in the Road Traffic database is more complex than that in the Highway Traffic database, the accuracy of our methods is higher than that of the other methods. The experiment on this dataset shows that the Grassmann manifold based methods are more appropriate than other methods for this type of data.

7.3 Improved Models on Manifolds

Our proposed models, LRR on Grassmann manifolds, contain two main parts: LRR and Grassmann manifolds with embedding distance. Therefore, we further improve our proposed methods on these two parts.

7.3.1 An Improved LRR on Grassmann Manifolds

Wang et al. [31] propose an improved LRR model for manifold-valued Grassmannian data, which incorporates prior knowledge by minimizing the partial sum of singular values instead of the nuclear norm, namely, Partial Sum minimization of Singular Values Representation (PSSVR). The new model not only enforces the global structure of data in low rank, but also retains important information by minimizing only the smaller singular values. To further maintain the local structures among Grassmannian points, we also integrate the Laplacian penalty with GPSSVR:

$$\min_{\mathcal{E}, Z} \|\mathcal{E}\|_{\mathcal{G}}^2 + \lambda \|Z\|_{>r} + 2\beta \sum_{i,j} \|z_i - z_j\|_2^2 \, w_{ij}, \quad \text{s.t.} \quad \mathcal{X} = \mathcal{X} \times_3 Z + \mathcal{E} \tag{7.22}$$

where the PSSV norm is $\|A\|_{>r} = \sum_{i=r+1}^{\min(d,m)} \sigma_i(A)$ for $A \in \mathbb{R}^{d \times m}$, and $\sigma_i(A)$ represents the $i$th singular value of the matrix $A$. Here $r$ is the expected rank of the matrix $A$, which may be derived from the prior knowledge of a defined problem [32]. $w_{ij}$ is the local similarity between Grassmannian points $X_i$ and $X_j$. There are many ways to define the $w_{ij}$'s. We simply use the explicit neighborhood determined by the manifold distance measure to define all $w_{ij}$. Let $C$ be a parameter of neighborhood size; we define $w_{ij} = d_g(X_i, X_j)$ if $X_i \in \mathcal{N}_C(X_j)$, and $w_{ij} = 0$ otherwise, where $\mathcal{N}_C(X_j)$ denotes the $C$ nearest elements of $X_j$ on the Grassmann manifold.


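By introducing the Laplacian matrix $L$, problem Eq. (7.22) can be easily rewritten in its Laplacian form:

$$\min_{\mathcal{E}, Z} \|\mathcal{E}\|_{\mathcal{G}}^2 + \lambda \|Z\|_{>r} + 2\beta \, \mathrm{tr}(Z L Z^T), \quad \text{s.t.} \quad \mathcal{X} = \mathcal{X} \times_3 Z + \mathcal{E} \tag{7.23}$$

where the Laplacian matrix $L \in \mathbb{R}^{m \times m}$ is defined as $L = D - W$, with $W = [w_{ij}]_{i,j=1}^{m}$ and $D = \mathrm{diag}(d_{ii})$ with $d_{ii} = \sum_j w_{ij}$.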
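The two ingredients of this model, the PSSV norm and the graph Laplacian, can be illustrated with a small NumPy sketch. The function names and the random symmetric similarity matrix are assumptions for this example, and $z_i$ is taken to be the $i$th column of $Z$. The check at the end compares the pairwise penalty with the trace form numerically: for a symmetric $W$ the two agree up to a constant factor of 2, which can be absorbed into $\beta$.

```python
import numpy as np

def pssv_norm(A, r):
    """Partial sum of singular values: sum of sigma_i(A) for i > r."""
    s = np.linalg.svd(A, compute_uv=False)     # singular values in descending order
    return s[r:].sum()

def graph_laplacian(W):
    """L = D - W with D = diag(d_ii), d_ii = sum_j w_ij."""
    return np.diag(W.sum(axis=1)) - W

# Hypothetical symmetric similarity matrix and coefficient matrix
rng = np.random.default_rng(0)
m = 5
W = rng.random((m, m))
W = 0.5 * (W + W.T)
np.fill_diagonal(W, 0.0)
Z = rng.standard_normal((m, m))
L = graph_laplacian(W)

# Pairwise form, with z_i taken as the i-th column of Z
pairwise = sum(W[i, j] * np.sum((Z[:, i] - Z[:, j]) ** 2)
               for i in range(m) for j in range(m))
trace_form = 2.0 * np.trace(Z @ L @ Z.T)       # equals `pairwise` for symmetric W
```

Setting `r = 0` in `pssv_norm` recovers the nuclear norm, so the PSSV penalty can be read as a nuclear norm that leaves the `r` largest singular values unpenalized.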

7.3.2 LRR on Grassmann Manifolds with Tangent Space Distance

Xie et al. [33] approximately define a linear combination on the manifolds to achieve dictionary learning over Riemannian manifolds. It is also pointed out in [33] that the affine constraints $\sum_{j=1}^{N} w_{ij} = 1$ ($i = 1, 2, \ldots, N$) can preserve the coordinate independence on manifolds. Following this motivation, Wang et al. [23] formulate the LRR on Grassmann manifolds with tangent space distance,

$$\min_{Z} \frac{1}{2} \sum_{i=1}^{N} \Big\| \sum_{j=1}^{N} z_{ij} \, \mathrm{Log}_{X_i}(X_j) \Big\|_{X_i}^2 + \lambda \|Z\|_* \quad \text{s.t.} \quad \sum_{j=1}^{N} z_{ij} = 1, \ i = 1, 2, \ldots, N, \ \text{and} \ P_\Omega(Z) = 0 \tag{7.24}$$

where $P_\Omega(Z) = 0$ is a constraint to preserve the formal local properties of the coefficients, so the coefficient matrix $Z$ is sparse. Here $P_\Omega(Z) = 0$ is implemented by a projection operator $P_\Omega$ over the entries of $Z$ defined as follows:

$$P_\Omega(w_{i,j}) = 0, \ \text{if } (i, j) \in \Omega; \ \text{otherwise} \tag{7.25}$$

where the index set $\Omega$ is defined as $\Omega = \{ (i, j) : j = i \ \text{or} \ X_j \notin \mathcal{N}_i \}$. To preserve the local properties of the coefficients, there are a number of ways to predefine the index set $\Omega$. For example, we can use a threshold over the Euclidean distance between $X_i$ and $X_j$ in the ambient space of the manifold, or we may use a threshold over the geodesic distance between $X_i$ and $X_j$. In this chapter, we adopt the KNN (K-nearest neighbor) strategy to choose the $C$ closest neighbors under geodesic distance. Thus, the neighborhood size $C$ is a tunable parameter in our method.

7.4 Weighted Product Grassmann Manifolds and Its LRR Application

Compared with conventional single-view clustering, multi-view clustering, which combines various views carrying different partial information, has achieved better performance and attracted more and more attention in recent years. However, the existing multi-view clustering methods face two main challenges: (1) Most multi-view clustering methods are designed for vectorial data from linear spaces and are thus not suitable for high-dimensional data with intrinsic nonlinear manifold structure, e.g., videos. (2) Simple, empirical fusion approaches cannot fully exploit the complementary information of multi-view data or features, which makes them difficult to apply in practical scenarios such as the one shown in Fig. 7.4. To address these problems, we propose a novel multiple-manifold-based multi-view subspace clustering method, in which the multi-view data are represented as Product Grassmann manifolds and an adaptive fusion strategy is designed to weight the importance of different views automatically. Moreover, the low rank representation in Euclidean space is extended onto the product manifold space to obtain an affinity matrix for clustering. The experimental results show that our method clearly outperforms other state-of-the-art clustering methods.

7.4.1 Weighted Product Grassmann Manifolds

The Product Grassmann manifold (PGM) [20] is defined as a space of the product of multiple Grassmann manifolds, denoted by $\mathcal{PG}_{d:p_1, \ldots, p_M}$. For a given set of natural numbers $\{p_1, \ldots, p_M\}$, we define the PGM $\mathcal{PG}_{d:p_1, \ldots, p_M}$ as the space $\mathcal{G}(p_1, d) \times \cdots \times \mathcal{G}(p_M, d)$. A PGM point can be represented as a collection of Grassmannian points, denoted by $[X] = \{X^1, \ldots, X^M\}$ such that $X^m \in \mathcal{G}(p_m, d)$, $m = 1, \ldots, M$. Following this idea, Wang et al. [20] give the sum of Grassmann distances as the distance on PGM,

$$d_{PG}^2([X], [Y]) = \sum_{m=1}^{M} d_g^2(X^m, Y^m), \tag{7.26}$$

i.e.,

$$d_{PG}^2([X], [Y]) = \sum_{m=1}^{M} \frac{1}{2} \big\| X^m (X^m)^T - Y^m (Y^m)^T \big\|_F^2. \tag{7.27}$$

Fig. 7.4 Assuming one object or action is collected by several cameras (i.e., C1, C2, C3, and C4 shown above) in different angles, each camera can only capture partial information of the object

However, the distance definition of PGM in Eq. (7.27) assigns an equal weight to each view of Grassmann manifolds. In practice, we need to employ different weights to measure the importance of each view according to its individual role in the application. Thus, we formulate the distance of Weighted Product Grassmann Manifolds (WPGM) in a weighted form as follows:

$$d_{WPG}^2([X], [Y]) = \sum_{m=1}^{M} w_m \, d_g^2(X^m, Y^m), \tag{7.28}$$

where $w = [w_1, \ldots, w_M]$, and $w_m$ is the weight for the $m$th view of the points $X^m$, $Y^m$ on WPGM.
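The weighted distance of Eq. (7.28) can be sketched as below. The function names, the two-view random data, and the weight values are assumptions chosen purely for illustration.

```python
import numpy as np

def grassmann_dist_sq(X, Y):
    """Squared embedding distance of Eq. (7.5) for a single view."""
    diff = X @ X.T - Y @ Y.T
    return 0.5 * np.sum(diff ** 2)

def wpgm_dist_sq(Xs, Ys, w):
    """Eq. (7.28): sum_m w_m * d_g^2(X^m, Y^m); Xs and Ys are lists of M orthonormal bases."""
    return sum(wm * grassmann_dist_sq(X, Y) for wm, X, Y in zip(w, Xs, Ys))

# Hypothetical two-view points with subspace dimensions p_1 = 2 and p_2 = 3 in R^7
rng = np.random.default_rng(0)
dims = [2, 3]
Xs = [np.linalg.qr(rng.standard_normal((7, p)))[0] for p in dims]
Ys = [np.linalg.qr(rng.standard_normal((7, p)))[0] for p in dims]

d_equal = wpgm_dist_sq(Xs, Ys, [1.0, 1.0])      # reduces to the PGM distance of Eq. (7.26)
d_weighted = wpgm_dist_sq(Xs, Ys, [0.8, 0.2])   # emphasizes the first view
```

With all weights equal to one, the function returns the unweighted PGM distance, so Eq. (7.26) is a special case of Eq. (7.28).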

7.4.2 LRR on Weighted Product Grassmann Manifolds

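To generalize the classic LRR model (Eq. 7.1) onto WPGM and implement clustering on a set of WPGM points $\mathcal{X} = \{[X_1], [X_2], \ldots, [X_N]\}$, we construct the LRR on WPGM as follows:

$$\min_{\mathcal{E}, Z, w} \sum_{m=1}^{M} w_m \|\mathcal{E}^m\|_{\mathcal{G}}^2 + \lambda \|Z\|_*, \quad \text{s.t.} \quad \mathcal{X} = \mathcal{X} \times_4 Z + \mathcal{E} \tag{7.29}$$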

where $\mathcal{E}^m$ denotes the $m$th view of the reconstruction error $\mathcal{E}$ and $\|\cdot\|_{\mathcal{G}}$ represents the Grassmann distance. $Z \in \mathbb{R}^{N \times N}$ is the low rank representation coefficient matrix, which shares the same pattern across different modalities. $\mathcal{X} = \{[X_1], [X_2], \ldots, [X_N]\}$ is a 4th-order tensor such that the 4th-order slices are the 3rd-order tensors $[X_i]$, and each $[X_i]$ is constructed by stacking the symmetrically mapped matrices along the third mode. Its mathematical representation is given by

$$[X_i] = \big\{ X_i^1 (X_i^1)^T, X_i^2 (X_i^2)^T, \ldots, X_i^M (X_i^M)^T \big\}, \quad X_i^m (X_i^m)^T \in \mathrm{Sym}(d),$$

and $\times_4$ means the mode-4 multiplication of a tensor and a vector (and/or a matrix) [34].

The formula (Eq. 7.29) gives the error term of each view $\mathcal{E}^m$ a weight $w_m$ to learn one shared similarity matrix $Z$ which is consistent across views. We can search for those weights in a large range manually. Although this scheme with parameters often performs better than a parameter-free scheme, more parameters make the algorithm harder to use in practical applications. Therefore, motivated by [35], instead of solving the problem in Eq. (7.29) directly, we consider the following problem and attempt to learn the weight $w_m$ from the $m$th view of the multi-view data:

$$\min_{\mathcal{E}, Z} \sum_{m=1}^{M} \|\mathcal{E}^m\|_{\mathcal{G}} + \lambda \|Z\|_*, \quad \text{s.t.} \quad \mathcal{X} = \mathcal{X} \times_4 Z + \mathcal{E} \tag{7.30}$$

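where each view shares the same similarity matrix $Z$ and no weight factor is defined. We treat it as an unconstrained problem, and thus it can be written as a Lagrange function:

$$\min_{\mathcal{E}, Z} \sum_{m=1}^{M} \|\mathcal{E}^m\|_{\mathcal{G}} + \lambda \|Z\|_* + C(\Lambda, Z) \tag{7.31}$$

where $C(\Lambda, Z)$ is the formalized term derived from the constraints and $\Lambda$ is the Lagrange multiplier. Taking the derivative of formula Eq. (7.31) w.r.t. $Z$ and setting the derivative to zero, we have

$$\sum_{m=1}^{M} w_m \frac{\partial \|\mathcal{E}^m\|_{\mathcal{G}}^2}{\partial Z} + \frac{\partial \lambda \|Z\|_*}{\partial Z} + \frac{\partial C(\Lambda, Z)}{\partial Z} = 0 \tag{7.32}$$

where the weight $w_m$ is given by

$$w_m = \frac{1}{2 \|\mathcal{E}^m\|_{\mathcal{G}}} \tag{7.33}$$

which follows from the chain rule $\partial \|\mathcal{E}^m\|_{\mathcal{G}} / \partial Z = \frac{1}{2 \|\mathcal{E}^m\|_{\mathcal{G}}} \, \partial \|\mathcal{E}^m\|_{\mathcal{G}}^2 / \partial Z$.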

The $m$th weight $w_m$ depends on the variable $\mathcal{E}$ (or $Z$); thus, the two factors of the first and the second term in formula Eq. (7.32) are coupled with each other. Additionally, if we hold $w_m$ fixed, formula Eq. (7.32) can be considered as the solution to formula Eq. (7.29). Therefore, we finally obtain the weighted LRR on WPGM as follows:

$$\min_{\mathcal{E}, Z} \sum_{m=1}^{M} w_m \|\mathcal{E}^m\|_{\mathcal{G}}^2 + \lambda \|Z\|_*, \quad \text{s.t.} \quad \mathcal{X} = \mathcal{X} \times_4 Z + \mathcal{E} \tag{7.34}$$

where the weight $w_m$ can be adaptively tuned by Eq. (7.33) to fuse the multiple manifolds. It is called LRR on Weighted Product Grassmann Manifolds (WPGLRR). It can be solved simply by alternately optimizing $\mathcal{E}$ and $Z$ in an iterative manner.

7.4.3 Optimization

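We first consider each slice $E_i^m$ of the reconstruction error $\mathcal{E}^m$, so that the first term in Eq. (7.34) can be rewritten in the following form:

$$\sum_{m=1}^{M} w_m \|\mathcal{E}^m\|_{\mathcal{G}}^2 = \sum_{m=1}^{M} w_m \sum_{i=1}^{N} \|E_i^m\|_{\mathcal{G}}^2 = \sum_{m=1}^{M} w_m \sum_{i=1}^{N} \Big\| X_i^m (X_i^m)^T - \sum_{j=1}^{N} z_{ij} X_j^m (X_j^m)^T \Big\|_F^2 \tag{7.35}$$

To simplify the expression of $\|\mathcal{E}^m\|_{\mathcal{G}}^2$, we note the matrix property $\|A\|_F^2 = \mathrm{tr}(A^T A)$ and denote

$$\Delta_{ij}^m = \mathrm{tr}\big[ \big( (X_j^m)^T X_i^m \big) \big( (X_i^m)^T X_j^m \big) \big] \tag{7.36}$$

Clearly $\Delta_{ij}^m = \Delta_{ji}^m$, and we define $M$ symmetric matrices of size $N \times N$:

$$\Delta^m = \big[ \Delta_{ij}^m \big]_{i,j=1}^{N}, \quad m = 1, 2, \ldots, M. \tag{7.37}$$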

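With some algebraic manipulation, it is not hard to simplify the reconstruction error as

$$\sum_{m=1}^{M} w_m \|\mathcal{E}^m\|_{\mathcal{G}}^2 = \sum_{m=1}^{M} \big[ -2 w_m \, \mathrm{tr}(Z \Delta^m) + w_m \, \mathrm{tr}(Z \Delta^m Z^T) \big] + \mathrm{const} \tag{7.38}$$

After variable elimination, the objective function of WPGLRR can be converted into

$$\min_{Z} \sum_{m=1}^{M} \big[ -2 w_m \, \mathrm{tr}(Z \Delta^m) + w_m \, \mathrm{tr}(Z \Delta^m Z^T) \big] + \lambda \|Z\|_* \tag{7.39}$$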

To tackle this problem, we employ the alternating direction method (ADM) [28, 36], which is widely used to solve unconstrained convex problems [8, 37]. Firstly, by introducing an auxiliary variable $J \in \mathbb{R}^{N \times N}$, the proposed WPGLRR model can be reformulated as a linear equality-constrained problem with two variables $J$ and $Z$:

$$\min_{Z, J} \sum_{m=1}^{M} \big[ -2 w_m \, \mathrm{tr}(Z \Delta^m) + w_m \, \mathrm{tr}(Z \Delta^m Z^T) \big] + \lambda \|J\|_* \quad \text{s.t.} \quad J = Z \tag{7.40}$$

Thus, the ADM method can be applied to absorb the linear constraint into the objective function as follows:

$$f(Z, J, Y, \mu) = \sum_{m=1}^{M} \Big[ -\frac{2 w_m}{\lambda} \mathrm{tr}(Z \Delta^m) + \frac{w_m}{\lambda} \mathrm{tr}(Z \Delta^m Z^T) \Big] + \|J\|_* + \langle Y, Z - J \rangle + \frac{\mu}{2} \|Z - J\|_F^2 \tag{7.41}$$

where the matrix $Y$ is the Lagrangian multiplier and $\mu$ is a weight to tune the error term $\|Z - J\|_F^2$. The ALM formula Eq. (7.41) can be naturally solved by alternately solving for $Z$, $J$, $Y$, and $\mu$ in an iterative procedure.
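One possible way to organize this alternating procedure is sketched below. It is an assumption-laden illustration rather than the authors' implementation: the function names, the iteration counts, the penalty schedule (`mu`, `rho`, `mu_max`), the restart of the inner loop after each weight update, and the use of the expansion in Eq. (7.38) with constant $\mathrm{tr}(\Delta^m)$ to evaluate the view errors are all choices made here. The $Z$-step solves the stationarity condition of Eq. (7.41), the $J$-step applies singular value thresholding, and the weights follow Eq. (7.33).

```python
import numpy as np

def svt(A, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def view_errors(Z, Deltas):
    """Per-view squared error tr(Delta^m) - 2 tr(Z Delta^m) + tr(Z Delta^m Z^T)."""
    return np.array([np.trace(D) - 2.0 * np.trace(Z @ D) + np.trace(Z @ D @ Z.T)
                     for D in Deltas])

def wpglrr(Deltas, lam=1.0, outer=5, inner=100, mu=1e-2, rho=1.2, mu_max=1e6):
    """Sketch of ADM for Eq. (7.41) with adaptive weights; all defaults are assumptions."""
    N = Deltas[0].shape[0]
    w = np.ones(len(Deltas))
    Z = np.zeros((N, N))
    J = np.zeros((N, N))
    for _ in range(outer):
        Dw = sum(wm * D for wm, D in zip(w, Deltas))    # weighted sum of the Delta^m
        J = np.zeros((N, N))
        Y = np.zeros((N, N))
        mu_k = mu
        for _ in range(inner):
            # Z-step: Z (2 Dw / lam + mu I) = 2 Dw / lam - Y + mu J
            lhs = 2.0 * Dw / lam + mu_k * np.eye(N)
            rhs = 2.0 * Dw / lam - Y + mu_k * J
            Z = np.linalg.solve(lhs.T, rhs.T).T
            # J-step: minimize ||J||_* + (mu/2) ||J - (Z + Y/mu)||_F^2
            J = svt(Z + Y / mu_k, 1.0 / mu_k)
            # Multiplier and penalty updates
            Y = Y + mu_k * (Z - J)
            mu_k = min(rho * mu_k, mu_max)
        err = np.sqrt(np.maximum(view_errors(Z, Deltas), 1e-12))
        w = 1.0 / (2.0 * err)                           # Eq. (7.33)
    return Z, J, w

# Hypothetical data: M = 2 views, N = 6 points on G(2, 10) per view
rng = np.random.default_rng(0)
views = [[np.linalg.qr(rng.standard_normal((10, 2)))[0] for _ in range(6)] for _ in range(2)]
Deltas = [np.array([[np.sum((Xi.T @ Xj) ** 2) for Xj in pts] for Xi in pts]) for pts in views]
Z, J, w = wpglrr(Deltas, lam=1.0)
```

As in Sect. 7.2.4, only the small $N \times N$ matrices $\Delta^m$ enter the iteration, so the cost per step does not depend on the ambient dimension $d$.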

7.4.4 Experimental Results

In this section, we evaluate the performance of our proposed clustering approaches on a multi-camera individual action video dataset, the ACT42 action dataset (shown in Fig. 7.5). The ACT42 action dataset is collected under a relatively pure background condition with four cameras in different viewpoints, and it contains 14 kinds of actions. This dataset can be regarded as a clean dataset without noise because of the controlled internal settings. In addition, each action is recorded by four cameras at the same time and each camera has a clear view, which helps improve the performance of the evaluated methods.

Fig. 7.5 Some samples from the ACT42 dataset. Each row presents a video sequence from one camera. Four cameras record the same action simultaneously

Table 7.3 The clustering accuracy (%) of different methods on a multi-camera video dataset

Evaluation            ACC (%)   NMI (%)
FGLRR-1 [26]          60.20     51.77
FGLRR-2               62.59     50.51
FGLRR-3               55.44     44.38
FGLRR-4               55.78     51.23
MLAN+LBP-TOP [38]     20.40     2.30
PGLRR [20]            75.17     69.14
KPGLRR [20]           70.41     66.09
WPGLRR                77.21     71.12

The overall clustering results are presented in Table 7.3. The bold number in each column represents the best result for the corresponding standard measurement. The classic multi-view clustering algorithm MLAN, applied to LBP-TOP video features, has unsatisfactory performance, while our proposed method WPGLRR achieves a notable performance improvement. We attribute this to the strong representation ability of Grassmann manifolds for videos. We execute subspace clustering on Grassmann manifolds, FGLRR (discussed in Sect. 7.2), on each view of PGM, respectively. The experimental results of our proposed method are clearly superior to those of FGLRR-1/2/3/4, which reflects the advantages of product manifolds. Compared with the clustering methods on PGM, our proposed method maintains comparable experimental results, which demonstrates that adaptive fusion of PGM can extract more discriminative features than sharing the same weight for each view of PGM.

7.5 Dimensionality Reduction for Grassmann Manifolds

Benefiting from their strong capability to extract discriminative information from videos, learning methods on Grassmann manifolds have been applied in many computer vision tasks. However, such learning algorithms, particularly on high-dimensional Grassmann manifolds, always involve significantly high computational cost, which seriously limits the applicability of learning on Grassmann manifolds in wider areas. It is therefore desirable to design a dimensionality reduction method for Grassmann manifolds. Locality Preserving Projections (LPP) is a commonly used dimensionality reduction method for vector-valued data, aiming to preserve the local structure of data in the dimension-reduced space. The strategy is to construct a mapping from a higher-dimensional Grassmann manifold into a relatively low-dimensional one with more discriminative capability. In this research, Wang et al. [21] propose an unsupervised dimensionality reduction algorithm on Grassmann manifolds based on LPP. The performance of our proposed method is assessed on several classification and clustering tasks, and the experimental results show its clear advantages over other Grassmann manifold based algorithms (Fig. 7.6).

Fig. 7.6 Conceptual illustration of the proposed unsupervised DR on Grassmann manifolds. The projected Grassmannian points still preserve the local structure of original high-dimensional Grassmann manifolds

7.5.1 Locality Preserving Projection

LPP uses a penalty regularization to preserve the local structure of data in the new projected space.

Definition 4 (Locality Preserving Projections) [39] Let $X = [x_1, \ldots, x_N] \in \mathbb{R}^{D \times N}$ be the data matrix, with $N$ the number of samples and $D$ the dimension of data. Given a local similarity $W = [w_{ij}]$ among data $X$, LPP seeks the projection vector $a$ such that the projected value $y_i = a^T x_i$ ($i = 1, \ldots, N$) fulfills the following objective:

$$\min_{a} \sum_{i,j=1}^{N} (a^T x_i - a^T x_j)^2 \, w_{ij} = 2 \, a^T X L X^T a \tag{7.42}$$

with the constraint condition

$$y D y^T = a^T X D X^T a = 1 \tag{7.43}$$

where $y = [y_1, \ldots, y_N]$, $L = D - W$ is the graph Laplacian matrix and $D = \mathrm{diag}[d_{ii}]$ with $d_{ii} = \sum_{j=1}^{N} w_{ij}$. A possible definition of $W$ is suggested as follows:

$$w_{ij} = \exp\Big( -\frac{\|x_i - x_j\|^2}{t} \Big), \quad \text{if } x_i \in \mathcal{N}(x_j) \text{ or } x_j \in \mathcal{N}(x_i) \tag{7.44}$$

where $t \in \mathbb{R}^+$ and $\mathcal{N}(x_i)$ denotes the $k$ nearest neighbors of $x_i$. With the help of $W$, minimizing the LPP objective function Eq. (7.42) ensures that if $x_i$ and $x_j$ are similar to each other, then the projected values $y_i = a^T x_i$ and $y_j = a^T x_j$ are also similar. To reduce the dimensionality of each data point $x_i$ from $D$ to $d$, we can seek $d$ projection vectors $a$.

7.5.2 LPP for Grassmann Manifolds

In [21], we propose an unsupervised dimensionality reduction method for Grassmann manifolds that maps a high-dimensional Grassmannian point $X_i \in \mathcal{G}(p, D)$ to a point in a relatively low-dimensional Grassmann manifold $\mathcal{G}(p, d)$, $D > d$. The mapping $\mathcal{G}(p, D) \to \mathcal{G}(p, d)$ to be learned is defined as

$$Y_i = A^T X_i \tag{7.45}$$

where $A \in \mathbb{R}^{D \times d}$. To make sure that $Y_i \in \mathbb{R}^{d \times p}$ is well defined as the representative of the mapped Grassmannian point on the lower-dimensional manifold, we need to impose some conditions. In general, the projected data $Y_i$ is not an orthogonal matrix, which disqualifies it as a representative of a Grassmannian point. To solve this problem, we perform QR decomposition on the matrix $Y_i$ as follows [40]:

$$Y_i = A^T X_i = Q_i R_i \ \Rightarrow \ Q_i = A^T (X_i R_i^{-1}) = A^T \tilde{X}_i \tag{7.46}$$

where $Q_i \in \mathbb{R}^{d \times p}$ is an orthogonal matrix, $R_i \in \mathbb{R}^{p \times p}$ is an invertible upper triangular matrix, and $\tilde{X}_i = X_i R_i^{-1} \in \mathbb{R}^{D \times p}$ denotes the normalized $X_i$. As both $Y_i$ and $Q_i$ generate the same (column) subspace, the orthogonal matrix $Q_i$ (or $A^T \tilde{X}_i$) can be used as the representative of the low-dimensional Grassmannian point mapped from $X_i$.
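The normalization step of Eq. (7.46) can be sketched in NumPy as follows. The function name `project_and_normalize` and the random test matrices are assumptions made for illustration; the sketch also assumes that $A^T X$ has full column rank, so that $R$ is invertible.

```python
import numpy as np

def project_and_normalize(A, X):
    """Map X in G(p, D) to Q in G(p, d) as in Eq. (7.46).

    Returns Q (d x p, orthonormal columns) and X_tilde = X R^{-1} (D x p).
    Assumes A^T X has full column rank, so that R is invertible.
    """
    Y = A.T @ X                              # d x p, generally not orthonormal
    Q, R = np.linalg.qr(Y)                   # Y = Q R with R upper triangular
    X_tilde = np.linalg.solve(R.T, X.T).T    # X R^{-1} without forming the inverse
    return Q, X_tilde

# Hypothetical sizes and data
rng = np.random.default_rng(0)
D, d, p = 12, 5, 2
X = np.linalg.qr(rng.standard_normal((D, p)))[0]
A = rng.standard_normal((D, d))
Q, X_tilde = project_and_normalize(A, X)
```

The identity $Q = A^T \tilde{X}$ is what allows the later objective to be written directly in terms of $A$ and the normalized points.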

7.5.3 Objective Function

The term $(a^T x_i - a^T x_j)^2$ in the LPP objective function Eq. (7.42) means the distance between the projected data $a^T x_i$ and $a^T x_j$; therefore, it is natural for us to reformulate the classic LPP objective function on Grassmann manifolds as follows:

$$\min_{A} \sum_{i,j}^{N} \mathrm{dist}_g^2(Q_i, Q_j) \, w_{ij} = \min_{A} \sum_{i,j}^{N} \mathrm{dist}_g^2(A^T \tilde{X}_i, A^T \tilde{X}_j) \, w_{ij} \tag{7.47}$$

where $w_{ij}$ reflects the similarity between the original Grassmannian points $X_i$ and $X_j$, and the distance $\mathrm{dist}_g(\cdot)$ is chosen as the embedding distance (Eq. 7.5). Hence, up to the constant factor $1/2$ in Eq. (7.5),

$$\mathrm{dist}_g^2(A^T \tilde{X}_i, A^T \tilde{X}_j) = \big\| A^T \tilde{X}_i \tilde{X}_i^T A - A^T \tilde{X}_j \tilde{X}_j^T A \big\|_F^2 = \big\| A^T G_{ij} A \big\|_F^2$$

where $G_{ij} = \tilde{X}_i \tilde{X}_i^T - \tilde{X}_j \tilde{X}_j^T$, which is a symmetric matrix of size $D \times D$. Thus, the objective function Eq. (7.47) can be rewritten as follows, termed GLPP:

$$\min_{A} \sum_{i,j=1}^{N} \big\| A^T G_{ij} A \big\|_F^2 \, w_{ij}. \tag{7.48}$$

The next issue is how to construct the adjacency graph W from the original Grassmannian points. We extend the Euclidean graph W onto Grassmann manifolds as follows. Deﬁnition 5 (Graph W on Grassmann Manifolds) Given a set of Grassmannian points {X1, . . ., XN}, we deﬁne the graph as

$$w_{ij} = \exp\big( -\mathrm{dist}_g^2(X_i, X_j) \big) \tag{7.49}$$

where wij denotes the similarity of Grassmannian points Xi and Xj. In this deﬁnition, we may set distg(Xi, Xj) to any one valid Grassmann distance. We select the embedding distance in our experiments.

7.5.4 GLPP with Normalized Constraint

Without any constraints on $A$, we may have a trivial solution from problem Eq. (7.48). To introduce an appropriate constraint, we first have to define some necessary notation. We split the normalized Grassmannian point $\tilde{X}_i \in \mathbb{R}^{D \times p}$ and the projected matrix $Q_i \in \mathbb{R}^{d \times p}$ in Eq. (7.46) into their components

$$Q_i = [q_{i1}, \ldots, q_{ip}] = [A^T \tilde{x}_{i1}, \ldots, A^T \tilde{x}_{ip}] = A^T \tilde{X}_i$$

where $q_{ij} \in \mathbb{R}^d$ and $\tilde{x}_{ij} \in \mathbb{R}^D$ with $j = 1, 2, \ldots, p$. For each $j$ ($1 \le j \le p$), define the matrices

$$Q^j = [q_{1j}, q_{2j}, \ldots, q_{Nj}] \in \mathbb{R}^{d \times N}, \quad \text{and} \quad \tilde{X}^j = [\tilde{x}_{1j}, \tilde{x}_{2j}, \ldots, \tilde{x}_{Nj}] \in \mathbb{R}^{D \times N}.$$

That is, from all $N$ projected Grassmannian points $Q_i$ (or all $N$ normalized Grassmannian points $\tilde{X}_i$), we pick their $j$th column and stack them together. Then, it is easy to check that

$$Q^j = A^T \tilde{X}^j$$

For this particularly organized matrix $Q^j$, consider the constraint condition similar to formula Eq. (7.43):

$$\mathrm{tr}\big( Q^j D Q^{jT} \big) = \mathrm{tr}\big( D Q^{jT} Q^j \big) = \mathrm{tr}\big( D \tilde{X}^{jT} A A^T \tilde{X}^j \big).$$

Hence, one possible overall constraint can be defined as

$$\sum_{j=1}^{p} \mathrm{tr}\big( D \tilde{X}^{jT} A A^T \tilde{X}^j \big) = 1.$$

Rather than using the notation $\tilde{X}^j$, we can further simplify it into a form that uses the original normalized Grassmannian points $\tilde{X}_i$. A long algebraic manipulation can prove that

$$\sum_{j=1}^{p} \mathrm{tr}\big( D \tilde{X}^{jT} A A^T \tilde{X}^j \big) = \mathrm{tr}\Big( A^T \Big( \sum_{i=1}^{N} d_{ii} \tilde{X}_i \tilde{X}_i^T \Big) A \Big).$$

Hence, we add the following constraint condition:

$$\mathrm{tr}\Big( A^T \Big( \sum_{i=1}^{N} d_{ii} \tilde{X}_i \tilde{X}_i^T \Big) A \Big) = 1.$$

Define $H = \sum_{i=1}^{N} d_{ii} \tilde{X}_i \tilde{X}_i^T$; then the final constraint condition can be written as

$$\mathrm{tr}\big( A^T H A \big) = 1 \tag{7.50}$$

Combining the objective function Eq. (7.48) and the constraint condition Eq. (7.50), we get the overall GLPP model

$$\min_{A} \sum_{i,j=1}^{N} \big\| A^T G_{ij} A \big\|_F^2 \, w_{ij} \quad \text{s.t.} \quad \mathrm{tr}\big( A^T H A \big) = 1 \tag{7.51}$$

In the next section, we propose a simpliﬁed way to solve problem Eq. (7.51), which is quite different from most Riemannian manifold based optimization algorithms such as in the Riemannian Conjugate Gradient (RCG) toolbox.

7.5.5 Optimization

In this section, we provide an iterative solution to the optimization problem Eq. (7.51). First, we write the cost function as follows:

$$f(A) = \sum_{i,j=1}^{N} w_{ij} \, \mathrm{tr}\big( A^T G_{ij} A A^T G_{ij} A \big)$$

For simplification, we define a new objective function $f_k$ in the $k$th iteration by using the last step $A^{(k-1)}$ in the following way:

$$f_k(A) = \sum_{i,j=1}^{N} w_{ij} \, \mathrm{tr}\big( A^T G_{ij} A^{(k-1)} A^{(k-1)T} G_{ij} A \big) = \mathrm{tr}\Big( A^T \Big[ \sum_{i,j=1}^{N} w_{ij} G_{ij} A^{(k-1)} A^{(k-1)T} G_{ij} \Big] A \Big) \tag{7.52}$$

Denote

$$J = \sum_{i,j=1}^{N} w_{ij} G_{ij} A^{(k-1)} A^{(k-1)T} G_{ij}$$

where $G_{ij}$ is calculated according to $A^{(k-1)}$ through both $\tilde{X}_i$ and $\tilde{X}_j$. Then the simplified version of problem Eq. (7.51) becomes

$$\min_{A} \mathrm{tr}\big( A^T J A \big), \quad \text{s.t.} \quad \mathrm{tr}\big( A^T H A \big) = 1 \tag{7.53}$$

The Lagrangian function of Eq. (7.53) is given by

$$\mathrm{tr}\big( A^T J A \big) + \lambda \big( 1 - \mathrm{tr}\big( A^T H A \big) \big) \tag{7.54}$$

which can be differentiated and translated into a generalized eigenvalue problem $J a = \lambda H a$. The matrices $H$ and $J$ are symmetric and positive semidefinite. By performing eigenvalue decomposition on $H^{-1} J$, the transform matrix $A = [a_1, \ldots, a_d] \in \mathbb{R}^{D \times d}$ is given by the eigenvectors associated with the $d$ smallest eigenvalues of the generalized eigenvalue problem.
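The iteration described in this section can be sketched as follows. This is a minimal illustration under several assumptions that are not specified in the text: the random orthonormal initialization of $A$, the fixed number of iterations, the small ridge added to $H$ to keep it invertible, the final rescaling that enforces $\mathrm{tr}(A^T H A) = 1$, and the function name `glpp` itself. The generalized eigenproblem is solved through a Cholesky factor of $H$ instead of forming $H^{-1} J$ explicitly.

```python
import numpy as np

def glpp(points, W, d, n_iter=5, ridge=1e-8, seed=0):
    """Sketch of the iterative GLPP solver.

    points: list of N orthonormal D x p bases; W: N x N symmetric similarity matrix;
    d: target dimension. Returns the projection A (D x d) and the last H.
    """
    N, D = len(points), points[0].shape[0]
    deg = W.sum(axis=1)                                   # d_ii = sum_j w_ij
    # Assumption: random orthonormal initialization A^(0)
    A = np.linalg.qr(np.random.default_rng(seed).standard_normal((D, d)))[0]
    for _ in range(n_iter):
        # Normalized points X~_i = X_i R_i^{-1}, with A^T X_i = Q_i R_i (Eq. 7.46)
        P = []
        for X in points:
            R = np.linalg.qr(A.T @ X)[1]
            Xt = np.linalg.solve(R.T, X.T).T
            P.append(Xt @ Xt.T)                           # X~_i X~_i^T
        # H of Eq. (7.50); the ridge is an added assumption to keep H invertible
        H = sum(deg[i] * P[i] for i in range(N)) + ridge * np.eye(D)
        # J built from the previous iterate A^(k-1)
        AAt = A @ A.T
        J = np.zeros((D, D))
        for i in range(N):
            for j in range(N):
                G = P[i] - P[j]                           # G_ij
                J += W[i, j] * (G @ AAt @ G)
        # Generalized eigenproblem J a = lambda H a via whitening with H = Lc Lc^T
        Lc = np.linalg.cholesky(H)
        S = np.linalg.solve(Lc, np.linalg.solve(Lc, J).T).T   # Lc^{-1} J Lc^{-T}
        _, V = np.linalg.eigh(0.5 * (S + S.T))            # eigenvalues in ascending order
        A = np.linalg.solve(Lc.T, V[:, :d]) / np.sqrt(d)  # d smallest; tr(A^T H A) = 1
    return A, H

# Hypothetical data: N = 10 points on G(2, 8), similarity graph from Eq. (7.49)
rng = np.random.default_rng(1)
pts = [np.linalg.qr(rng.standard_normal((8, 2)))[0] for _ in range(10)]
dist2 = np.array([[0.5 * np.sum((Xi @ Xi.T - Xj @ Xj.T) ** 2) for Xj in pts] for Xi in pts])
W = np.exp(-dist2)
A, H = glpp(pts, W, d=4)
```

When $D$ is much larger than $Np$, the matrix $H$ is rank deficient, which is why some form of regularization (here the ridge) or a preliminary reduction of the ambient dimension is needed before the generalized eigenproblem can be solved in this way.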

7.5.6 Experimental Results

In this section, we evaluate the proposed method GLPP on several classification and clustering tasks. The selected datasets, collected from public scenes, are the Highway Traffic dataset and the UCF sport dataset (shown in Fig. 7.7). For the classification task, we use the K Nearest Neighbor on Grassmann manifolds algorithm (GKNN) and Dictionary Learning on Grassmann manifolds (GDL) [18] as baselines, and the experimental results are listed in Table 7.4. The accuracy of the GLPP-based algorithms is at least 5 percentage points higher than that of the corresponding compared methods in most cases. Since LPP is derived by preserving local information, it is less sensitive to outliers. The experimental results also demonstrate that the low-dimensional Grassmannian points generated by our proposed method are more discriminative than those on the original Grassmann manifolds.

Fig. 7.7 Some samples from UCF sport dataset

Table 7.4 Classification results (in %) on different datasets

Dataset                    Number of samples       ACC
                           Training   Testing      GKNN     GKNN-GLPP   GDL [18]   GDL-GLPP
Highway Traffic (3 sub)    192        60           70.00    76.67*      65.00      70.00
UCF sport (13 sub)         124        26           53.85    61.54       61.54      65.38*

The number of samples is listed in the first two columns. The value marked with * gives the best performance among all the compared methods.

As for the clustering task, we select K-means on Grassmann manifolds (GKM) [25] as the compared method. Table 7.5 shows the ACC and NMI values for all algorithms. After drastically reducing dimensionality with our proposed method, the new low-dimensional Grassmann manifolds still give higher accuracy than the original high-dimensional Grassmann manifolds for all algorithms, which shows that our proposed DR scheme significantly boosts the performance of GKM.

Table 7.5 Clustering results (in %) on different datasets

Dataset                    ACC                      NMI
                           GKM [25]   GKM-GLPP      GKM [25]   GKM-GLPP
Highway Traffic (3 sub)    64.43      73.52*        27.13      38.59*
UCF sport (13 sub)         50.00      57.33*        56.54      62.70*

The values marked with * give the best performance among all the compared methods.

7.6 Conclusion

This chapter has presented a discussion of unsupervised learning for big multimodal data using mappings on Grassmann manifolds for big data clustering. We discussed three major strategies for dealing with learning tasks on manifolds (the intrinsic, extrinsic, and embedding strategies), and then proposed approaches that use the embedding strategy and an LRR model to cluster video datasets based on the new representation. We have shown the good performance of our proposed approaches on videos and image sets in computer vision applications.

References
1. Xu, R., Wunsch-II, D.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(2), 645–678 (2005)
2. Elhamifar, E., Vidal, R.: Sparse subspace clustering: algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell. 35, 2765–2781 (2013)
3. Tseng, P.: Nearest q-flat to m points. J. Optim. Theory Appl. 105(1), 249–252 (2000)
4. Gruber, A., Weiss, Y.: Multibody factorization with uncertainty and missing data using the EM algorithm. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 707–714 (2004)
5. Ho, J., Yang, M.H., Lim, J., Lee, K., Kriegman, D.: Clustering appearances of objects under varying illumination conditions. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 11–18 (2003)
6. Kanatani, K.: Motion segmentation by subspace separation and model selection. In: IEEE International Conference on Computer Vision, pp. 586–591 (2001)
7. Ma, Y., Yang, A., Derksen, H., Fossum, R.: Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Rev. 50(3), 413–458 (2008)
8. Liu, G., Lin, Z., Sun, J., Yu, Y., Ma, Y.: Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell. 35, 171–184 (2013)
9. Lang, C., Liu, G., Yu, J., Yan, S.: Saliency detection by multitask sparsity pursuit. IEEE Trans. Image Process. 21(1), 1327–1338 (2012)
10. Luxburg, U.V.: A tutorial on spectral clustering. Stat. Comput. 17(4), 395–416 (2007)
11. Chen, G., Lerman, G.: Spectral curvature clustering. Int. J. Comput. Vis. 81(3), 317–330 (2009)
12. Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(1), 888–905 (2000)
13. Donoho, D.: For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Commun. Pure Appl. Math. 59, 797–829 (2004)
14. Lerman, G., Zhang, T.: Robust recovery of multiple subspaces by geometric lp minimization. Ann. Stat. 39(5), 2686–2715 (2011)
15. Liu, J., Chen, Y., Zhang, J., Xu, Z.: Enhancing low-rank subspace clustering by manifold regularization. IEEE Trans. Image Process. 23(9), 4022–4030 (2014)
16. Tierney, S., Gao, J., Guo, Y.: Subspace clustering for sequential data. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1019–1026 (2014)
17. Wang, R., Shan, S., Chen, X., Gao, W.: Manifold-manifold distance with application to face recognition based on image set. In: IEEE Conference on Computer Vision and Pattern Recognition (2008)
18. Harandi, M.T., Sanderson, C., Shen, C., Lovell, B.: Dictionary learning and sparse coding on Grassmann manifolds: an extrinsic solution. In: International Conference on Computer Vision, pp. 3120–3127 (2013)
19. Harandi, M.T., Salzmann, M., Jayasumana, S., Hartley, R., Li, H.: Expanding the family of Grassmannian kernels: an embedding perspective. In: European Conference on Computer Vision, pp. 408–423 (2014)
20. Wang, B., Hu, Y., Gao, J., Sun, Y., Yin, B.: Laplacian LRR on product Grassmann manifolds for human activity clustering in multi-camera video surveillance. IEEE Trans. Circuits Syst. Video Technol. 27(3), 554–566 (2017)
21. Wang, B., Hu, Y., Gao, J., Sun, Y., Chen, H., Yin, B.: Locality preserving projections for Grassmann manifold. In: International Joint Conference on Artificial Intelligence, pp. 2893–2900 (2017)
22. Absil, P., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2008)
23. Wang, B., Hu, Y., Gao, J., Sun, Y., Yin, B.: Localized LRR on Grassmann manifold: an extrinsic view. IEEE Trans. Circuits Syst. Video Technol. 28(10), 2524–2536 (2018)
24. Cetingul, H., Vidal, R.: Intrinsic mean shift for clustering on Stiefel and Grassmann manifolds. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1896–1902 (2009)
25. Turaga, P., Veeraraghavan, A., Srivastava, A., Chellappa, R.: Statistical computations on Grassmann and Stiefel manifolds for image and video-based recognition. IEEE Trans. Pattern Anal. Mach. Intell. 33(11), 2273–2286 (2011)
26. Wang, B., Hu, Y., Gao, J., Sun, Y., Yin, B.: Low rank representation on Grassmann manifolds. In: Asian Conference on Computer Vision, pp. 81–96 (2014)
27. Favaro, P., Vidal, R., Ravichandran, A.: A closed form solution to robust subspace estimation and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1801–1807 (2011)
28. Lin, Z., Liu, R., Su, Z.: Linearized alternating direction method with adaptive penalty for low rank representation. In: Advances in Neural Information Processing Systems, pp. 612–620 (2011)
29. Elhamifar, E., Vidal, R.: Sparse manifold clustering and embedding. In: Advances in Neural Information Processing Systems, pp. 55–63 (2011)
30. Patel, V.M., Nguyen, H.V., Vidal, R.: Latent space sparse subspace clustering. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 225–232 (2013)
31. Wang, B., Hu, Y., Gao, J., Sun, Y., Yin, B.: Partial sum minimization of singular values representation on Grassmann manifolds. ACM Trans. Knowl. Discov. Data 12(1), 13 (2018)
32. Oh, T., Tai, Y., Bazin, J., Kim, H., Kweon, I.: Partial sum minimization of singular values in robust PCA: algorithm and applications. IEEE Trans. Pattern Anal. Mach. Intell. 38(4), 171–184 (2016)
33. Xie, Y., Ho, J., Vemuri, B.: On a nonlinear generalization of sparse coding and dictionary learning. In: International Conference on Machine Learning, pp. 1480–1488 (2013)
34. Kolda, T.G., Bader, B.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
35. Nie, F., Li, J., Li, X.: Self-weighted multiview clustering with multiple graphs. In: International Joint Conference on Artificial Intelligence, pp. 2564–2570 (2017)
36. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3, 1–122 (2011)
37. Liu, G., Yan, S.: Active subspace: toward scalable low-rank learning. Neural Comput. 24(12), 3371–3394 (2012)
38. Nie, F., Cai, G., Li, X.: Multi-view clustering and semi-supervised classification with adaptive neighbor. In: AAAI Conference on Artificial Intelligence, pp. 2408–2414 (2017)
39. He, X., Niyogi, P.: Locality preserving projections. In: Advances in Neural Information Processing Systems, pp. 153–160 (2003)
40. Huang, Z., Wang, R., Shan, S., Chen, X.: Projection metric learning on Grassmann manifold with application to video based face recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 140–149 (2015)

Part IV

Supervised Learning Strategies for Big Multimodal Data

Chapter 8

Multi-product Newsvendor Model in Multitask Deep Neural Network with Norm Regularization for Big Data

Yanfei Zhang

Abstract As a classical model in operational research, the newsvendor model has already been studied and assessed statistically by various researchers. Machine learning approaches have gradually gained recognition in this area. This chapter addresses the multi-product newsvendor model by using a multi-task deep neural network (MTL-DNN) with weight ridge regularization to explore the relations among various products. Multi-task learning aims to learn from multiple tasks that share the same goal. We propose an approach for the multi-product newsvendor model (a big data extension of the single-product model) in which multiple products are considered in a single training process. In addition, proximal algorithms are introduced for better optimization outcomes.

8.1 Introduction

The newsvendor model is a classical model in operational research for predicting the demand $d$ of a product, given the expected holding loss $c_p$ (incurred when the order amount of the product is greater than the actual demand) and the shortage loss $c_h$ (incurred when the order amount is less than the actual demand for that perishable product). The loss function for the newsvendor problem is as follows:

$$\sum_{i=1}^{N} \left\| c_h\,(d_i - f(x_i))^{+} + c_p\,(f(x_i) - d_i)^{+} \right\|_1$$

where $(a)^{+} = \max(a, 0)$.

Minimizing the proposed function, we can get the optimal order amount, which is expressed as $f(x_i)$ above [1]. Originally, this model was inferred with statistical methods [1–6]. Recently, machine learning approaches, and deep learning approaches in particular, have supplanted statistical methods with better performance in demand approximation for newsvendor-related problems [7–11]; most of them take the newsvendor model objective as their loss function, the essential part of a machine learning task. Readers can refer to [11] for more information. However, the limitation of the method in [11] is obvious: it only considers the demand for a single product at a time. If we need to deal with multiple products simultaneously, we have to repeat the algorithm multiple times, which is inefficient. Therefore, we consider a multi-product newsvendor model instead of the single-product newsvendor model by turning to the multi-task DNN [12], which can be regarded as a big data extension of the newsvendor model.

Y. Zhang, The University of Sydney Business School, The University of Sydney, Camperdown, NSW, Australia
© Springer Nature Switzerland AG 2019. K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_8

Multi-task learning mainly refers to parameter sharing in model training, which covers quite a wide range of propositions that can be roughly summarized as follows: first, learning with tasks that share the same loss function (same objective) and, second, learning with tasks that have different loss functions (different objectives). For the first kind of learning, researchers consider l1-norm regularization [13, 14], trace-norm regularization [15, 16], and l2,1-norm regularization [14, 17–19] for the regularization of parameter matrices; in the second kind, the models basically share parameters initially, and the learning process then splits in the middle stage for the different loss functions [20–23].

In this chapter, we focus on the first type, which is learning tasks with the same objective (demand prediction for each product), where different parameter regularizations are applied on the final layer of the neural network to reveal the importance of certain products at the same time, i.e., the multi-task deep neural network (MTL-DNN). As most of these norm regularizers are not differentiable everywhere, the proximal method is also explained in the following sections.
This chapter is organized as follows: Sections 8.2 and 8.3 focus on introducing the MTL-DNN and its optimization algorithm and experiment setting. Section 8.4 evaluates the performance of the proposed method on both simulated and real-world datasets. Finally, conclusions and suggestions for future work are provided in Sect. 8.5.
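As a small illustration of the single-product newsvendor loss introduced in this section, the following sketch evaluates it for given orders. The function name and the numbers are hypothetical; following the chapter's notation, `c_h` weights the shortage term $(d_i - f(x_i))^{+}$ and `c_p` weights the overage term $(f(x_i) - d_i)^{+}$.

```python
import numpy as np

def newsvendor_loss(d, f, c_h, c_p):
    """Sum over observations of c_h*(d - f)^+ + c_p*(f - d)^+."""
    d = np.asarray(d, dtype=float)
    f = np.asarray(f, dtype=float)
    shortage = np.maximum(d - f, 0.0)   # demand exceeds the order
    overage = np.maximum(f - d, 0.0)    # order exceeds the demand
    return float(np.sum(c_h * shortage + c_p * overage))

# Illustrative data: three demand observations and three order amounts.
demand = [10.0, 12.0, 8.0]
order = [11.0, 9.0, 8.0]
loss = newsvendor_loss(demand, order, c_h=2.0, c_p=1.0)
```

Here the second observation is short by 3 units and the first is over by 1 unit, so the loss is 2·3 + 1·1 = 7.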

8.2 Methods of Multi-product Newsvendor Model in MTL-DNN

In this section, we discuss the multi-task neural network for multi-product newsvendor model in detail.

8.2.1 Multi-output Neural Network

Multi-task deep neural network is based on multi-output deep neural networks. This section mainly explains the differences between multi-output and single-output neural networks in terms of input data and the model structure.


Multi-task Data Description In the original newsvendor model, we assume a single product in the dataset with a total of $N$ historical demand observations denoted by $\{d_i\}_{i=1}^{N}$. Each demand observation comes with $p$ data features encoded in binary form, collected in a column vector $x_i = [x_{i1}, \ldots, x_{ip}]^{T} \in \mathbb{R}^{p}$. In the multi-task newsvendor model, we consider $k$ products in a single training process. We assume a total number of $N$ observations. Each demand observation $d_i$ still comes with a feature set $x_i = [x_{i1}, \ldots, x_{ip}]^{T} \in \mathbb{R}^{p}$. However, $d_i$ is now a column vector that contains the demand observations of all products corresponding to these data features. For the sake of convenience, we denote $d_i = [d_i^{1}, \ldots, d_i^{k}]^{T} \in \mathbb{R}^{k}$ as the $i$-th demand observation of all products over the data feature $x_i$. The new dataset is as follows:

$$\mathcal{D} = \{(x_i, d_i)\}_{i=1}^{N}, \quad \text{where } d_i \in \mathbb{R}^{k}$$

The full demand matrix for multi-task learning is as follows:

$$D = \begin{bmatrix} d_1^{1} & d_2^{1} & \cdots & d_N^{1} \\ d_1^{2} & d_2^{2} & \cdots & d_N^{2} \\ \vdots & \vdots & \ddots & \vdots \\ d_1^{k} & d_2^{k} & \cdots & d_N^{k} \end{bmatrix} \in \mathbb{R}^{k \times N}$$

Multi-output Deep Neural Network Structure The deep neural network can be designed to generate multiple output neurons. Multi-task learning requires this multi-output structure to generate outputs for the different products, but the difference between multi-task learning and general multi-output neural networks is that the outputs are regularized differently in order to better understand their relationships. The structure is shown in Fig. 8.1, and each neuron in the first and second shared hidden layers still uses a sigmoid activation function. We define the number of neurons in the first hidden layer as $n_1$ and in the second hidden layer as $n_2$.

Fig. 8.1 Multi-task learning in deep neural network, where $x_0 = 1$ is the bias. [Network diagram: input layer $x_1, \ldots, x_p$ with bias $x_0$; 1st hidden layer; 2nd hidden layer; output layer $f^{1}(x, q), \ldots, f^{k}(x, q)$.]

The output of this neural network over all the inputs is a matrix denoted by $f(X, q) = \{f^{m}(x_i, q)\}_{m=1,\,i=1}^{k,\,N}$, where $q$ is the overall network parameter. In particular, we use $f(x_i, q)$ to denote the $k$ outputs of the network for the $i$-th input $x_i$. The matrix $f(X, q)$ lies in the Euclidean space $\mathbb{R}^{k \times N}$, meaning that each row stands for the prediction of one product across the $N$ observations. This output matrix is as follows:

$$f(X, q) = \begin{bmatrix} f^{1}(x_1; q) & f^{1}(x_2; q) & \cdots & f^{1}(x_N; q) \\ f^{2}(x_1; q) & f^{2}(x_2; q) & \cdots & f^{2}(x_N; q) \\ \vdots & \vdots & \ddots & \vdots \\ f^{k}(x_1; q) & f^{k}(x_2; q) & \cdots & f^{k}(x_N; q) \end{bmatrix} \in \mathbb{R}^{k \times N}$$
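The multi-output structure described above can be sketched as a forward pass in NumPy. This is a minimal sketch under assumptions that the text does not fix: the output layer is taken to be linear, the bias is handled by prepending a row of ones (playing the role of $x_0 = 1$) to each hidden layer's input, and the sizes and random weights are purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, q1, q2, q3):
    """X has shape (p, N); the result f(X, q) has shape (k, N)."""
    N = X.shape[1]
    ones = np.ones((1, N))                      # bias input x_0 = 1
    h1 = sigmoid(q1 @ np.vstack([ones, X]))     # first shared hidden layer, (n1, N)
    h2 = sigmoid(q2 @ np.vstack([ones, h1]))    # second shared hidden layer, (n2, N)
    return q3 @ h2                              # one output row per product (assumed linear)

# Illustrative sizes: p features, n1 and n2 hidden units, k products, N observations.
p, n1, n2, k, N = 4, 5, 3, 2, 7
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(p, N)).astype(float)   # binary-encoded features
q1 = rng.standard_normal((n1, p + 1))
q2 = rng.standard_normal((n2, n1 + 1))
q3 = rng.standard_normal((k, n2))                   # final-layer weights, one row per product
F = forward(X, q1, q2, q3)
```

Each row of `F` holds the predictions for one product across all observations, matching the layout of the output matrix.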

For notational convenience, we also denote $f_i^{m}(X; q) = f^{m}(x_i; q)$, i.e., the $m$-th output of the network for the $i$-th input $x_i$. The corresponding loss function is as follows:

$$L(f(X; q), D) = \sum_{i=1}^{N} \left\| c_h \odot (d_i - f(x_i; q))^{+} + c_p \odot (f(x_i; q) - d_i)^{+} \right\|_1 \tag{8.1}$$

where $c_p = \{c_p^{m}\}_{m=1}^{k}$ and $c_h = \{c_h^{m}\}_{m=1}^{k}$ are also column vectors in $\mathbb{R}^{k}$ with one entry per product, i.e., different pairs of $c_p^{m}$ (holding cost for the $m$-th product) and $c_h^{m}$ (shortage cost for the $m$-th product) for different products, as defined in Sect. 8.1, and $\odot$ denotes the element-wise product. Nevertheless, a multi-output DNN cannot be regarded as an MTL-DNN unless a special regularization over the final-layer parameters $q^3$ is added.

8.2.2 Special Loss Function for Missing Observations

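Generally, the observed matrix that contains full information is as follows:

$$D = \begin{bmatrix} d_1^{1} & d_2^{1} & \cdots & d_N^{1} \\ d_1^{2} & d_2^{2} & \cdots & d_N^{2} \\ \vdots & \vdots & \ddots & \vdots \\ d_1^{k} & d_2^{k} & \cdots & d_N^{k} \end{bmatrix} \in \mathbb{R}^{k \times N}$$

However, there might be some missing values in the observed matrix $D$, i.e., missing observations of a product's sales under certain conditions, e.g.:

$$\begin{bmatrix} \mathrm{NA} & d_2^{1} & \cdots & \mathrm{NA} \\ d_1^{2} & \mathrm{NA} & \cdots & d_N^{2} \\ \vdots & \vdots & \ddots & \vdots \\ d_1^{k} & \mathrm{NA} & \cdots & d_N^{k} \end{bmatrix}$$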

where “NA” stands for information missing from the observed matrix. The original loss function can no longer be applied in this case, as the missing values would generate wrong outputs. If we simply delete those entries, the matrix dimensions no longer agree with each other, and the BP (backpropagation) algorithm cannot be run. The question then arises: can we modify the loss function so that it adapts to the missing information in the observed demand matrix? Recall that the demand information is clearly labeled. Therefore, we can identify the missing values that would influence the loss calculation, and create an indexing matrix $\Xi$ that records which entries are labeled and selects the corresponding entries of the output matrix. For the previous example, the indexing matrix is as follows:

$$\Xi = \begin{bmatrix} 0 & 1 & \cdots & 0 \\ 1 & 0 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 0 & \cdots & 1 \end{bmatrix} \in \mathbb{R}^{k \times N}$$

Those entries with “NA” are replaced with 0, while the valid entries are replaced by 1, meaning that the missing entries are not counted by the loss function. Thus, the corresponding multi-output newsvendor loss function (8.1), in matrix form, can be expressed as follows:

$$L(f(X; q), D) = \left\| \Xi \odot \left[ c_h \odot (D - f(X; q))^{+} + c_p \odot (f(X; q) - D)^{+} \right] \right\|_1 \tag{8.2}$$

where $\|\cdot\|_1$ stands for the matrix l1-norm, which sums the absolute values of all the elements in the matrix, and $\odot$ represents the element-wise product between a matrix and a vector, or between matrices. The effect of the indexing matrix on the final output can be shown by a simple graph (Fig. 8.2), where

$$C\left(f_i^{m}(X; q)\right) = c_h^{m}\left(d_i^{m} - f^{m}(x_i; q)\right)^{+} + c_p^{m}\left(f^{m}(x_i; q) - d_i^{m}\right)^{+}$$

Fig. 8.2 After-indexing output

The corresponding (sub-)gradient of this loss function can be calculated as follows:

$$\frac{\partial L(f(X; q), D)}{\partial f(X; q)} = \Xi \odot \left[ c_p \odot I\left(D - f(X; q) < 0\right) - c_h \odot I\left(D - f(X; q) > 0\right) \right] \tag{8.3}$$

where $I(\cdot)$ is the element-wise indicator function.
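The masked loss of Eq. (8.2) and its gradient with respect to the network output can be sketched as follows. The sketch assumes that missing demands are stored as `NaN` (the text only says they are labeled) and that `c_h` and `c_p` hold one cost per product; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def masked_newsvendor_loss(F, D, c_h, c_p):
    """F, D: (k, N) prediction and demand matrices; NaN in D marks a missing entry."""
    Xi = (~np.isnan(D)).astype(float)            # indexing matrix: 1 = observed, 0 = missing
    D0 = np.where(np.isnan(D), 0.0, D)           # placeholder value, masked out by Xi
    c_h = np.asarray(c_h, dtype=float).reshape(-1, 1)   # one shortage cost per product (row)
    c_p = np.asarray(c_p, dtype=float).reshape(-1, 1)   # one holding cost per product (row)
    diff = D0 - F
    per_entry = c_h * np.maximum(diff, 0.0) + c_p * np.maximum(-diff, 0.0)
    loss = float(np.sum(Xi * per_entry))
    # Sub-gradient w.r.t. F: +c_p where the order exceeds demand, -c_h where it falls short.
    grad = Xi * (c_p * (diff < 0) - c_h * (diff > 0))
    return loss, grad, Xi

# Two products, three observations, two missing demands.
D = np.array([[np.nan, 5.0, 3.0],
              [4.0, np.nan, 6.0]])
F = np.array([[1.0, 4.0, 5.0],
              [6.0, 2.0, 6.0]])
loss, grad, Xi = masked_newsvendor_loss(F, D, c_h=[2.0, 1.0], c_p=[1.0, 3.0])
```

Entries with a missing demand contribute neither to the loss nor to the gradient, so the matrix dimensions stay intact for backpropagation.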

8.2.3 Weight Constraints by Special Norm Regularization

In general, the l2-norm regularization is applied on all the weight matrices to prevent overfitting. The optimization problem can be decomposed into three independent regularizations on three different weight matrices in our setting:

$$\min_{q}\ L(f(X; q), D) + \frac{\gamma}{2}\left\|q^3\right\|_2^2 + \frac{\lambda}{2}\left(\left\|q^2\right\|_2^2 + \left\|q^1\right\|_2^2\right) \tag{8.4}$$

Referring to previous works in multi-task learning [13, 17, 19, 24, 25], we can see that the relationship among the outputs should be given priority, and this is the essential point of multi-task learning. Therefore, we only apply special regularization on the weights that directly connect to the final layer, which is $q^3$.

Norm Regularization for Sparse Representation Sparse restrictions come from the assumption that only a small set of critical features determine the output, indicating that the weight matrix should contain many 0s to mask the unused input features. In mathematical terms, this is called sparsity.

1. l1-norm According to [14], the ideal sparsity regularizer is the l0-norm, but it results in an NP-hard optimization problem. In practice, the l1-norm is usually used as a surrogate for relaxation, as it is the convex envelope of the l0-norm [26]. The scalarized representation of the l1-norm is as follows:

$$\Omega(q^3) = \left\|q^3\right\|_1 = \sum_{ij}\left|q^3_{ij}\right| \tag{8.5}$$

The final optimization problem is as follows:

$$\min_{q}\ L(f(X; q), D) + \gamma\left\|q^3\right\|_1 + \frac{\lambda}{2}\left(\left\|q^2\right\|_2^2 + \left\|q^1\right\|_2^2\right) \tag{8.6}$$

2. l2/1-norm Another available regularization term was proposed by [17], which is the l2,1-norm. We improve on it by applying a similar l2/1-norm. This norm sums the squared elements of each row, takes the square root of each sum, and then adds up the resulting row-wise l2-norms. As introduced in Sect. 8.1, this norm regularization can be viewed as an “l1-norm on row elements”: it distinguishes the importance of certain row vectors in the weight matrix by assigning them appropriate non-zero values and shrinks the remaining row vectors to 0, which differentiates the important tasks from the unimportant ones. Different values of $\gamma$ stand for different levels of regularization; a higher $\gamma$ gives stronger regularization. Here, we apply the l2-norm as an example. The scalarized representation of this norm is as follows:

$$\Omega(q^3) = \left\|q^3\right\|_{2/1} = \sum_{m=1}^{k}\sqrt{\sum_{j=1}^{n_2}\left|q^3_{mj}\right|^2} = \sum_{m=1}^{k}\sqrt{\sum_{j=1}^{n_2}\left(q^3_{mj}\right)^2} \tag{8.7}$$

As the squared terms inside the l2-norm are naturally non-negative, we can eliminate the absolute value operator. The final optimization problem is as follows:

$$\min_{q}\ L(f(X; q), D) + \gamma\left\|q^3\right\|_{2/1} + \frac{\lambda}{2}\left(\left\|q^2\right\|_2^2 + \left\|q^1\right\|_2^2\right) \tag{8.8}$$

Norm Regularization for Low-Rank Representation Generally speaking, the low-rank assumption is based on the fact that the observations within a task group should have similar attributes, so that their corresponding rows are more likely to share a similar structure, i.e., their corresponding rows in the weight matrix should be linearly dependent, resulting in a low-rank parameter matrix. As mentioned in Sect. 8.1, since the row rank equals the column rank, this regularization is applicable to both row and column grouping, i.e., task grouping and important feature selection. Chen et al. [14] linearly combined both low-rank and sparse representations on the final matrix. For simplicity, only the trace norm is discussed here:

$$\min_{q}\ L(f(X; q), D) + \gamma\left\|q^3\right\|_{*} + \frac{\lambda}{2}\left(\left\|q^2\right\|_2^2 + \left\|q^1\right\|_2^2\right) \tag{8.9}$$

where $\|\cdot\|_{*}$ stands for the trace norm, defined as the sum of all the singular values of a matrix. The term $\|q^3\|_{*}$ is normally used as a substitute for $\mathrm{rank}(q^3)$, based on the fact that $\|q^3\|_{*} \le \mathrm{rank}(q^3)$ for all $q^3 \in \{q^3 \mid \|q^3\|_2 \le 1\}$.
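The three regularizers of this section, Eqs. (8.5) and (8.7) and the trace norm in Eq. (8.9), can be evaluated with a few lines of NumPy. The function names and the toy matrix are illustrative only; rows of `Q` play the role of tasks, as in $q^3 \in \mathbb{R}^{k \times n_2}$.

```python
import numpy as np

def l1_norm(Q):
    """Sum of absolute values of all entries, Eq. (8.5)."""
    return float(np.sum(np.abs(Q)))

def l21_norm(Q):
    """Sum over rows of the row-wise l2-norm, Eq. (8.7)."""
    return float(np.sum(np.sqrt(np.sum(Q ** 2, axis=1))))

def trace_norm(Q):
    """Sum of the singular values of Q."""
    return float(np.sum(np.linalg.svd(Q, compute_uv=False)))

# A rank-one matrix whose second row (task) is entirely zero.
Q = np.array([[3.0, 4.0],
              [0.0, 0.0]])
values = (l1_norm(Q), l21_norm(Q), trace_norm(Q))
```

For this matrix the l1-norm is 7, while the l2/1-norm and the trace norm are both 5: the l2/1-norm counts the zero row as contributing nothing, and the trace norm sees a single non-zero singular value.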

8.3 The Optimization Method

The optimization problem in the previous section needs to be tackled with suitable optimization algorithms. In this section, two main optimization algorithms are introduced: the traditional gradient-based algorithm for deep neural networks and the proximal algorithm.

Gradient-Based Algorithm The general gradient optimization methods for deep neural networks require gradients and Hessians (second-order gradients) of the whole cost function $C(f(X, q), D) = L(f(X, q), D) + \Omega(q)$. Thus in this section, we provide the specific gradient or sub-gradient (for some non-differentiable regularization terms) of the previously mentioned regularizers and a special manipulation of the sub-gradient as an approximation to the gradient.

1. Sub-gradient of l1-norm According to Zhang and Gao [11], the non-differentiable nature of this regularizer at 0 should be noted. Here, we use the sub-gradient as defined below:

$$\partial\Omega\left(q^3_{ij}\right) = \begin{cases} 1, & q^3_{ij} > 0 \\ [-1, 1], & q^3_{ij} = 0 \\ -1, & q^3_{ij} < 0 \end{cases} \tag{8.10}$$

The most general method for approximating the gradient from the sub-gradient is to randomly choose one value from $[-1, 1]$, e.g., 0.1 or 0.5, or we can just specify a value within this interval for stability, e.g., 0. When integrating the gradient into the BP algorithm, if $q^3_{ij} = 0$, an error term $\epsilon$ that follows a normal distribution $N(0, 1)$ is added as a small perturbation that keeps the gradient well defined, which gives $q^3_{ij} + \epsilon$.

2. Sub-gradient for l2/1-norm This norm is differentiable at most points, but it cannot be differentiated when the sum of the squared row elements is 0, in which case all the elements in one row are 0. The function $g(x) = \sqrt{x}$ is non-differentiable at $x = 0$, and taking $x = \sum_{j=1}^{n_2}\left(q^3_{ij}\right)^2$ makes $g(x)$ the exact formula of the l2/1-norm for row vectors, so the norm is differentiable only where the sum of the squared row elements is not 0. The sub-gradient of the inner l2-norm at 0 is given by the following equation, which means that any vector $g$ that satisfies the following condition can be used to replace the gradient [27]:

$$\partial\left\|q^3\right\|_{2/1} = \left\{ g \mid \|g\|_2 \le 1 \right\} \tag{8.11}$$

However, when deriving the gradient of the l2/1-norm for a row with $\sum_{j=1}^{n_2}\left(q^3_{ij}\right)^2 = 0$, all elements in that row vector are 0, so we can simply take the sub-gradient at that point to be 0 when all $q^3_{ij} = 0$, which coincides with substituting $q^3_{ij} = 0$ into the numerator of the gradient function. Thus the sub-gradient for the l2/1-norm is as follows:

$$\partial\Omega\left(q^3_i\right) = \frac{q^3_i}{\sqrt{\sum_{j=1}^{n_2}\left(q^3_{ij}\right)^2}} \tag{8.12}$$

3. Sub-gradient for the Trace Norm This sub-gradient requires the Singular-Value Decomposition (SVD) of $q^3$ [28]:

$$\partial\Omega\left(q^3\right) = \left\{ UV^{T} + W : W \in \mathbb{R}^{k \times n_2},\ U^{T}W = 0,\ WV = 0,\ \|W\|_2 \le 1 \right\} \tag{8.13}$$

where the SVD of $q^3$ is $U\Sigma V^{T}$ and $\Sigma$ is the diagonal matrix of all non-zero singular values of $q^3$. This subdifferential has the following meaning: the sub-gradient of the trace norm is $UV^{T} + W$, where $W$ needs to satisfy the constraints in Eq. (8.13). In the BP algorithm, we use one particular sub-gradient as the derivative needed. This sub-gradient is $UV^{T}$, i.e., we take $W = 0$, which satisfies all the conditions above [29].

Proximal Algorithm As previously illustrated, if the function cannot be differentiated at some point, or the function is non-differentiable, we can pick a particular element of the sub-gradient set for use in approximate gradient algorithms. In general, however, this kind of approximation to the gradient should be avoided because it is not as accurate as one would wish. Thus, we need to identify another compatible minimization method.
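The particular sub-gradient choices described for the gradient-based algorithm, Eqs. (8.10), (8.12), and (8.13) with $W = 0$, can be sketched as follows. The choices at the non-differentiable points (0 for zero entries and for all-zero rows) follow the simplest option mentioned in the text; the function names and the tolerance `tol` are hypothetical.

```python
import numpy as np

def subgrad_l1(Q):
    """Eq. (8.10), picking the value 0 from [-1, 1] at zero entries."""
    return np.sign(Q)

def subgrad_l21(Q):
    """Eq. (8.12) row by row; an all-zero row gets the sub-gradient 0."""
    norms = np.sqrt(np.sum(Q ** 2, axis=1, keepdims=True))
    safe = np.where(norms > 0, norms, 1.0)       # avoid division by zero
    return np.where(norms > 0, Q / safe, 0.0)

def subgrad_trace(Q, tol=1e-10):
    """Eq. (8.13) with W = 0: U V^T over the non-zero singular values."""
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    r = int(np.sum(s > tol))                     # numerical rank
    return U[:, :r] @ Vt[:r, :]

Q = np.array([[3.0, 4.0],
              [0.0, 0.0]])
g1, g21, gtr = subgrad_l1(Q), subgrad_l21(Q), subgrad_trace(Q)
```

For this rank-one example with a zero second row, the l2/1 and trace-norm sub-gradients coincide.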


First, note that the only parameter matrix that requires special regularization is $q^3$; at the proximal operator step, the optimization problem can therefore be simplified as follows:

$$\min_{q}\ L(f(X; q), D) + \gamma\,\Omega\left(q^3\right) \tag{8.14}$$

This representation means that we can treat the neural network layer by layer and handle it with standard proximal algorithms. As the deep neural network with the original loss function $L(f(X, q), D)$ is a non-smooth structure [30], the loss function itself should be relaxed or replaced by appropriate linearized proximal methods.

Proximal Methods Require Kurdyka–Lojasiewicz Conditions Given $\bar{q} \in \mathrm{dom}(\partial f)$ and $\eta \in (0, +\infty]$, if there exist a neighborhood $U$ of $\bar{q}$ and a function $\varphi : [0, \eta) \to \mathbb{R}_{+}$ that satisfy the following conditions:

– $\varphi(0) = 0$
– $\varphi$ is continuous and differentiable in $(0, \eta)$
– $\forall s \in (0, \eta)$, $\varphi'(s) > 0$
– $\forall q \in U \cap [f(\bar{q}) < f(q) < f(\bar{q}) + \eta]$, $\varphi'\left(f(q) - f(\bar{q})\right)\,\mathrm{dist}\left(0, \partial f(q)\right) \ge 1$,

then the function $f(q)$ satisfies the Kurdyka–Lojasiewicz conditions.

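Any norm and the gradient of our cost function can satisfy this special condition. The Lipschitz continuous gradient on the cost function was developed based on this fact:

$$\left\|\nabla L\left(f(X; q^3), D\right) - \nabla L\left(f(X; \hat{q}^3), D\right)\right\|_2 \le \mathrm{Lip}\,\left\|q^3 - \hat{q}^3\right\|_2 \tag{8.15}$$

where Lip is the Lipschitz constant of the loss function, whose formula is given as the $\ell_1/\ell_\infty$-norm of the first-order gradient of the loss function with respect to $q^3$, which finds the largest absolute value over all the row vectors in the gradient:

$$\mathrm{Lip} = \left\|\frac{\partial L\left(f(X; q^3), D\right)}{\partial q^3}\right\|_{\infty} \tag{8.16}$$

where the detailed gradient $\frac{\partial L}{\partial q^3}$ has already been given in [11]. From this special condition, we can further obtain the following lemma:

$$L\left(f(X; q^3), D\right) \le L\left(f(X; \hat{q}^3_k), D\right) + \left\langle \nabla L\left(f(X; \hat{q}^3_k), D\right),\ q^3 - \hat{q}^3_k \right\rangle + \frac{\mathrm{Lip}}{2}\left\|q^3 - \hat{q}^3_k\right\|_2^2 \tag{8.17}$$

where $\tilde{L}\left(f(X; \hat{q}^3_k), D\right) = L\left(f(X; \hat{q}^3_k), D\right) + \left\langle \nabla L\left(f(X; \hat{q}^3_k), D\right),\ q^3 - \hat{q}^3_k \right\rangle$ is the first-order Taylor expansion of the loss function at the point $\hat{q}^3_k$.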


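Proximal methods [31] are applied to solve non-convex or non-smooth problems. A simple explanation of the proximal algorithm in this context is as follows:

1. Initialize a starting point $\hat{q}^3_k$;
2. Use the Taylor expansion to find the linearized approximation at this starting point: $\tilde{L}\left(f(X; \hat{q}^3_k), D\right) = L\left(f(X; \hat{q}^3_k), D\right) + \left\langle \nabla L\left(f(X; \hat{q}^3_k), D\right),\ q^3 - \hat{q}^3_k \right\rangle$;
3. Solve for the next step by $\hat{q}^3_{k+1} = \arg\min_{q^3}\ \frac{\mathrm{Lip}}{2}\left\|q^3 - \hat{q}^3_k\right\|_2^2 + \tilde{L}\left(f(X; \hat{q}^3_k), D\right) + \gamma\,\Omega\left(q^3\right)$;
4. Repeat step 3 $T$ times to obtain the final $\hat{q}^3_T$.

Here $\mathrm{prox}_{\gamma/\mathrm{Lip}}(\cdot)$ is the proximal operator [31], which is constructed in the following form:

$$\hat{q}^3_{k+1} = \mathrm{prox}_{\gamma/\mathrm{Lip}}\left(\hat{q}^3_k - \frac{1}{\mathrm{Lip}}\nabla L\left(f(X; \hat{q}^3_k), D\right)\right) = \arg\min_{q^3}\ \frac{1}{2}\left\|q^3 - \left(\hat{q}^3_k - \frac{1}{\mathrm{Lip}}\nabla L\left(f(X; \hat{q}^3_k), D\right)\right)\right\|_2^2 + \frac{\gamma}{\mathrm{Lip}}\,\Omega\left(q^3\right).$$

We then define $\hat{q}^3_k - \frac{1}{\mathrm{Lip}}\nabla L\left(f(X; \hat{q}^3_k), D\right)$ as $\tilde{q}_k$ for convenience. Among the norms discussed previously, two need to be treated specially: the l1-norm ($\|q^3\|_1$) and the trace norm ($\|q^3\|_{*}$). The solutions of both proximal operators are as follows:

(i) $\|q^3\|_1$ [32]:

$$\hat{q}^3_{k+1} = \begin{cases} \mathrm{sign}\left(\tilde{q}_k\right)\left(\left|\tilde{q}_k\right| - \dfrac{\gamma}{\mathrm{Lip}}\right), & \left|\tilde{q}_k\right| > \dfrac{\gamma}{\mathrm{Lip}} \\ 0, & \left|\tilde{q}_k\right| \le \dfrac{\gamma}{\mathrm{Lip}} \end{cases} \tag{8.18}$$

(ii) $\|q^3\|_{*}$ [33]:

$$\hat{q}^3_{k+1} = U'\left(\Sigma' - \frac{\gamma}{\mathrm{Lip}} I_n\right)V'^{T} \tag{8.19}$$

where Singular Value Thresholding (SVT) is applied on $\tilde{q}_k$, $\Sigma'$ is the diagonal matrix consisting of all the singular values of $\tilde{q}_k$ that are larger than $\frac{\gamma}{\mathrm{Lip}}$, and $U'$ and $V'$ are the singular vectors corresponding to $\Sigma'$. The proximal operators of some norms have closed-form solutions, i.e., a unique solution. Thus this algorithm is more reliable than the gradient optimization methods, which pick a possible solution at random from the sub-gradients as the approximation.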
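The two closed-form proximal operators, Eqs. (8.18) and (8.19), can be sketched in NumPy as below. The argument `tau` stands for the threshold $\gamma/\mathrm{Lip}$ and `Q_tilde` for $\tilde{q}_k$; the function names and test matrices are illustrative.

```python
import numpy as np

def prox_l1(Q_tilde, tau):
    """Soft-thresholding, Eq. (8.18): shrink every entry toward 0 by tau."""
    return np.sign(Q_tilde) * np.maximum(np.abs(Q_tilde) - tau, 0.0)

def prox_trace(Q_tilde, tau):
    """Singular value thresholding, Eq. (8.19): shrink the singular values by tau."""
    U, s, Vt = np.linalg.svd(Q_tilde, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)      # singular values below tau drop to 0
    return (U * s_shrunk) @ Vt               # same as U @ diag(s_shrunk) @ Vt

P1 = prox_l1(np.array([[3.0, -0.5], [-2.0, 1.0]]), tau=1.0)
P2 = prox_trace(np.diag([3.0, 0.5]), tau=1.0)
```

Entries (or singular values) no larger than the threshold are set exactly to zero, which is how these operators produce sparse or low-rank final-layer weights.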


When we integrate this simple algorithm into the neural network, the proximal optimization of the final-layer parameters should be distinguished from the optimization of the other two layers of parameters. This process can be expressed by a simple pseudocode:

Algorithm 1 Neural Network Optimization with Proximal Operator Algorithm

function NeuronPA(…)
    Get the total number of elements in $q^1$ and $q^2$ as $n_1$ and $n_2$
    Get the gradients with $n_1$ and $n_2$ as grad1 and grad2
    Update $q^1$ by subtracting a step along grad1        ▷ Gradient descent for the first layer
    Update $q^2$ by subtracting a step along grad2        ▷ Gradient descent for the second layer
    First update of the gradients by utilizing the new weights
    Set the starting point $\hat{q}^3_0$ to the current $q^3$
    for j in 1, …, maxiter do
        Use $\mathrm{prox}_{\gamma/\mathrm{Lip}}$ to calculate $\hat{q}^3_{j+1} = \arg\min_{q^3}\ \frac{1}{2}\left\|q^3 - \tilde{q}_j\right\|_2^2 + \frac{\gamma}{\mathrm{Lip}}\,\Omega\left(q^3\right)$
    end for
    Second update of the gradients by utilizing the new weights
end function
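The inner loop of Algorithm 1 can be sketched on a toy problem. This is not the network training itself: the newsvendor loss is replaced by a simple weighted least-squares loss (an assumption made so that the example is self-contained and its minimizer is known), the regularizer is the l1-norm, and all names are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient(grad_fn, prox_fn, q0, lip, gamma, max_iter=200):
    """Repeat q <- prox_{gamma/lip}(q - grad(q)/lip) for max_iter steps."""
    q = q0.copy()
    for _ in range(max_iter):
        q_tilde = q - grad_fn(q) / lip        # gradient step on the smooth part
        q = prox_fn(q_tilde, gamma / lip)     # proximal step on the regularizer
    return q

# Toy smooth loss: 0.5 * sum(a * (q - y)**2), whose gradient is a * (q - y).
a = np.array([2.0, 1.0, 1.0, 0.5])
y = np.array([3.0, -0.2, 0.5, -2.0])
grad_fn = lambda q: a * (q - y)
lip = float(np.max(a))                        # Lipschitz constant of this gradient

q_star = proximal_gradient(grad_fn, soft_threshold, np.zeros_like(y), lip, gamma=1.0)
```

For this separable toy loss the minimizer is known coordinate by coordinate (soft-thresholding `y` at `gamma / a`), which gives `[2.5, 0, 0, 0]` here; in the chapter's setting, `grad_fn` would instead return the gradient of the newsvendor loss with respect to $q^3$.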

8.4 Numerical Experiment for Multi-task Newsvendor

In this section, we test the ability of revealing task relatedness for those regularization terms based on the deep neural network. There are two assessments for each task: testing error comparison and colormap of the outcome matrix.

8.4.1

Simulation Study for Information Loss in Demand

Previously, we introduced the indexing method for information loss in demand. In this section, we generate a demand matrix whose products correlate with each other, and then randomly drop different observations in each task in the training set to view the differences between testing errors. For convenience sake, we replace the special regularization term of $q_3$ with the original l2-norm regularization term. The testing error modified for the multi-demand version is as follows:

$$\mathrm{TestErr} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} \left\| \hat{f}\left(x_i^{test}; \hat{q}\right) - d_i^{test} \right\|_2^2 \tag{8.20}$$

Table 8.1 Testing error

m         0         500       1000      2000      4000      8000
TestErr   0.08692   0.08689   0.08696   0.08709   0.08728   0.08797

We simulate a six-product case with six demands. The simulated demand matrix is generated by applying a specified weight matrix $\hat{q}_3$ in the neural network. As the relationship between tasks can be determined by the last weight matrix $q_3$, we simulated the weight matrix $\hat{q}_3$ out of a multivariate normal distribution with $\mu = \mathbf{1} \in \mathbb{R}^{6}$ and the covariance matrix $\Sigma \in \mathbb{R}^{6 \times 6}$ whose diagonal values (variances) are $\sigma^2 = [1, 1.21, 1.44, 1.69, 1.96, 2.25]$. We set the matrix structure as follows: the number of units in hidden layer 1 is $n_1 = 10$ and the number of units in hidden layer 2 is $n_2 = 100$, thus the simulated weight $q_3 \in \mathbb{R}^{6 \times 100}$. The training data set contains input data features in $\mathbb{R}^{43 \times 9877}$, and the testing data set contains data features in $\mathbb{R}^{43 \times 3293}$. We used these sets of data features as the input and then generated the corresponding training set in $\mathbb{R}^{6 \times 9877}$ and testing set in $\mathbb{R}^{6 \times 3293}$. We then trained the model where the amounts of lost demand observations in each task are $m = 0, 500, 1000, 2000, 4000,$ and $8000$. We generated the missing demand observations by randomly selecting the demand observations from a uniform distribution, then labeled them by 1, which indicates abandoned observations. The simulation testing error (TestErr) for the different numbers of abandoned observations is reported in Table 8.1.

The testing errors across all the training sets are almost the same, which suggests that the weight matrices $q_1$, $q_2$, and $q_3$ obtained after abandoning certain observations might statistically be the same. For further justification, we trained the weights with different levels of loss information, and compared two kinds of training errors under the same weights: the training error with the loss information indexing matrix $\Xi$ (TrainErr1), and the training error without loss information (TrainErr2), whose formulas are as follows:

$$\mathrm{TrainErr1} = \frac{1}{n_{train}} \left\| \Xi \odot \left( f(X; q) - D \right) \right\|_2^2, \qquad \mathrm{TrainErr2} = \frac{1}{n_{train}} \left\| f(X; q) - D \right\|_2^2 \tag{8.21}$$

The outcomes are reported in Table 8.2.

Table 8.2 Training error comparison

m           0         500       1000      2000      4000      8000
TrainErr1   0.08679   0.08676   0.08682   0.08695   0.08715   0.08785
TrainErr2   0.08679   0.08689   0.08696   0.08708   0.08728   0.08797

This comparison shows that the training loss with the specified matrix for loss information (TrainErr1) is nearly constant over the different loss levels and does not significantly deviate from TrainErr2. This supports the claim that, even with a great amount of loss information, the trained weights remain robust and consistent with the weights trained on full information; thus the modified loss function is compatible with the original loss function.
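The two training errors compared above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the chapter's implementation: the function name `training_error` is hypothetical, matrices are stored with products as rows and observations as columns (matching the 6 × 9877 layout), and the indexing matrix is assumed to hold 1 for an observed demand and 0 for a lost one, applied element-wise.

```python
def training_error(pred, demand, mask=None):
    """Mean squared error over observations, optionally masked element-wise.

    pred, demand: lists of rows (one row per product, one column per observation).
    mask: same shape, 1 for an observed demand and 0 for a lost one (assumed
          encoding); None means every entry is used.
    """
    n_train = len(pred[0])
    total = 0.0
    for i, (p_row, d_row) in enumerate(zip(pred, demand)):
        for j, (p, d) in enumerate(zip(p_row, d_row)):
            weight = 1.0 if mask is None else mask[i][j]
            total += (weight * (p - d)) ** 2
    return total / n_train


# Toy numbers (assumed): two products, two observations, one lost demand.
pred = [[1.0, 2.0], [3.0, 5.0]]
demand = [[1.5, 2.0], [3.0, 4.0]]
xi = [[1, 1], [1, 0]]

train_err_1 = training_error(pred, demand, mask=xi)  # with the indexing matrix
train_err_2 = training_error(pred, demand)           # without it
print(train_err_1, train_err_2)
```

In this toy case the masked error ignores the residual at the lost entry, so it is smaller than the unmasked error; the experiment in Table 8.2 checks that the two stay close on the simulated data.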

8.4.2 Simulation Study in Norm Regularization

In these experiments, only the training process was considered. In this section, we generate a weight matrix $q_3$ that contains two different correlated weight groups, implying two task groups. The first weight matrix was simulated out of a multivariate normal distribution with $\mu = 0.5 \cdot \mathbf{1} \in \mathbb{R}^{5}$ and the covariance matrix $\Sigma \in \mathbb{R}^{5 \times 5}$ whose diagonal values (variances) are $\sigma^2 = [1, 4, 9, 16, 25]$, indicating that there are at least five tasks that relate to each other. Thus, the simulated weight matrix has the size $5 \times 100$. Then, another weight matrix with the size of $5 \times 100$ was created under another multivariate normal distribution where $\mu = 0.5 \cdot \mathbf{1} \in \mathbb{R}^{5}$ and the variances are $[1, 1.21, 1.44, 1.69, 1.96]$. Then, between the two simulated weight matrices introduced above, we inserted 15 rows of simulated weights that are randomly drawn from the uniform distribution with lower bound $a = 0$ and upper bound $b = 1$. The covariance calculation shows that there was no relationship between these 15 rows. This creates a weight matrix $q_3$ with the size of $25 \times 100$, indicating 25 products and 100 neurons in the second layer (as shown in Fig. 8.3). Then, we put this simulated $\hat{q}_3$ into the neural network structure and replaced the final layer of weights. We utilized the set of data features mentioned above as the input for the neural network, and then created a set of simulated demand with the size $25 \times 9877$. This demand also shows a similar structure to the weight matrix. We then tested the performance of the l2/1-norm and the trace norm by applying both norms to the regularization of $\hat{q}_3$ and observed their patterns by plotting out their corresponding colormaps, then compared the colormaps with the true simulated weight matrix to find out whether the pattern was learned or inferred. After comparison with the trace norm, the result can be applied as a further reference for the robustness of the l1-norm. The reason is explained later in the chapter.

Fig. 8.3 Simulated final weight matrix

Fig. 8.4 Final weight matrix after l2/1-norm regularization

Robustness of l2/1-Norm and Trace Norm

After simulating the data that determines the robustness of the l1-norm, we then simulate weights with certain relationships between tasks, which enables us to test the ability of the l2/1-norm and the trace norm to reveal the relationship.

l2/1-Norm-Regularized Weight Matrix

After the training process, the colormap for the l2/1-norm-regularized weight matrix is shown in Fig. 8.4. Figure 8.4 shows the pattern of separation between the corresponding weights for the two simulated related task groups (rows 1–5 and 21–25), which is consistent with the previous simulated weight matrix. Following the color bar on the side, most of the unrelated task coefficients are close to 0 (aqua blue), which is consistent with the characteristics of the l2/1-norm: the unrelated task parameters were pulled to 0.

However, the boundary between the first related task group and the rest of the tasks is blurred, meaning that the ability to distinguish unrelated tasks from related tasks is not strong enough.

Trace-Norm-Regularized Weight Matrix

After the training process, the colormap for the trace-norm-regularized weight matrix is shown in Fig. 8.5. Recalling that the low-rank regularization acts on both rows (tasks) and columns (neurons), Fig. 8.5 presents clear differences between the weights for the two correlated task groups (rows 1–5 and 21–25) and the unrelated task group (rows 6–20). It also shows that in this specifically trained neural network, there are several neurons (columns) whose values are close to 0 across all the tasks, meaning that they make little contribution to the final output, e.g., the 52nd to 58th neurons.

Fig. 8.5 Final weight matrix after trace norm regularization

Robustness of l1-Norm

The expected outcome for the l1-norm-regularized weight matrix is that all but a few important columns are regularized to 0. Remember that for a matrix, the number of linearly independent columns and the number of linearly independent rows are the same. Thus, if the trace-norm-regularized weight matrix is consistent with the true pattern in rows, it should also reflect the true pattern in columns because of the uniqueness of rank. Therefore, in this part we compare the colormap of the l1-norm-regularized weight matrix with that of the trace-norm-regularized weight matrix under the same simulated demand matrix. After the training process, the resulting colormap for the l1-norm-regularized weight matrix is shown in Fig. 8.6. From Fig. 8.6, we can find a clear correspondence between the trace-norm-regularized weight matrix and the l1-norm-regularized weight matrix: the small blocks in colors different from aqua blue (values that are significantly different from 0) in the l1-norm-regularized weight matrix are at the same locations as the trace-norm-regularized weight matrix indicates. We can also see a blurred sign in the last five rows and the first five rows, which roughly indicates task grouping.


Fig. 8.6 Final weight matrix after l1-norm regularization

8.4.3 Empirical Study

This empirical study applies the previously introduced Foodmart data to test the specific multi-task problem. As previously illustrated, the Foodmart data contain 9877 and 3293 observations in the training and testing sets, respectively, each with 43 data features encoded in binary form and one demand observation. For multi-task learning, we further explored the original dataset with MySQL and sorted 10 newsvendor products out of the original dataset. The relationship among those products is unknown before the experiment. For convenience sake, the names of those products are replaced by product 1 to product 10. Then, we applied three different norm regularization terms, respectively, on this dataset to explore whether there is some relationship between those products, and which of the three norm regularization terms has the least testing error. We applied a special Ξ containing 10 lines of binary vectors to select the correspondence between the output and the true data, which is shown as follows:


Fig. 8.7 Final weight matrix after l1-norm regularization for empirical dataset

In addition to what is illustrated above, if the amount of data is sufficiently large, the trained structure is robust at different levels of loss information in the matrix. When the loss information was added back to calculate the training error and testing error, the result was consistent with the original training error and testing error, which shows that the entries of the final prediction $f(x_i, q)$ that correspond to the loss information can reflect the true information in the loss function. This special, new loss function is quite similar to the one in the previous experiment on the synthetic dataset, where the treated demand matrix had many 0s that can be viewed as loss information. Thus, the predicted values in those areas can also be used as the prediction for the new product.

Experimental Results

After we prepared the appropriate dataset and loss function, we conducted three empirical experiments on the l1-norm, l2/1-norm, and trace-norm regularization terms.

Outcome for l1-Norm

As previously illustrated, the l1-norm can distinguish the important data features applied in the matrix. In this part, it distinguishes the important neurons, or, in other words, decoded data features. This colormap is shown in Fig. 8.7. In the figure, we can clearly distinguish some important neurons, e.g., neurons 12, 50, 71, and 94. We can also find several significant results that cluster in rows 1, 2, 4, 6, and 7, meaning that the corresponding products might be significant among all the products, which will be further checked by the l2/1-norm and trace norm regularizations.

Outcome for l2/1-Norm

The outcome of the l2/1-norm can be used to distinguish auxiliary and main tasks. In the experiment for the l1-norm, we found that the sixth product has the most significant result among all the tasks, which is checked in this experiment. The corresponding colormap is shown in Fig. 8.8.


Fig. 8.8 Final weight matrix after l2/1-norm regularization for empirical dataset

We can observe significant results among rows 1, 2, 4, 6, and 7 in Fig. 8.8, and there are some significant features in rows 9 and 10, but they are not sufficient to distinguish these rows from the other rows that mainly consist of 0s. This shows that there might be some relationship between those tasks.

Outcome for Trace Norm

The trace norm is an approximation of the rank of a matrix, and the rank indicates the number of linearly independent columns and rows. However, a general dataset might not possess a strong grouping tendency. Thus, if we combine the column and row criteria of grouping, we can say that if the same significant neuron(s) with the same sign (positive/negative) can be observed in two or more tasks, these products can be clustered in the same group. The colormap is shown in Fig. 8.9. Figure 8.9 shows a clear division between the row vectors and column vectors, which is an improvement over Fig. 8.8, where the boundary between tasks and features is not as clear. Figure 8.9 further distinguishes the important tasks from the unimportant tasks, where the important tasks have greater parameter values, which is reflected by the more colorful blocks. This finding is consistent with what is observed in Fig. 8.8, where products 1, 2, 4, 6, and 7 showed significant results among all the ten tasks, while tasks 9 and 10 have some significant values, but not as significant as the five important tasks.
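The grouping criterion stated above (two tasks belong together when they share a significant neuron with the same sign) can be sketched as follows. Everything concrete in this snippet is an assumption made for illustration: the threshold that defines "significant", the function names, the toy weight matrix, and the choice to merge groups transitively. Tasks are indexed from 0.

```python
def significant_pattern(row, threshold):
    """Map each significant neuron index to the sign (+1 or -1) of its weight."""
    return {j: (1 if w > 0 else -1) for j, w in enumerate(row) if abs(w) > threshold}


def share_signed_neuron(row_a, row_b, threshold):
    """True if both tasks have a significant weight of the same sign on some neuron."""
    pattern_a = significant_pattern(row_a, threshold)
    pattern_b = significant_pattern(row_b, threshold)
    return any(pattern_b.get(j) == sign for j, sign in pattern_a.items())


def group_tasks(weights, threshold):
    """Group tasks (rows) that are linked by a shared same-sign significant neuron."""
    parent = list(range(len(weights)))

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in range(len(weights)):
        for b in range(a + 1, len(weights)):
            if share_signed_neuron(weights[a], weights[b], threshold):
                parent[find(b)] = find(a)

    groups = {}
    for i in range(len(weights)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy final-layer weights (assumed): 4 tasks (rows) by 3 neurons (columns).
weights = [
    [0.9, 0.0, 0.0],
    [0.7, 0.0, -0.8],
    [0.0, 0.0, 0.6],
    [0.0, 0.05, 0.0],
]
print(group_tasks(weights, threshold=0.5))
```

Here tasks 0 and 1 are grouped through the first neuron, whereas tasks 1 and 2 are not grouped even though both load on the third neuron, because the signs differ.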

Fig. 8.9 Final weight matrix after trace norm regularization for empirical dataset

8.4.4 Testing Error Without Insignificant Tasks

In Sect. 8.1, we introduced the term “Inductive Bias” [12], which refers to the phenomenon that integrating unimportant tasks with important tasks can be helpful to the predictions of the main tasks, as those unimportant tasks bring some other hidden information that helps the task training. However, if the unimportant tasks are highly unrelated to the important tasks, utilizing those tasks makes no difference, or sometimes even makes the predictability worse, which is called “Negative Transfer” [34]. In the previous numerical experiments, whether the unimportant tasks are related to the important tasks is still unknown. We now design an experiment to answer this question for this dataset. In the experiments above, we concluded that products 1, 2, 4, 6, and 7 can be viewed as important products, while the rest of the products are not important. Therefore, we conduct two experiments with the three different norm regularization terms: training with the unimportant tasks and training without the unimportant tasks. We use TrainErr2 mentioned above as the criterion of the out-of-sample predictability. Besides, we also introduce the original l2-norm regularization for the last layer as a reference. The outcomes are shown in Table 8.3.

Table 8.3 Testing error with and without unimportant tasks for three norm regularization terms

Norm     With unimportant tasks   Without unimportant tasks   Increase/decrease
l2       7152.084                 6515.074                    Decrease
l1       3754.015                 3560.055                    Decrease
l2/1     4052.208                 4017.346                    Decrease
Trace    4160.093                 4354.889                    Increase

This table shows an interesting fact: for different norm regularization terms, the answer to whether task relatedness influences predictability can be different. The trace norm regularization is highly influenced by task relatedness, so deleting the unimportant tasks can cause a negative bias in the final answer. For the l1-norm and l2/1-norm regularizations, deleting the unimportant tasks has a positive effect on predictability, but the extent differs: for the l2/1-norm regularization, the difference between the testing errors is not as large as for the other two norm regularizations. Another noticeable point is that, compared to the l2-norm, the other three norm regularizations are superior in terms of predictability, which further supports that the l2-norm regularization is indeed not a multi-task regularization term.
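The size of each effect in Table 8.3 can be made explicit with a short calculation. The snippet below only re-expresses the reported testing errors as absolute and relative changes; the dictionary layout and the function name `relative_change` are illustrative choices, not part of the chapter.

```python
# Testing errors from Table 8.3: (with unimportant tasks, without unimportant tasks).
table_8_3 = {
    "l2": (7152.084, 6515.074),
    "l1": (3754.015, 3560.055),
    "l2/1": (4052.208, 4017.346),
    "trace": (4160.093, 4354.889),
}


def relative_change(with_tasks, without_tasks):
    """Change in testing error after removing the unimportant tasks, as a fraction."""
    return (without_tasks - with_tasks) / with_tasks


for name, (with_tasks, without_tasks) in table_8_3.items():
    change = relative_change(with_tasks, without_tasks)
    direction = "decrease" if change < 0 else "increase"
    print(f"{name:>5}: {without_tasks - with_tasks:+10.3f} ({change:+.2%}) {direction}")
```

The relative changes make the comparison in the text easier to read: the l2/1-norm error moves by less than one percent, while the l1-norm and trace norm errors move by roughly five percent in opposite directions.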

8.5 Conclusion

This chapter presents several improvements on the current machine learning methods for the multi-product newsvendor model. First, we utilized a multi-output neural network for the multi-product newsvendor model. Second, we introduced an indexing matrix for loss information in demand. Third, we proposed and investigated three kinds of norm regularization terms under the multi-task learning structure and developed two optimization methods (proximal gradient and proximal operator) for the Deep Neural Network. To our knowledge, this study on norm regularization in neural networks and the related optimization algorithms is the first of its kind. Although proximal gradient and proximal operator algorithms have been fully studied under linear-regression-based models (SVM, group LASSO, etc.), we extended these algorithms to the Deep Neural Network and provided numerical experiments (convergence analysis) supporting the algorithms' robustness. Our work can be extended in various aspects. For example, we might apply other proximal algorithms to the neural network, e.g., the Accelerated Gradient Algorithm, and we might also study time-series neural network structures for multi-task learning, e.g., multi-task learning LSTM.

References

1. Gallego, G., Moon, I.: The distribution free newsboy problem: review and extensions. J. Oper. Res. Soc. 44(8), 825–834 (1993)
2. Lau, A.H.L., Lau, H.S.: The newsboy problem with price-dependent demand distribution. IIE Trans. 20, 168–175 (1988)
3. Ingene, C.A., Parry, M.E.: Coordination and manufacturer profit maximization: the multiple retailer channel. J. Retail. 71(2), 129–151 (1995)
4. Weng, Z.K.: Pricing and ordering strategies in manufacturing and distribution alliances. IIE Trans. 29(8), 681–692 (1997)
5. Shukla, M., Jharkharia, S.: ARIMA model to forecast demand in fresh supply chains. Int. J. Oper. Res. 11(1), 1–18 (2011)
6. Alwan, L.C.: The dynamic newsvendor model with correlated demand. Decis. Sci. 47(1), 11–30 (2016)
7. Carbonneau, R., Laframboise, K., Vahidov, R.: Application of machine learning techniques for supply chain demand forecasting. Eur. J. Oper. Res. 184, 1140–1154 (2008)
8. Rudin, C., Ban, G.Y.: The Big Data Newsvendor: Practical Insights from Machine Learning Analysis. MIT Sloan School of Management Working Paper. MIT (2013). http://hdl.handle.net/1721.1/81412
9. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 802–810. NIPS'15. MIT Press, Montreal (2015). http://dl.acm.org/citation.cfm?id=2969239.2969329
10. Oroojlooyjadid, A., Snyder, L.V., Takác, M.: Applying deep learning to the newsvendor problem. CoRR abs/1607.02177 (2016). http://arxiv.org/abs/1607.02177
11. Zhang, Y., Gao, J.: Assessing the performance of deep learning algorithms for newsvendor problem. CoRR abs/1706.02899 (2017). http://arxiv.org/abs/1706.02899
12. Caruana, R.: Multitask learning. Ph.D. thesis, CMU (1997). http://reports-archive.adm.cs.cmu.edu/anon/1997/CMUCS-97-203.pdf
13. Argyriou, A., Evgeniou, T., Pontil, M.: Multi-task feature learning. Adv. Neural Inf. Process. Syst. 19, 41–48 (2007)
14. Chen, J., Liu, J., Ye, J.: Learning incoherent sparse and low-rank patterns from multiple tasks. ACM Trans. Knowl. Discov. Data (TKDD) 5(4), 22 (2012)
15. Chen, J., Zhou, J., Ye, J.: Integrating low-rank and group-sparse structures for robust multi-task learning. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 42–50 (2011)
16. Yang, Y., Hospedales, T.M.: Trace norm regularised deep multi-task learning. CoRR abs/1606.04038 (2017). http://arxiv.org/abs/1606.04038
17. Argyriou, A., Evgeniou, T., Pontil, M.: Convex multi-task feature learning. Mach. Learn. 73(3), 243–272 (2008)
18. Zhang, C.H., Huang, J.: The sparsity and bias of the lasso selection in high-dimensional linear regression. Ann. Stat. 36, 1567–1594 (2008)
19. Liu, J., Ji, S., Ye, J.: Multi-task feature learning via efficient l2,1-norm minimization. In: UAI '09 Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence vol. 1(1), pp. 339–348 (2009)
20. Liao, Y., Banerjee, A., Yan, C.: A distribution-free newsvendor model with balking and lost sales penalty. Int. J. Prod. Econ. 133(1), 224–227 (2011)
21. Hashimoto, K., Xiong, C., Tsuruoka, Y., Socher, R.: A joint many-task model: growing a neural network for multiple nlp tasks. CoRR abs/1611.01587 (2016). http://arxiv.org/abs/1611.01587
22. Misra, I., Shrivastava, A., Gupta, A., Hebert, M.: Cross-stitch networks for multi-task learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3994–4003 (2016)
23. Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. arXiv preprint arXiv:1705.07115 (2017)
24. Ando, R.K., Zhang, T.: A framework for learning predictive structures from multiple tasks and unlabeled data. J. Mach. Learn. Res. 6, 1817–1853 (2005)
25. Arcelus, F.J., Kumar, S., Srinivasan, G.: Retailer's response to alternate manufacturer's incentives under a single-period, price-dependent, stochastic-demand framework. Decis. Sci. 36(4), 599–626 (2005)
26. Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press, Cambridge (2004)
27. Nie, F., Huang, H., Cai, X., Ding, C.H.: Efficient and robust feature selection via joint l2,1-norms minimization. In: Advances in Neural Information Processing Systems, pp. 1813–1821 (2010)
28. Watson, G.A.: Characterization of the subdifferential of some matrix norms. Linear Algebra Appl. 170, 33–45 (1992)
29. Jaggi, M., Sulovský, M.: A simple algorithm for nuclear norm regularized problems. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 471–478 (2010)
30. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
31. Parikh, N., Boyd, S., et al.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014)
32. Jenatton, R., Mairal, J., Bach, F.R., Obozinski, G.R.: Proximal methods for sparse hierarchical dictionary learning. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 487–494 (2010)
33. Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)
34. Ruder, S.: An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098 (2017). https://arxiv.org/pdf/1706.05098.pdf

Chapter 9

Recurrent Neural Networks for Multimodal Time Series Big Data Analytics

Mingyuan Bai and Boyan Zhang

M. Bai (*) The University of Sydney Business School, The University of Sydney, Darlington, NSW, Australia e-mail: [email protected]
B. Zhang The School of Information Technologies, The University of Sydney, Darlington, NSW, Australia e-mail: [email protected]
© Springer Nature Switzerland AG 2019 K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_9

Abstract This chapter considers the challenges of using Recurrent Neural Networks (RNNs) for big multimodal time series forecasting, where both the spatial and temporal information has to be used for accurate forecasting. Although RNN and its variations such as Long Short-Term Memory (LSTM), Gated Recurrent Unit (GRU), and Simple Recurrent Unit (SRU) progressively improve the quality of training outcomes through their network structures and optimisation techniques, one major limitation of such models is that most of them vectorise the input data and thus destroy a continuous spatial representation. We propose an approach termed the Tensorial Recurrent Neural Network (TRNN), which addresses the problem of how to analyse multimodal data while also considering the relationship along the time series, and show that the TRNN outperforms other RNN models for image captioning applications.

9.1 Introduction

In recent years, there has been an increasing demand for multi-dimensional data analysis, i.e., “big data”. Multi-dimensional data are also referred to as multimodal data. What especially appeals to academia and industry are data in no less than two dimensions with both temporal and spatial information, arising from the growing computing speed of machines. Therefore, the consequent problems are to design a structured model to extract the most salient representation from its longitudinal subspace, to conduct the inference, and to forecast the future data ($T = n + 1$) with the given sequential data ($T = 1, \ldots, n$). For example, given only a set of multimedia data, including continuous video frames, audio, text, etc., we intend to find out the global content and predict possible ongoing scenes of the video, audio, text, etc., afterwards. Analysing these time series data and predicting their future trend is, however, an extremely challenging problem, as this task requires a thorough understanding of the data content and their global relationship. Conventional methods for accessing the data content, such as the Deep Convolutional Neural Network (CNN), assume that all the representations are independent of each other, regardless of the correlations among them. In addition, these methods only take vectorial data as the input. For instance, for the image classification task, images are converted into vectors before they are fed into a neural network structure. This vectorisation causes a discontinuous representation. In our previous work [1], we proposed the Tensorial Neural Network (TNN). This TNN can directly take multimodal data as inputs, but it does not consider the temporal relationship within the data.

The Recurrent Neural Network (RNN) is a sub-class of the artificial neural network where connections between units form a directed cycle. This property allows hidden units to access the information from others. With this advantage, RNN and its variants can use the internal memory to process time series data. It should be noted that some well-trained RNNs are theoretically Turing complete, which means that, in theory, they are able to solve any computation problem. There are a variety of architectures of RNNs, such as Elman RNN [2], Jordan RNN [3], long short-term memory (LSTM) [4], gated recurrent units (GRUs) [5], the nonlinear autoregressive exogenous inputs networks [6], and the echo state networks (ESN) [7].
The autoregressive model is also worth mentioning, as it is the classic model generally applied to time series data. Elman RNN and Jordan RNN are the most basic types of RNNs. The research on LSTM has been extraordinarily active in recent years. GRU, as a simpler and more aggregated model than LSTM, has also become increasingly popular in the past 3 years. The Nonlinear Auto-Regressive eXogenous input neural network (NARX) contains a tapped delay input and multiple hidden layers. ESN emphasises sparsity. The simple recurrent unit (SRU) was proposed to increase the training speed. However, the existing models above are all only able to process and analyse vectorised data. Thus, we introduce our newly proposed tensorial recurrent neural networks (TRNNs) [1], which directly process the multimodal time series data without vectorisation and, in consequence, preserve the spatial and temporal information. Note that multimodal data are also named tensorial data [8].

The rest of this chapter is structured as follows. Section 9.2 presents the two most basic RNNs, Elman RNN and Jordan RNN, as essential components of the variants of the RNNs. Subsequently, Sects. 9.3 and 9.4 review the existing literature on LSTM and GRU. Section 9.5 then introduces the autoregressive model. NARX is subsequently presented in Sect. 9.6. Section 9.7 presents ESN. Section 9.8 provides the information on SRU. The newly proposed TRNNs are presented in Sect. 9.9. In Sect. 9.10, the corresponding numerical experiments are conducted for all the aforementioned models. Lastly, Sect. 9.11 concludes the whole chapter with a brief critique of the existing RNNs and autoregressive models.

9.2 Most Basic Recurrent Neural Networks

As the most basic RNNs, Elman RNN and Jordan RNN provide the fundamental idea of RNNs and the foundations of the further variants of RNNs. Elman RNN [2] is also referred to as simple RNN or vanilla RNN. In Elman RNN, there are the input node, the hidden node and the output node [9]. From the second time step, the hidden node at the current time step contains the information from the input node at the current time step and the information from the hidden node at the immediate previous time period. The output node is the activation of the corresponding bias plus the product of the hidden layer at the current time step and the learned weights. The form of Elman RNN [2] thus can be presented as the following equations (Eqs. 9.1–9.2), and the related workflow can be found in Fig. 9.1.

$$h_t = \sigma_h\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right) \tag{9.1}$$

$$y_t = \sigma_y\left(W_{hy} h_t + b_y\right) \tag{9.2}$$

where $h_t$ is the hidden node at the current time step and $h_{t-1}$ is the hidden node at the immediate previous time step. $x_t$ is the current input node and $y_t$ is the current output node. These nodes are vectors. $W_{xh}$ is the weight matrix operating on the current input node and $W_{hh}$ is the connection between the current hidden node and the immediate previous hidden node. $W_{hy}$ can be treated as the coefficient on the current hidden node. $\sigma_h$ and $\sigma_y$ are the activation functions predetermined by users. $b_h$ and $b_y$ are the bias terms, which are vectors.

Fig. 9.1 An illustration of Elman RNN model

Another conventional RNN model is the Jordan model. It has the same three nodes as Elman RNN. However, the information flow is slightly different. The hidden node at the current time step contains the information from the current input node and the immediate previous output node rather than the immediate previous hidden node [3]. The workflow of a Jordan RNN model can be found in Fig. 9.2. Mathematically, a Jordan model can be expressed as

$$h_t = \sigma_h\left(W_{xh} x_t + W_{yh} y_{t-1} + b_h\right) \tag{9.3}$$

$$y_t = \sigma_y\left(W_{hy} h_t + b_y\right) \tag{9.4}$$

where all the notations share the same meaning as in Elman RNN, except the vector $y_{t-1}$, which indicates the output node at time $(t - 1)$.

Fig. 9.2 An illustration of Jordan RNN model

Elman and Jordan models are the cornerstones of the subsequent variants of RNNs, including LSTM and GRU. Since Elman RNN is easier to understand and interpret for the information flow, the following RNN models are presented with the Elman mapping. However, both of them suffer from the gradient explosion issue [10] or the vanishing gradient problem [11, 12], as there is no filter or gate on the information flow. Since the model is recursive across time, when the weights are initialised with values between −1 and 1, the recursive multiplication of the weights can make the gradients vanish to 0. If the weights can range outside −1 to 1, by the same principle, there can be gradient explosion. In consequence, other RNN models need to be developed to avoid these two issues.
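The two recursions in Eqs. (9.1)–(9.4) can be sketched in plain Python as follows. This is a minimal illustration under assumptions that are not in the chapter: tanh is chosen for the hidden activation, the output activation is taken as the identity, and the weight values, dimensions, and the helper names `matvec`, `add`, and `recurrent_factor` are hypothetical. The last helper mimics, in the scalar case, the repeated multiplication by a recurrent weight that underlies the vanishing and exploding gradient discussion.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vectors):
    """Element-wise sum of several vectors of equal length."""
    return [sum(values) for values in zip(*vectors)]


def elman_step(x_t, h_prev, p):
    """Eqs. (9.1)-(9.2): the hidden node feeds back into the next hidden node."""
    pre = add(matvec(p["W_xh"], x_t), matvec(p["W_hh"], h_prev), p["b_h"])
    h_t = [math.tanh(a) for a in pre]              # assumed hidden activation
    y_t = add(matvec(p["W_hy"], h_t), p["b_y"])    # assumed identity output activation
    return h_t, y_t


def jordan_step(x_t, y_prev, p):
    """Eqs. (9.3)-(9.4): the previous output feeds back into the hidden node."""
    pre = add(matvec(p["W_xh"], x_t), matvec(p["W_yh"], y_prev), p["b_h"])
    h_t = [math.tanh(a) for a in pre]
    y_t = add(matvec(p["W_hy"], h_t), p["b_y"])
    return h_t, y_t


def recurrent_factor(w, steps):
    """Scalar stand-in for multiplying by the same recurrent weight at every step."""
    return w ** steps


# Hypothetical parameters: 2 inputs, 2 hidden units, 1 output.
params = {
    "W_xh": [[0.5, -0.5], [0.25, 0.0]],
    "W_hh": [[0.1, 0.0], [0.0, 0.1]],
    "W_yh": [[0.2], [-0.2]],
    "W_hy": [[1.0, -1.0]],
    "b_h": [0.0, 0.0],
    "b_y": [0.0],
}

sequence = [[1.0, 0.0], [0.0, 1.0]]
h, y = [0.0, 0.0], [0.0]
for x_t in sequence:
    h, _ = elman_step(x_t, h, params)      # Elman carries the hidden node forward
    _, y = jordan_step(x_t, y, params)     # Jordan carries the output forward
print(h, y)
print(recurrent_factor(0.5, 50), recurrent_factor(1.5, 50))
```

The only structural difference between the two step functions is which quantity is carried to the next time step. The last line shows the scalar effect behind the gradient issues: a factor of magnitude below 1 shrinks toward 0 over 50 steps, while a factor above 1 grows very large.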

9.3

Long Short-Term Memory

In order to prevent the vanishing gradient problem and the gradient explosion problem and thus conduct the accurate and stable long-term and short-term learning and forecast, LSTM is proposed [4]. It is widely applied in natural language processing [13], bioinformatics [14], human action recognition, stock market prediction [15], etc. LSTM imitates the learning process in human brains. It uses the cells and the gates for the information ﬂow to forget the useless information, to add the necessary information, and subsequently read the information as a proportion from the memory stored at the current time step after the previous two steps. The information read from the memory is also used to handle the information at the next time step.

9 Recurrent Neural Networks for Multimodal Time Series Big Data Analytics


From the general picture of LSTM presented above, the cells and the gates differentiate the information flow within LSTM from that within the Elman RNN and the Jordan RNN. Specifically, the cells can carry information from the very first time step to the last time step. The value of the cell is called the cell state and is denoted as $c_t$. This value is used to compute the hidden node $h_t$ through a functional transformation. $c_t$ is a function of the immediately previous cell state $c_{t-1}$ ($t > 1$), the value of the available input at the current time step $x_t$ or the hidden node at the immediately previous layer, and the value of the hidden node at the immediately previous time step $h_{t-1}$ ($t > 1$). Note that the current input data $x_t$ and the current hidden node from the previous layer cannot occur simultaneously. These transformations are performed before producing the ultimate output $y_t$.

The details of these transformations, in which the gates are applied to protect the training procedure from the gradient explosion issue and the vanishing gradient problem, are as follows. First, the useless information from $h_{t-1}$ and $x_t$ (or the current hidden node at the previous layer) is removed and the useful information is retained through the current forget gate, denoted as $f_t$, whose sigmoid activation function restricts the value of $f_t$ to between 0 and 1. Then the necessary information is added to the information flow by the current input gate $i_t$ and the current candidate state $\hat{c}_t$. The input gate transforms the information in the hidden node at the immediately previous time step $h_{t-1}$ and the input data $x_t$ with the sigmoid function, whereas the candidate state uses the tanh activation function to restrict the value to between $-1$ and $1$. The current cell state is computed as the sum of the element-wise product of $f_t$ and $c_{t-1}$ and the element-wise product of $i_t$ and $\hat{c}_t$. $f_t$, $i_t$, and $\hat{c}_t$ are intended to store the information in memory.
The output gate is computed afterwards as the sigmoid-activated combination of $h_{t-1}$ and $x_t$ (or the current hidden node at the previous layer) with the corresponding weights. The output gate modifies the information flow to $h_t$: $h_t$ is the element-wise product of $o_t$ and the tanh-activated $c_t$. The mathematical expression is as follows, which is a simplified version of the original LSTM [4, 16]:

Forget gate:

$$f_t = \sigma_g\left(W_f h_{t-1} + U_f x_t + b_f\right) \tag{9.5}$$

Input gate:

$$i_t = \sigma_g\left(W_i h_{t-1} + U_i x_t + b_i\right) \tag{9.6}$$

Output gate:

$$o_t = \sigma_g\left(W_o h_{t-1} + U_o x_t + b_o\right) \tag{9.7}$$

Candidate state:

$$\hat{c}_t = \sigma_c\left(W_c h_{t-1} + U_c x_t + b_c\right) \tag{9.8}$$

Current state:

$$c_t = f_t \circ c_{t-1} + i_t \circ \hat{c}_t \tag{9.9}$$

Current hidden state:

$$h_t = o_t \circ \sigma_h(c_t) \tag{9.10}$$

where $\sigma_g$ is the sigmoid activation function, and $\sigma_c$ and $\sigma_h$ are tanh activation functions. $\circ$ is the element-wise multiplication. $W_f$, $W_i$, $W_o$ and $W_c$ are the weight matrices operating on the hidden node from the previous time step for the forget gate, the input gate, the output gate and the candidate state, respectively. $U_f$, $U_i$, $U_o$ and $U_c$ are also weight matrices; they operate on the current input vector $x_t$, or on the current hidden node from the previous layer, for the forget gate, the input gate, the output gate and the candidate state, respectively. $b_f$, $b_i$, $b_o$ and $b_c$ are the bias terms, which are vectors. Note that if $x_t$ is not available, all the $x_t$ in the LSTM equations should be substituted with $h_t^{(l-1)}$ and $h_t$ should be $h_t^{(l)}$. All the other notations should then also carry the superscript $(l)$. As in the Elman RNN and the Jordan RNN, the variables in the LSTM are all vectors. The information processing of LSTM is illustrated in Fig. 9.3.

Fig. 9.3 An illustration of vector LSTM

With the cells and the gates, the information processing in LSTM diminishes the unnecessary information and preserves the useful information. The cells and the gates also mitigate the vanishing gradient problem and the gradient explosion issue. The learning process is much more stable than that of the most basic RNNs, and the forecasting performance of LSTM is thus also better than that of the Elman RNN and the Jordan RNN. However, the LSTM algorithm and the corresponding back propagation through time (BPTT) algorithm are somewhat complicated for users to grasp quickly. Thus, a new and simpler RNN model can be implemented to avoid the two issues.
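The LSTM update in Eqs. (9.5)–(9.10) can be sketched in NumPy as follows. This is an illustrative sketch rather than a reference implementation: the parameter dictionary layout, the helper names `init_lstm_params` and `lstm_step`, the layer sizes and the small random initialisation are all assumptions.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, used here as sigma_g."""
    return 1.0 / (1.0 + np.exp(-a))

def init_lstm_params(n_in, n_hid, rng):
    """Small random weights for the gates f, i, o and the candidate state c."""
    params = {}
    for gate in "fioc":
        params["W_" + gate] = rng.normal(scale=0.1, size=(n_hid, n_hid))
        params["U_" + gate] = rng.normal(scale=0.1, size=(n_hid, n_in))
        params["b_" + gate] = np.zeros(n_hid)
    return params

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Eqs. (9.5)-(9.10)."""
    f_t = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x_t + p["b_f"])    # forget gate, Eq. (9.5)
    i_t = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x_t + p["b_i"])    # input gate, Eq. (9.6)
    o_t = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x_t + p["b_o"])    # output gate, Eq. (9.7)
    c_hat = np.tanh(p["W_c"] @ h_prev + p["U_c"] @ x_t + p["b_c"])  # candidate state, Eq. (9.8)
    c_t = f_t * c_prev + i_t * c_hat                                # cell state, Eq. (9.9)
    h_t = o_t * np.tanh(c_t)                                        # hidden state, Eq. (9.10)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5                                  # illustrative sizes
params = init_lstm_params(n_in, n_hid, rng)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(10, n_in)):             # a toy sequence of ten inputs
    h, c = lstm_step(x_t, h, c, params)
```

Because the forget gate multiplies $c_{t-1}$ element-wise rather than through a weight matrix, the cell state offers a more direct path for information across time steps than the plain Elman recursion.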

9.4

Gated Recurrent Units

Gated recurrent units (GRUs) [5] are a variant of RNN with a more aggregated structure than LSTM. The core of the information flow is similar to that of LSTM, except that GRUs merge the forget gate and the input gate into one single gate, the update gate, and collapse the cell state and the hidden state (the value of the hidden nodes) into one. Therefore, GRUs have fewer parameters than LSTM and are faster to train [17]. GRUs have application fields similar to those of LSTM, since both use a gated algorithm and capture the sequential relationship among the data. The mathematical form of GRUs is as follows [5]:

Reset gate:

$$r_t = \sigma_g\left(W_r h_{t-1} + U_r x_t + b_r\right) \tag{9.11}$$

Update gate:

$$z_t = \sigma_g\left(W_z h_{t-1} + U_z x_t + b_z\right) \tag{9.12}$$

Current state:

$$\hat{r}_t = r_t \circ h_{t-1} \tag{9.13}$$

Candidate state:

$$\tilde{h}_t = \sigma_h\left(W_h \hat{r}_t + U_h x_t + b_h\right) \tag{9.14}$$

New state:

$$h_t = z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t \tag{9.15}$$

where $\sigma_g$ is the sigmoid activation function and $\sigma_h$ is the tanh activation function. $W_r$, $W_z$, $W_h$, $U_r$, $U_z$, and $U_h$ are the weight matrices. $b_r$, $b_z$, and $b_h$ are vector biases. $x_t$, $h_t$, and $h_{t-1}$ have the same meaning as in LSTM and are vectors. The rest of the variables are also vectors. $\circ$ is the element-wise multiplication. To visualise the information processing in GRU, Fig. 9.4 presents the information flow before obtaining the value of the current hidden node, i.e., the hidden state, and the ultimate output $y_t$.

Fig. 9.4 An illustration of vector GRU

With all other conditions the same, GRUs have fewer parameters than LSTMs. Consequently, GRUs update their weights faster and converge faster to the minimum training cost than LSTMs. However, the existing GRUs and LSTMs are only able to process vector data. Even if the input data are high-dimensional with a special spatial structure, GRUs and LSTMs still vectorise them into an extremely long vector and analyse it. Therefore, for large-volume input data, they can be computationally costly. This issue motivates this chapter to propose a new GRU and a new LSTM which can directly handle and analyse large-volume, high-dimensional data with complicated spatial and temporal relationships.
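A corresponding sketch of the GRU update in Eqs. (9.11)–(9.15) is given below, together with a simple parameter count that makes the comparison with LSTM concrete. The helper names, the parameter layout and the sizes are illustrative assumptions; the count simply treats a GRU as three weight blocks and an LSTM as four blocks of the same shape.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, used here as sigma_g."""
    return 1.0 / (1.0 + np.exp(-a))

def init_gru_params(n_in, n_hid, rng):
    """Small random weights for the reset gate r, update gate z and candidate h."""
    params = {}
    for block in "rzh":
        params["W_" + block] = rng.normal(scale=0.1, size=(n_hid, n_hid))
        params["U_" + block] = rng.normal(scale=0.1, size=(n_hid, n_in))
        params["b_" + block] = np.zeros(n_hid)
    return params

def gru_step(x_t, h_prev, p):
    """One GRU time step following Eqs. (9.11)-(9.15)."""
    r_t = sigmoid(p["W_r"] @ h_prev + p["U_r"] @ x_t + p["b_r"])     # reset gate, Eq. (9.11)
    z_t = sigmoid(p["W_z"] @ h_prev + p["U_z"] @ x_t + p["b_z"])     # update gate, Eq. (9.12)
    r_hat = r_t * h_prev                                             # current state, Eq. (9.13)
    h_tilde = np.tanh(p["W_h"] @ r_hat + p["U_h"] @ x_t + p["b_h"])  # candidate, Eq. (9.14)
    return z_t * h_prev + (1.0 - z_t) * h_tilde                      # new state, Eq. (9.15)

def count_params(n_in, n_hid, n_blocks):
    """Number of weights and biases in `n_blocks` blocks of shape (W, U, b)."""
    return n_blocks * (n_hid * n_hid + n_hid * n_in + n_hid)

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5                                   # illustrative sizes
params = init_gru_params(n_in, n_hid, rng)

h = np.zeros(n_hid)
for x_t in rng.normal(size=(10, n_in)):              # a toy sequence of ten inputs
    h = gru_step(x_t, h, params)

gru_count = count_params(n_in, n_hid, 3)             # r, z and candidate blocks
lstm_count = count_params(n_in, n_hid, 4)            # f, i, o and candidate blocks
```

For the same input and hidden sizes, the GRU therefore carries three quarters of the LSTM parameters in this counting.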

9.5

Autoregressive Model

The autoregressive model has long been widely applied in forecasting systems. It specifies that the output variable depends linearly on its own previous values and a stochastic term. Given an autoregressive model of order 1 with an input sequence $x$ of length $t$, the corresponding mapping function is defined as in Eq. (9.16):

$$x_t = \omega_0 + \omega_1 x_{t-1} + \varepsilon_t \tag{9.16}$$

where $\omega_0$ is a constant, $\omega_1$ is the weight to be trained, and $\varepsilon_t$ is the random error. In this model, the response variable in the previous time period becomes the predictor, and the errors follow the usual assumptions about errors in a simple linear regression model. The model in Eq. (9.16) describes a first-order autoregression. That is, the value at the present time is directly associated with, and predicted from, the value at time $t-1$. However, in most cases, more previous data at times $t-1, t-2, \ldots, t-N$ may be taken into consideration for a more comprehensive result. An $N$th-order autoregressive model is then written as

$$x_t = \omega_0 + \sum_{i=1}^{N} \omega_i x_{t-i} + \varepsilon_t \tag{9.17}$$

where $\omega_0$ is a constant, $\omega_i$ is the weight at time $t-i$, and $\varepsilon_t$ is the random error. The value at present, as shown in Eq. (9.17), is predicted from the values at times $t-N$ to $t-1$. An $N$th-order autoregressive model can therefore be viewed as a combination of $N$ linear functions at times $t-N$ to $t-1$.
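As an illustration of Eq. (9.17), the weights $\omega_0, \ldots, \omega_N$ can be estimated by ordinary least squares on lagged copies of the series. The sketch below is one possible way to do this and is not taken from the chapter; the helper names `fit_ar` and `predict_next` are hypothetical, and the toy series is generated without noise so that the recovered weights are easy to check.

```python
import numpy as np

def fit_ar(x, order):
    """Least-squares estimate of (w_0, w_1, ..., w_N) in Eq. (9.17)."""
    x = np.asarray(x, dtype=float)
    # The row for time t holds the lagged values x_{t-1}, ..., x_{t-N}.
    lags = np.array([x[t - order:t][::-1] for t in range(order, len(x))])
    design = np.column_stack([np.ones(len(lags)), lags])
    coeffs, *_ = np.linalg.lstsq(design, x[order:], rcond=None)
    return coeffs

def predict_next(x, coeffs):
    """One-step-ahead forecast from the most recent N values."""
    order = len(coeffs) - 1
    recent = np.asarray(x[-order:], dtype=float)[::-1]   # x_t, x_{t-1}, ...
    return float(coeffs[0] + coeffs[1:] @ recent)

# Toy first-order series x_t = 1.0 + 0.5 * x_{t-1} (noise-free for clarity).
series = [0.0]
for _ in range(30):
    series.append(1.0 + 0.5 * series[-1])

coeffs = fit_ar(series, order=1)
forecast = predict_next(series, coeffs)
```

With real data the error term $\varepsilon_t$ is present, so the estimates only approximate the underlying weights.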

9.6

Nonlinear AutoRegressive eXogenous Inputs Networks

The Nonlinear AutoRegressive eXogenous inputs network (NARX) is an autoregressive neural network architecture with a tapped delay input and multiple hidden layers [6]. In addition, the model contains an error term, which reflects the fact that knowledge from the other time steps will not enable the current value of the time series to be predicted exactly. Figure 9.5 shows a general scheme of a NARX network. Here the size of the input $i_t$ is defined by its corresponding time-delay lines (TDLs):

$$i_t = \begin{bmatrix} \left(x_{t-n_x}, \ldots, x_t\right)^T \\ \left(y_{t-n_y}, \ldots, y_t\right)^T \end{bmatrix}^T \tag{9.18}$$

The current output signal $y_t$ is calculated by considering previous values of the exogenous input and output signals:

$$y_t = f\left(x_{t-n_x}, \ldots, x_t, y_{t-n_y}, \ldots, y_t\right) + \varepsilon_t \tag{9.19}$$

where $f(\cdot)$ is the nonlinear activation function, and $n_x$ and $n_y$ are the time delays of the exogenous input and output signals. The output is fed back to the input of the feed-forward neural network.

Fig. 9.5 An illustration of nonlinear autoregressive network with exogenous inputs (NARX)
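The tapped delay lines and the output feedback can be sketched as follows. This is an illustrative sketch under stated assumptions: the output taps stop at $y_{t-1}$ because $y_t$ is the value being predicted, the network has a single tanh hidden layer with untrained random weights, and the names `narx_input` and `narx_output` are hypothetical.

```python
import numpy as np

def narx_input(x, y, t, n_x, n_y):
    """Stack the delayed inputs x_{t-n_x..t} and the fed-back outputs y_{t-n_y..t-1}."""
    x_taps = x[t - n_x:t + 1]          # exogenous time-delay line
    y_taps = y[t - n_y:t]              # output time-delay line (past values only)
    return np.concatenate([x_taps, y_taps])

def narx_output(i_t, W1, b1, w2, b2):
    """Feed-forward part: one tanh hidden layer followed by a linear read-out."""
    return float(w2 @ np.tanh(W1 @ i_t + b1) + b2)

rng = np.random.default_rng(0)
n_x, n_y, n_hidden = 2, 3, 8                       # illustrative delays and layer size
n_taps = (n_x + 1) + n_y
W1 = rng.normal(scale=0.3, size=(n_hidden, n_taps))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.3, size=n_hidden)
b2 = 0.0

x = np.sin(0.3 * np.arange(30))                    # toy exogenous signal
y = np.zeros(30)
start = max(n_x, n_y)
for t in range(start, len(x)):
    # Closed loop: each prediction is written back and reused as a delayed output.
    y[t] = narx_output(narx_input(x, y, t, n_x, n_y), W1, b1, w2, b2)
```

In training, the fed-back values are often replaced by the observed outputs; the closed loop shown here corresponds to multi-step forecasting.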

9.7

Echo State Network

Echo State Network (ESN) [7] (Fig. 9.6) is a sparsely connected RNN model whose connectivity and weights are randomly assigned. The basic discrete-time ESN with $T$ reservoir units, $N$ inputs, and $M$ outputs is defined as

$$h_t = f\left(W_{hh} h_{t-1} + W_{ih} x_t + W_{oh} y_{t-1} + \varepsilon_t\right) \tag{9.20}$$

where $h_{t-1}$ is the reservoir state, $W_{hh} \in \mathbb{R}^{T \times T}$ is the reservoir weight, $W_{ih} \in \mathbb{R}^{T \times N}$ is the input weight, $W_{oh} \in \mathbb{R}^{T \times M}$ is the output feedback weight, and $f(\cdot)$ is the activation. Note that each $h_t$ at the current stage is influenced by the previous hidden state $h_{t-1}$, the input signal of the current stage $x_t$, and the output of the previous stage $y_{t-1}$. The corresponding output of the current stage $y_t$ is defined as

$$y_t = g\left(W_{io} x_t + W_{ho} h_t\right) \tag{9.21}$$

where $W_{io} \in \mathbb{R}^{M \times N}$ and $W_{ho} \in \mathbb{R}^{M \times T}$ are the weights to be trained, and $g(\cdot)$ is the activation.
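A small sketch of Eqs. (9.20) and (9.21) is given below. Several choices are assumptions rather than part of the chapter: the reservoir is about 10% dense and rescaled to a spectral radius of 0.9, $f$ is tanh, $g$ is the identity, the noise term $\varepsilon_t$ is omitted, the observed output is fed back during state collection (teacher forcing), and the read-out weights are fitted by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
T_res, N_in, M_out = 50, 2, 1                      # reservoir units, inputs, outputs

# Random, sparsely connected reservoir that is never trained.
mask = 0.1 > rng.random((T_res, T_res))
W_hh = rng.normal(size=(T_res, T_res)) * mask
W_hh *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_hh)))   # rescale the spectral radius
W_ih = rng.uniform(-0.5, 0.5, size=(T_res, N_in))
W_oh = rng.uniform(-0.5, 0.5, size=(T_res, M_out))

def reservoir_states(xs, ys):
    """Run Eq. (9.20) over a sequence, feeding back the observed previous output."""
    h = np.zeros(T_res)
    y_prev = np.zeros(M_out)
    states = []
    for x_t, y_t in zip(xs, ys):
        h = np.tanh(W_hh @ h + W_ih @ x_t + W_oh @ y_prev)
        states.append(h)
        y_prev = y_t
    return np.array(states)

steps = np.arange(200)
xs = np.column_stack([np.sin(0.1 * steps), np.cos(0.1 * steps)])   # toy inputs
ys = xs[:, :1] * xs[:, 1:]                                         # toy target, shape (200, 1)

H = reservoir_states(xs, ys)
features = np.hstack([xs, H])                      # read-out sees x_t and h_t, Eq. (9.21)
W_out, *_ = np.linalg.lstsq(features, ys, rcond=None)
W_io, W_ho = W_out[:N_in].T, W_out[N_in:].T        # shapes (M, N) and (M, T)
y_hat = xs @ W_io.T + H @ W_ho.T                   # identity activation g
```

Only `W_io` and `W_ho` are fitted, which is what makes training an ESN a linear regression problem in this sketch.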


Fig. 9.6 Illustration of the Echo State Network

9.8

Simple Recurrent Unit

Although recurrent models achieve remarkable performance in time series data analysis, one long-standing issue of such models is their training efficiency. The forward-pass computation of $h_t$ is blocked until the entire computation of $h_{t-1}$ finishes. In addition, growing model sizes and numbers of parameters increase the training time, which makes training a recurrent unit time-consuming. To counter the increased computation and to increase the training speed of a recurrent model, many new forms of recurrent models have been proposed. Most recently, Lei [18] proposed the Simple Recurrent Unit (SRU) to speed up the training of a recurrent model. Conventional recurrent models such as LSTMs and GRUs use gate units to control the information flow and therefore reduce the effect of the vanishing gradient problem. The typical implementation is as follows:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{x}_t \tag{9.22}$$

where $f_t$ and $i_t$ are the forget gate and the input gate, and $\odot$ is the element-wise multiplication. In comparison, $i_t$ in SRUs is set as $i_t = 1 - f_t$ to increase the training speed. In addition, $\tilde{x}_t$ is a linear transformation of the input, that is, $\tilde{x}_t = W x_t$. Finally, the cell state $C_t$ is passed into an activation function $g(\cdot)$ to produce the hidden state $h_t$. Mathematically, given a sequence of input data $\{x_1, \ldots, x_n\}$, SRUs can be expressed as follows:

$$\tilde{x}_t = W x_t \tag{9.23}$$

$$f_t = \sigma\left(W_f x_t + b_f\right) \tag{9.24}$$

$$r_t = \sigma\left(W_r x_t + b_r\right) \tag{9.25}$$

$$c_t = f_t \odot c_{t-1} + (1 - f_t) \odot \tilde{x}_t \tag{9.26}$$

$$h_t = r_t \odot g(c_t) + (1 - r_t) \odot x_t \tag{9.27}$$

where $\sigma(\cdot)$ is the sigmoid activation. To visualise the forward pass in SRU, Fig. 9.7 presents the information flow before obtaining the value of the current hidden node.

Fig. 9.7 Illustration of an SRU model
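The following sketch of Eqs. (9.23)–(9.27) shows where the speed-up comes from: none of the matrix products depends on the previous state, so they can be computed for all time steps at once, leaving only cheap element-wise operations inside the loop. The function name `sru_forward`, the use of tanh for $g$, and the requirement that the hidden size equals the input size (needed for the term $(1 - r_t) \odot x_t$) are assumptions of this sketch.

```python
import numpy as np

def sigmoid(a):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-a))

def sru_forward(xs, W, W_f, b_f, W_r, b_r):
    """SRU forward pass over a sequence `xs` of shape (n_steps, d)."""
    # Matrix products for all time steps at once (no dependence on c_{t-1}).
    x_tilde = xs @ W.T                         # Eq. (9.23)
    f = sigmoid(xs @ W_f.T + b_f)              # Eq. (9.24)
    r = sigmoid(xs @ W_r.T + b_r)              # Eq. (9.25)

    c = np.zeros(W.shape[0])
    hs = np.empty_like(x_tilde)
    for t in range(len(xs)):                   # only element-wise work is sequential
        c = f[t] * c + (1.0 - f[t]) * x_tilde[t]             # Eq. (9.26)
        hs[t] = r[t] * np.tanh(c) + (1.0 - r[t]) * xs[t]     # Eq. (9.27)
    return hs

rng = np.random.default_rng(0)
d, n_steps = 4, 6                              # illustrative sizes
xs = rng.normal(size=(n_steps, d))
W, W_f, W_r = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
b_f, b_r = np.zeros(d), np.zeros(d)

hs = sru_forward(xs, W, W_f, b_f, W_r, b_r)
```

In LSTM and GRU, by contrast, every gate needs $h_{t-1}$, so the matrix products themselves have to wait for the previous time step.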

9.9

TRNNs

In this chapter, we also propose tensorial recurrent neural networks (TRNNs) designed for longitudinal data prediction. Since there are autocorrelations over time, the classic recurrent neural networks (RNNs) are not powerful enough, in terms of both computational efficiency and capturing the structural information inside the data, to handle multimodal time series and obtain accurate forecasts. Therefore, TRNNs are proposed in this chapter. The long short-term memory (LSTM) model and the gated recurrent units (GRUs) model, as two types of RNNs, have been under active research during the past few years. Therefore, the newly proposed TRNNs are designed as the tensorial LSTM (TLSTM) and the tensorial GRU (TGRU). Section 9.9.1 presents the form of the two newly proposed TRNNs, TLSTM and TGRU. Section 9.9.2 then outlines the loss function and the regularisation used for computing the test errors for prediction and for the training process of TRNNs. Subsequently, Sect. 9.9.3 derives the recurrent BP algorithm to train TRNNs. Note that Sect. 9.9 is based on [1].

9.9.1

Tensorial Recurrent Neural Networks

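From Sect. 9.2, the cornerstone of RNNs is the forward mapping of the Elman model [2],

$$h_t = \sigma_h\left(W_{hx} x_t + W_{hh} h_{t-1} + b_h\right), \qquad y_t = \sigma_y\left(W_{hy} h_t + b_y\right) \tag{9.28}$$

or of the Jordan model [3],

$$h_t = \sigma_h\left(W_{hx} x_t + W_{hh} y_{t-1} + b_h\right) \tag{9.29}$$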

As demonstrated above, the Elman model uses the information from the previous hidden node and the current input for the prediction of the next time step $t+1$, whereas the Jordan model uses the output at the previous time step together with the current input for forecasting. In this chapter, we utilise the Elman mapping in Eq. (9.28), since it is easier to interpret and makes the information flow easier to demonstrate. What is different is that a tensorial longitudinal dataset is considered. This dataset is denoted as $\mathcal{D} = \{(X_t, Y_t)\}_{t=1}^{T}$, where each $X_t$ is a $D$-dimensional tensor serving as the explanatory data. $Y_t$ is the response data, which can be a scalar, a vector or a tensor with $D$ dimensions as well. In this chapter, we consider the case of $Y_t = X_{t+1}$ in order to learn a prediction model. We extend to tensorial data the two RNN architectures which are under active research: the Long Short-Term Memory units (LSTMs) and the Gated Recurrent Units (GRUs).

Tensorial Long Short-Term Memory

The classic LSTM was proposed in [4] and further implemented by many other researchers; e.g., see [19]. Applications have demonstrated that LSTMs perform well on a large variety of problems and are applied in a wide range of cases. Based on the classic LSTM illustrated in Fig. 9.3, all the input nodes, output nodes, hidden nodes, cells and gates are extended to tensorial variates, where we set the input node and the hidden node as $D$-way tensors. Thus, we propose the following tensorial LSTM (TLSTM):

$$F_t = \sigma_g\left(H_{t-1} \times_1 W_{f1} \times_2 W_{f2} \times_3 \cdots \times_D W_{fD} + X_t \times_1 U_{f1} \times_2 U_{f2} \times_3 \cdots \times_D U_{fD} + B_f\right) \tag{9.30}$$

$$I_t = \sigma_g\left(H_{t-1} \times_1 W_{i1} \times_2 W_{i2} \times_3 \cdots \times_D W_{iD} + X_t \times_1 U_{i1} \times_2 U_{i2} \times_3 \cdots \times_D U_{iD} + B_i\right) \tag{9.31}$$

$$O_t = \sigma_g\left(H_{t-1} \times_1 W_{o1} \times_2 W_{o2} \times_3 \cdots \times_D W_{oD} + X_t \times_1 U_{o1} \times_2 U_{o2} \times_3 \cdots \times_D U_{oD} + B_o\right) \tag{9.32}$$

$$\hat{C}_t = \sigma_c\left(H_{t-1} \times_1 W_{c1} \times_2 W_{c2} \times_3 \cdots \times_D W_{cD} + X_t \times_1 U_{c1} \times_2 U_{c2} \times_3 \cdots \times_D U_{cD} + B_c\right) \tag{9.33}$$

$$C_t = F_t \circ C_{t-1} + I_t \circ \hat{C}_t \tag{9.34}$$

$$H_t = O_t \circ \sigma_h(C_t) \tag{9.35}$$

where "$\circ$" is the Hadamard product, which is element-wise, and $\times_d$ is the mode-$d$ product of a tensor with a matrix. $W_{\cdot D}$ and $U_{\cdot D}$ are weight matrices of the relevant sizes, operating on the tensorial hidden nodes and the tensorial input nodes along mode $D$ [8]. All $B$ are tensorial bias terms. $F_t$ is the current forget gate. $I_t$ is the input gate at the current time step. $\hat{C}_t$ is the candidate state at time $t$ and $C_t$ is the current cell state. It should be noted that the output gate $O_t$ is not the actual output of TLSTM; instead, $O_t$ works together with $C_t$ (which is controlled by $I_t$) to produce the hidden state $H_t$ in Eq. (9.35). The type of the response data $Y_t$ determines the extra layer added to TLSTM on top of $H_t$. This layer transforms the tensorial hidden state $H_t$ into the structure of $Y_t$, for instance, a scalar, a vector or a tensor. For convenience of notation, we denote the converted output as $\hat{O}_t$.

Tensorial Gated Recurrent Units

The article [5] proposes the Gated Recurrent Unit (GRU), which is based on LSTM. Recalling the review in Sect. 9.4 and following the architecture of GRU, the newly proposed tensorial GRU (TGRU) aggregates the forget gate $F_t$ and the input gate $I_t$ into one single "update gate". TGRU thus has a simpler structure and fewer parameters than TLSTM.

$$R_t = \sigma_g\left(H_{t-1} \times_1 W_{r1} \times_2 W_{r2} \times_3 \cdots \times_D W_{rD} + X_t \times_1 U_{r1} \times_2 U_{r2} \times_3 \cdots \times_D U_{rD} + B_r\right) \tag{9.36}$$

$$Z_t = \sigma_g\left(H_{t-1} \times_1 W_{z1} \times_2 W_{z2} \times_3 \cdots \times_D W_{zD} + X_t \times_1 U_{z1} \times_2 U_{z2} \times_3 \cdots \times_D U_{zD} + B_z\right) \tag{9.37}$$

$$\hat{R}_t = R_t \circ H_{t-1} \tag{9.38}$$

$$\hat{H}_t = \sigma_h\left(\hat{R}_t \times_1 W_{h1} \times_2 W_{h2} \times_3 \cdots \times_D W_{hD} + X_t \times_1 U_{h1} \times_2 U_{h2} \times_3 \cdots \times_D U_{hD} + B_h\right) \tag{9.39}$$

$$H_t = Z_t \circ H_{t-1} + (1 - Z_t) \circ \hat{H}_t \tag{9.40}$$


$R_t$ is the reset gate and $Z_t$ is the update gate. The current state is $\hat{R}_t$, and $\hat{H}_t$ is the candidate gate. As with TLSTM, an additional transformation is applied in an extra layer on top of the hidden variables $H_t$ to match the shape of the response data $Y_t$.
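To make the tensorial mapping concrete, the sketch below implements the mode-$d$ product and one TLSTM step in the form of Eqs. (9.30)–(9.35) with NumPy. It is a sketch under assumptions, not the authors' implementation: the helper names, the storage of each gate as a tuple `(Ws, Us, B)` and the initialisation scale are hypothetical. The input and hidden sizes, 25 × 25 × 4 and 50 × 50 × 4, are the Case I sizes reported in Sect. 9.10.1. The final lines compare the parameter count of one tensorial gate with that of a gate in a vectorised LSTM of the same input and hidden sizes.

```python
import numpy as np

def sigmoid(a):
    """Logistic function, used here as sigma_g."""
    return 1.0 / (1.0 + np.exp(-a))

def mode_product(tensor, matrix, mode):
    """Mode-`mode` product: multiply every fibre along axis `mode` by `matrix`."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def tucker_map(H_prev, X_t, Ws, Us, B):
    """H_prev x_1 W_1 ... x_D W_D + X_t x_1 U_1 ... x_D U_D + B."""
    h_part, x_part = H_prev, X_t
    for mode, (W, U) in enumerate(zip(Ws, Us)):
        h_part = mode_product(h_part, W, mode)
        x_part = mode_product(x_part, U, mode)
    return h_part + x_part + B

def init_block(x_shape, h_shape, rng):
    """Mode-wise weights for one gate: W_d is h_d x h_d, U_d is h_d x x_d."""
    Ws = [rng.normal(scale=0.05, size=(h, h)) for h in h_shape]
    Us = [rng.normal(scale=0.05, size=(h, x)) for h, x in zip(h_shape, x_shape)]
    return Ws, Us, np.zeros(h_shape)

def tlstm_step(X_t, H_prev, C_prev, blocks):
    """One TLSTM time step in the form of Eqs. (9.30)-(9.35)."""
    F = sigmoid(tucker_map(H_prev, X_t, *blocks["f"]))
    I = sigmoid(tucker_map(H_prev, X_t, *blocks["i"]))
    O = sigmoid(tucker_map(H_prev, X_t, *blocks["o"]))
    C_hat = np.tanh(tucker_map(H_prev, X_t, *blocks["c"]))
    C_t = F * C_prev + I * C_hat
    H_t = O * np.tanh(C_t)
    return H_t, C_t

rng = np.random.default_rng(0)
x_shape, h_shape = (25, 25, 4), (50, 50, 4)
blocks = {gate: init_block(x_shape, h_shape, rng) for gate in "fioc"}

H, C = np.zeros(h_shape), np.zeros(h_shape)
for X_t in rng.normal(size=(4,) + x_shape):        # four toy time steps
    H, C = tlstm_step(X_t, H, C, blocks)

# Parameters of one gate: mode-wise matrices plus bias versus a vectorised LSTM gate.
n_h, n_x = int(np.prod(h_shape)), int(np.prod(x_shape))
tensor_params = (sum(h * h for h in h_shape)
                 + sum(h * x for h, x in zip(h_shape, x_shape)) + n_h)
vector_params = n_h * n_h + n_h * n_x + n_h
```

For these sizes one tensorial gate has 17,532 parameters, whereas the vectorised gate has 125,010,000, which illustrates the computational argument for keeping the tensor structure.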

9.9.2

Loss Function

Recurrent neural networks can be used to predict a single target, as in regression and classification problems, or to predict a sequence from another sequence. According to the presentation of the data, three classes of loss functions are proposed for tensorial data analysis tasks.

The first type of loss function is for a single data series. When the training data are presented as a single time series as input and a single tensorial target as output, we use LSTM or GRU to model the series and to match a sequence of targets. In other words, the output $\hat{O}_t$ at time $t$, calculated from $\{X_1, \ldots, X_t\}$ with TRNNs, is matched against the observed target $Y_t$. Therefore, we define the simple loss at each time $t$ as

$$\ell_s(t) = \ell\left(Y_t, \hat{O}_t\right)$$

which is the basis for all the overall loss functions that follow. $\ell$ is a loss function, which can be set as the squared loss in regression tasks and the cross-entropy loss in classification tasks. Hence, the overall loss is defined as

$$\ell_s = \sum_{t=1}^{T} \ell_s(t) = \sum_{t=1}^{T} \ell\left(Y_t, \hat{O}_t\right) \tag{9.41}$$

The second loss function is for more than one data series of the same time length. The recurrent network structure is often applied for a certain duration. In this case, there are multiple training series in the training data, as follows:

$$\mathcal{D} = \left\{\left(X_{j1}, \ldots, X_{jT}; Y_j\right)\right\}_{j=1}^{N}$$

For the $j$th case, we only compute the loss at time $T$ as

$$\ell_T(j) = \ell\left(Y_j, \hat{O}_{jT}\right)$$

where the input series $(X_{j1}, \ldots, X_{jT})$ produces the last output of TLSTM or TGRU, $\hat{O}_{jT}$. Therefore, we obtain the overall loss as

$$\ell_T = \sum_{j=1}^{N} \ell_T(j) = \sum_{j=1}^{N} \ell\left(Y_j, \hat{O}_{jT}\right) \tag{9.42}$$

In addition, if the response data are $(Y_{j1}, \ldots, Y_{jT})$, which resembles the single data series case, we update the loss as

$$\ell_m = \sum_{j=1}^{N} \sum_{t=1}^{T} \ell_m(t, j) = \sum_{j=1}^{N} \sum_{t=1}^{T} \ell\left(Y_{jt}, \hat{O}_{jt}\right) \tag{9.43}$$

The third type of loss function is for panel data. For panel data, the length of time $T$ of each series $(X_{j1}, \ldots, X_{jT})$ may be different. Consequently, TLSTM or TGRU should run through a different number of loops for each series. Assuming the data series are

$$\mathcal{D} = \left\{\left(X_{j1}, \ldots, X_{jT_j}; Y_j\right)\right\}_{j=1}^{N}$$

we define the loss function as

$$\ell_{Tp} = \sum_{j=1}^{N} \ell_{Tp}(j) = \sum_{j=1}^{N} \ell\left(Y_j, \hat{O}_{jT_j}\right) \tag{9.44}$$

or

$$\ell_{mp} = \sum_{j=1}^{N} \sum_{t=1}^{T_j} \ell_m(t, j) = \sum_{j=1}^{N} \sum_{t=1}^{T_j} \ell\left(Y_{jt}, \hat{O}_{jt}\right) \tag{9.45}$$

It is clear that the loss function in Eq. (9.41) is the special case of Eq. (9.43) with $N = 1$. When deriving the BP algorithm in Sect. 9.9.3, the focus is on the loss functions in Eqs. (9.42) and (9.43); the BP algorithms based on the loss functions in Eqs. (9.44) and (9.45) can be derived in the same pattern as those for Eqs. (9.42) and (9.43), respectively.
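The panel-data losses in Eqs. (9.44) and (9.45) can be sketched as follows, with the squared loss standing in for $\ell$. The function names and the toy series (all-zero outputs, all-one targets, series lengths 3 and 5) are illustrative assumptions; with equal series lengths the same functions give Eqs. (9.42) and (9.43).

```python
import numpy as np

def squared_loss(Y, O_hat):
    """Squared loss between a target tensor and an output tensor."""
    return float(np.sum((Y - O_hat) ** 2))

def last_step_loss(targets, outputs):
    """Eq. (9.44): one target Y_j per series, compared with the last output only."""
    return sum(squared_loss(Y_j, O_j[-1]) for Y_j, O_j in zip(targets, outputs))

def all_steps_loss(target_series, outputs):
    """Eq. (9.45): a target Y_jt at every time step of every series."""
    return sum(squared_loss(Y_jt, O_jt)
               for Y_j, O_j in zip(target_series, outputs)
               for Y_jt, O_jt in zip(Y_j, O_j))

lengths = [3, 5]                                           # T_j differs across series
outputs = [np.zeros((T_j, 2, 2)) for T_j in lengths]       # toy network outputs
final_targets = [np.ones((2, 2)) for _ in lengths]
step_targets = [np.ones((T_j, 2, 2)) for T_j in lengths]

loss_last = last_step_loss(final_targets, outputs)
loss_all = all_steps_loss(step_targets, outputs)
```

Each 2 × 2 step contributes a squared loss of 4 here, so the last-step loss sums two such terms and the all-steps loss sums eight.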

9.9.3

Recurrent BP Algorithm

The major difference between our newly proposed TRNNs and classic RNNs lies in the mapping of data. In TRNNs, the Tucker decomposition [8] is applied to process the multidimensional data directly, whereas the classic RNNs vectorise the tensorial data in order to analyse and process them. The Tucker mapping within TRNNs is denoted as follows:

$$M_t^{\alpha} = H_{t-1} \times_1 W_{\alpha 1} \times_2 W_{\alpha 2} \times_3 \cdots \times_D W_{\alpha D} + X_t \times_1 U_{\alpha 1} \times_2 U_{\alpha 2} \times_3 \cdots \times_D U_{\alpha D} + B_{\alpha}$$

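where $\alpha$ represents $f$, $i$, $o$, $c$, $r$, $z$ or $h$. For $\alpha = h$, $H_{t-1}$ should be substituted with $\hat{R}_t$, as in Eq. (9.39). To begin with, the results in [8] are introduced without proof.

Theorem 1 The total sizes of the tensors $M_t^{\alpha}$ and $H_{t-1}$ are denoted by $|M|$ and $|H|$, respectively. Then

$$\left[\frac{\partial M_t^{\alpha}}{\partial H_{t-1}}\right]_{|M| \times |H|} = W_{\alpha D} \otimes \cdots \otimes W_{\alpha 1} \tag{9.46}$$

where the matrix form is used and $\otimes$ is the Kronecker product of matrices. In addition,

$$\frac{\partial M_{t(d)}^{\alpha}}{\partial W_{\alpha d}} = \left[\left(W_{\alpha D} \otimes \cdots \otimes W_{\alpha (d+1)} \otimes W_{\alpha (d-1)} \otimes \cdots \otimes W_{\alpha 1}\right) H_{(t-1)(d)}^{T}\right] \otimes I_{h_d \times h_d} \tag{9.47}$$

$$\frac{\partial M_{t(d)}^{\alpha}}{\partial U_{\alpha d}} = \left[\left(U_{\alpha D} \otimes \cdots \otimes U_{\alpha (d+1)} \otimes U_{\alpha (d-1)} \otimes \cdots \otimes U_{\alpha 1}\right) X_{t(d)}^{T}\right] \otimes I_{x_d \times x_d} \tag{9.48}$$

$$\left[\frac{\partial M_t^{\alpha}}{\partial B_{\alpha}}\right]_{|M| \times |M|} = I_{|M| \times |M|} \tag{9.49}$$

where the subscript $(d)$ means unfolding a tensor along mode $d$ into a matrix, and $h_d$ and $x_d$ are the sizes of mode $d$ of the tensors $H_{t-1}$ and $X_t$, respectively.

First, the BP algorithm is derived for each stage of information processing in TLSTM. Figure 9.8 visualises the computation flow of TLSTM given in Eqs. (9.30)–(9.35). Across time, the information of $C_t$ is passed to the next TLSTM stage, while the tensorial hidden node $H_t$ is forwarded into the immediately following phase. In addition, $H_t$ feeds an extra layer for matching the response data $Y_t$ at time $t$ in shape or structure. Consequently, the BP algorithm for a TLSTM unit receives two pieces of information: one from the following TLSTM unit, represented by $\partial \ell / \partial H_t$, and one from the output loss $\ell_t$, represented by $\partial \ell^{o} / \partial H_t$. The aggregated derivative information used for further back propagation is therefore

$$\frac{\partial \ell}{\partial_h H_t} = \frac{\partial \ell}{\partial H_t} + \frac{\partial \ell^{o}}{\partial H_t} \tag{9.50}$$

When $\partial \ell^{o} / \partial H_t = 0$, no response $Y_t$ exists for matching.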
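Equation (9.46) of Theorem 1 can be checked numerically on a small example. In the sketch below, vectorisation is taken in column-major order (first index fastest), which is an assumption about the convention behind the theorem; the bias and input terms are dropped because they do not depend on $H_{t-1}$, and the tensor sizes are arbitrary.

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """Mode-`mode` product of a tensor with a matrix."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

def vec(tensor):
    """Column-major vectorisation (first index varies fastest)."""
    return tensor.reshape(-1, order="F")

rng = np.random.default_rng(0)
H = rng.normal(size=(2, 3, 4))                     # plays the role of H_{t-1}, D = 3
W1 = rng.normal(size=(5, 2))
W2 = rng.normal(size=(6, 3))
W3 = rng.normal(size=(7, 4))

# Part of the Tucker mapping that depends on H: H x_1 W1 x_2 W2 x_3 W3.
M = mode_product(mode_product(mode_product(H, W1, 0), W2, 1), W3, 2)

# Eq. (9.46): the |M| x |H| derivative is the Kronecker product W3 (x) W2 (x) W1.
jacobian = np.kron(W3, np.kron(W2, W1))
reconstructed = jacobian @ vec(H)                  # equals vec(M) because the map is linear
```

Since the mapping is linear in $H_{t-1}$, agreement between `reconstructed` and `vec(M)` confirms that the Kronecker product is the derivative in matrix form.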

Fig. 9.8 An illustration of tensorial LSTM computation flow [1]

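Based on the computation flows from $C_{t-1}$, $F_t$, $\hat{C}_t$, $I_t$, and $O_t$ to $C_t$ and $H_t$, respectively, the chain rule gives the gradients of the loss function with respect to $C_{t-1}$, $F_t$, $\hat{C}_t$, $I_t$, and $O_t$:

$$\frac{\partial \ell}{\partial C_{t-1}} = \left(\frac{\partial \ell}{\partial C_t} + \sigma_h'(C_t) \circ \frac{\partial \ell}{\partial_h H_t}\right) \circ F_t \tag{9.51}$$

$$\frac{\partial \ell}{\partial F_t} = \left(\frac{\partial \ell}{\partial C_t} + \sigma_h'(C_t) \circ \frac{\partial \ell}{\partial_h H_t}\right) \circ C_{t-1} \tag{9.52}$$

$$\frac{\partial \ell}{\partial \hat{C}_t} = \left(\frac{\partial \ell}{\partial C_t} + \sigma_h'(C_t) \circ \frac{\partial \ell}{\partial_h H_t}\right) \circ I_t \tag{9.53}$$

$$\frac{\partial \ell}{\partial I_t} = \left(\frac{\partial \ell}{\partial C_t} + \sigma_h'(C_t) \circ \frac{\partial \ell}{\partial_h H_t}\right) \circ \hat{C}_t \tag{9.54}$$

$$\frac{\partial \ell}{\partial O_t} = \sigma_h(C_t) \circ \frac{\partial \ell}{\partial_h H_t} \tag{9.55}$$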

For the computation, two operators are introduced. The first is vectorisation, $\mathrm{Vec}(\cdot)$, with $m = \mathrm{Vec}(M)$ for a tensor $M$. The second inverts the vectorisation, $\mathrm{iVec}(\cdot)$, so that $\mathrm{iVec}(m) = M$ for a tensor $M$. To compute the gradient with respect to $H_{t-1}$, the following equation is obtained:


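$$\begin{aligned}
\frac{\partial \ell}{\partial H_{t-1}} ={} & \mathrm{iVec}\left(\mathrm{Vec}\left(\frac{\partial \ell}{\partial F_t} \circ \sigma_g'\left(M_t^{f}\right)\right)^{T} \frac{\partial M_t^{f}}{\partial H_{t-1}}\right) \\
& + \mathrm{iVec}\left(\mathrm{Vec}\left(\frac{\partial \ell}{\partial \hat{C}_t} \circ \sigma_c'\left(M_t^{c}\right)\right)^{T} \frac{\partial M_t^{c}}{\partial H_{t-1}}\right) \\
& + \mathrm{iVec}\left(\mathrm{Vec}\left(\frac{\partial \ell}{\partial I_t} \circ \sigma_g'\left(M_t^{i}\right)\right)^{T} \frac{\partial M_t^{i}}{\partial H_{t-1}}\right) \\
& + \mathrm{iVec}\left(\mathrm{Vec}\left(\frac{\partial \ell}{\partial O_t} \circ \sigma_g'\left(M_t^{o}\right)\right)^{T} \frac{\partial M_t^{o}}{\partial H_{t-1}}\right)
\end{aligned} \tag{9.56}$$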

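Using a similar approach, the following gradients are obtained at time $t$:

$$\left[\frac{\partial \ell}{\partial W_{\alpha d}}\right]_t = \mathrm{iVec}\left(\mathrm{Vec}\left(\frac{\partial \ell}{\partial A_t} \circ \sigma_g'\left(M_t^{\alpha}\right)\right)^{T} \frac{\partial M_{t(d)}^{\alpha}}{\partial W_{\alpha d}}\right) \tag{9.57}$$

$$\left[\frac{\partial \ell}{\partial U_{\alpha d}}\right]_t = \mathrm{iVec}\left(\mathrm{Vec}\left(\frac{\partial \ell}{\partial A_t} \circ \sigma_g'\left(M_t^{\alpha}\right)\right)^{T} \frac{\partial M_{t(d)}^{\alpha}}{\partial U_{\alpha d}}\right) \tag{9.58}$$

$$\left[\frac{\partial \ell}{\partial B_{\alpha}}\right]_t = \mathrm{sum}\left(\frac{\partial \ell}{\partial A_t} \circ \sigma_g'\left(M_t^{\alpha}\right)\right) \tag{9.59}$$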

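where $(A, \alpha) = (F, f)$, $(\hat{C}, c)$, $(I, i)$ or $(O, o)$, and $\mathrm{sum}(\cdot)$ means adding all the elements together. For $(\hat{C}, c)$, the activation derivative is that of $\sigma_c$ rather than $\sigma_g$, since the candidate state in Eq. (9.33) uses $\sigma_c$. Ultimately, the overall derivatives for all the parameters are

$$\frac{\partial \ell}{\partial W_{\alpha d}} = \sum_{t=1}^{T} \left[\frac{\partial \ell}{\partial W_{\alpha d}}\right]_t \tag{9.60}$$

$$\frac{\partial \ell}{\partial U_{\alpha d}} = \sum_{t=1}^{T} \left[\frac{\partial \ell}{\partial U_{\alpha d}}\right]_t \tag{9.61}$$

$$\frac{\partial \ell}{\partial B_{\alpha}} = \sum_{t=1}^{T} \left[\frac{\partial \ell}{\partial B_{\alpha}}\right]_t \tag{9.62}$$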

In terms of the loss functions in Eqs. (9.42) and (9.43), it should first be noted that $\partial \ell / \partial H_t = 0$ initially. For the loss function in Eq. (9.43), the computation of $\partial \ell^{o} / \partial H_t$ is based on the form of the layer specified and the cost function for $Y_t$ ($t = 1, 2, \ldots, T$), which are the dependent data. As for the loss function in Eq. (9.42), the gradient of the loss with respect to the hidden node, $\partial \ell^{o} / \partial H_T$, is only available at $t = T$ and is 0 otherwise. Thus, Eq. (9.50) can be computed accordingly.


Hence, the derivative BP algorithm for TLSTM can be summarised as Algorithm 1 for the case of a single training series.

Algorithm 1 The Derivative BP Algorithm for TLSTM (for a single training data)

In the case of TGRU, following the computation flow diagram in Fig. 9.9, which visualises Eqs. (9.36)–(9.40), we derive the BP algorithm using an approach similar to that for TLSTM. The information back-propagated from time step $t+1$ is $\partial \ell / \partial_h H_t$, which gives

$$\frac{\partial \ell}{\partial \hat{H}_t} = (1 - Z_t) \circ \frac{\partial \ell}{\partial_h H_t} \tag{9.63}$$

$$\frac{\partial \ell}{\partial Z_t} = \left(H_{t-1} - \hat{H}_t\right) \circ \frac{\partial \ell}{\partial_h H_t} \tag{9.64}$$

Fig. 9.9 An illustration of tensorial-GRU computation flow [1]

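$$\frac{\partial \ell}{\partial R_t} = H_{t-1} \circ \mathrm{iVec}\left(\mathrm{Vec}\left(\frac{\partial \ell}{\partial \hat{H}_t} \circ \sigma_h'\left(M_t^{h}\right)\right)^{T} \frac{\partial M_t^{h}}{\partial \hat{R}_t}\right) \tag{9.65}$$

$$\begin{aligned}
\frac{\partial \ell}{\partial H_{t-1}} ={} & \mathrm{iVec}\left(\mathrm{Vec}\left(\frac{\partial \ell}{\partial R_t} \circ \sigma_g'\left(M_t^{r}\right)\right)^{T} \frac{\partial M_t^{r}}{\partial H_{t-1}}\right) \\
& + \mathrm{iVec}\left(\mathrm{Vec}\left(\frac{\partial \ell}{\partial Z_t} \circ \sigma_g'\left(M_t^{z}\right)\right)^{T} \frac{\partial M_t^{z}}{\partial H_{t-1}}\right) + Z_t \circ \frac{\partial \ell}{\partial_h H_t}
\end{aligned} \tag{9.66}$$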

Similarly to Eqs. (9.57)–(9.59), we simply substitute $(A, \alpha)$ by $(R, r)$, $(Z, z)$ and $(\hat{H}, h)$ to obtain the gradients of the loss with respect to the three sets of weights $W$ and $U$. With all the information and equations above, we summarise the derivative BP algorithm of TGRU in Algorithm 2. Finally, when there is a regulariser as introduced in Sect. 9.9.2, we only need to add the derivatives of the regulariser to the BP-calculated derivatives.


Algorithm 2 The Derivative BP Algorithm for TGRU (for a single training data)

9.10

Experimental Results

This section presents the results of the empirical study and the simulation study. In the experiments, we selected models which are either widely applied or state-of-the-art. The empirical study compares the performance of our newly proposed models, TLSTM and TGRU as TRNNs, with that of the benchmark models, the classic LSTM and GRU. The empirical dataset is required to be multidimensional time series data which are also spatially interdependent. Motivated by [20], the empirical study is set within the context of international relationship studies and analyses the actions, i.e., the relationships, among countries under two additional types of dependence: reciprocity and transitivity. Reciprocity means that if Country A has a strong positive or negative action towards Country B, it is possible for Country B to also have a strong positive or negative behaviour towards Country A. Transitivity means that if Countries C and D are both correlated with Country E, Countries C and D are likely to be correlated with each other. The first dataset is chosen from the Integrated Crisis Early Warning System (ICEWS).¹ It consists of weekly international relationship network data on the actions among 25 countries from 2004 to mid-2014. Neural networks have not been applied in international relationship studies within the existing literature by other authors; see [1]. The second dataset is the MSCOCO dataset [21] for the image captioning task, to which TLSTM and TGRU are applied.

The simulation study explores the performance of the proposed TLSTM and TGRU by comparing them with the classic vector LSTM [4], the classic vector GRU [5], and the new model SRU [18], in terms of the convergence speed of the training cost and the forecast accuracy as expressed by the test error. For all the mentioned models, robustness to noise in the data is also considered.

9.10.1

Empirical Study with International Relationship Data

The empirical study design includes the description of the collected data (the data collection approach and the data transformation), the settings of the models used to answer the empirical research question, and the performance evaluation criteria. Before designing the empirical study, the research question should be clarified. The general question is to predict the relationships among the 25 countries across time. As TRNNs are implemented to capture the temporal relation among the data, they are suitable for predicting the relationships among the countries. Thus, the empirical study should be designed accordingly, especially the settings of the study and the performance evaluation criteria, where the corresponding classic models should be applied as benchmarks.

Data Description

As mentioned above, the dataset used in this empirical study is collected from ICEWS, which is available on the Internet. It is the same dataset as that used in [20]. The data cover the weekly actions among the 25 countries from 2004 to mid-2014. There are four types of actions within the dataset: material cooperation, material conflict, verbal cooperation, and verbal conflict. In the dataset, there are two time series of tensors, $X$ and $Y$. At each time step $t$, $Y_t$ is the one-step-ahead version of $X_t$, where $X_t$ and $Y_t$ are two 3D tensors, $X_t \in \mathbb{R}^{25 \times 25 \times 4}$ and $Y_t \in \mathbb{R}^{25 \times 25 \times 4}$, covering the 25 countries and the four different types of actions. Therefore, for the inference and the prediction, $X_t$ are the inputs. There are also two types of dependencies, reciprocity and transitivity, in international relationship data and even in social network data. Therefore, the collected data are organised in the following three distinct cases to scrutinise the patterns of the relations among the 25 countries.

¹ http://www.lockheedmartin.com/us/products/W-ICEWS/iData.html


Case I: As in [20], the overall dataset is organised as $\{(X_t, Y_t)\}_{t=1}^{543}$, where $Y_t$ is a 3D target tensor at time $t$ and $X_t$ is its lagged version; in other words, $Y_t = X_{t+1}$.

Case II: This case scrutinises the actions and the reciprocity. While $Y_t$ is still the same as in Case I, the explanatory tensor is $X_t \in \mathbb{R}^{25 \times 25 \times 8}$, within which $X_t(i, j, k) = X_t(j, i, k-4)$ for $k = 5, 6, 7, 8$. In other words, the frontal slices corresponding to $k = 5, 6, 7, 8$ are generated by transposing slices $k = 1, 2, 3, 4$, respectively.

Case III: This case is intended to explore the transitivity, the reciprocity, and the actions among the 25 countries. The explanatory tensor is constructed as an extension of $X_t$ in Case II, such that $X_t \in \mathbb{R}^{25 \times 25 \times 12}$, within which

$$X_t(i, j, k) = \sum_{l=1}^{25} \left(X_t(i, l, k-8) + X_t(l, i, k-8)\right)\left(X_t(j, l, k-8) + X_t(l, j, k-8)\right), \quad k = 9, 10, 11, 12.$$
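The construction of the explanatory tensors in Cases II and III can be sketched with NumPy as below. The function names are hypothetical and a random tensor stands in for one week of ICEWS data; array indices are zero-based, so slices $k = 1, \ldots, 4$ in the text correspond to indices 0 to 3. Missing values are replaced by 0 first, as the chapter describes for the real data.

```python
import numpy as np

def clean(X):
    """Replace missing values (NaN) by 0."""
    return np.nan_to_num(X, nan=0.0)

def add_reciprocity(X4):
    """Case II: append transposed slices, X[i, j, k] = X[j, i, k - 4] for k = 5..8."""
    return np.concatenate([X4, np.transpose(X4, (1, 0, 2))], axis=2)

def add_transitivity(X8):
    """Case III: append slices k = 9..12 built from the original slices k - 8 = 1..4."""
    base = X8[:, :, :4]
    S = base + np.transpose(base, (1, 0, 2))       # S[i, l, k] = X[i, l, k] + X[l, i, k]
    trans = np.einsum("ilk,jlk->ijk", S, S)        # sum over l of S[i, l, k] * S[j, l, k]
    return np.concatenate([X8, trans], axis=2)

rng = np.random.default_rng(0)
X_case1 = rng.normal(size=(25, 25, 4))             # stand-in for one weekly tensor
X_case1[0, 1, 2] = np.nan                          # pretend one entry is missing

X_case1 = clean(X_case1)
X_case2 = add_reciprocity(X_case1)                 # shape (25, 25, 8)
X_case3 = add_transitivity(X_case2)                # shape (25, 25, 12)
```

The first eight frontal slices of `X_case3` are exactly those of `X_case2`, and the transitivity slices are symmetric in the two country indices by construction.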

Note that the first eight frontal slices of $X_t$ in Case III are unchanged from Case II. Missing values (NaN) in the dataset are replaced by 0. In all three cases, the data $X_t$ and $Y_t$ are normalised at all time steps. For all three cases, the overall dataset $\{(X_t, Y_t)\}_{t=1}^{543}$ is split into two sets: the data of the first 400 weeks, $\{(X_t, Y_t)\}_{t=1}^{400}$, as the training set and the data of the last 143 weeks, $\{(X_t, Y_t)\}_{t=401}^{543}$, as the test set for prediction with TRNNs, the classic LSTM and GRU. Note that before the real forecasting task, a grid search is conducted to find the optimal layer size, in which the training data are used as the whole dataset. The data of the first 400 weeks are split into two parts, $\{(X_t, Y_t)\}_{t=1}^{380}$ as the training set and $\{(X_t, Y_t)\}_{t=381}^{400}$ as the validation set. These two datasets are also used in the grid search to obtain the optimal regularisation term, the sparsity penalty term, and the optimal starting learning rate for TRNNs and their benchmark models.

Empirical Study Settings

For the prediction task, the proposed TLSTM and TGRU are applied. For TLSTM, as mentioned above, there are three nodes: the input node, the hidden node, and the cell node. The size of the input node is $25 \times 25 \times 4$ (or 8 or 12) for Case I, Case II or Case III, respectively. The size of the hidden node is $50 \times 50 \times 4$, $20 \times 20 \times 8$, and $20 \times 20 \times 12$ for the three cases, respectively. The cell node has the same size as the hidden node. For TGRU, the input nodes have the same sizes as for TLSTM in all three cases. The hidden nodes are $50 \times 50 \times 4$, $16 \times 16 \times 8$, and $18 \times 18 \times 12$ for Case I, Case II, and Case III, respectively. These sizes are also obtained from the grid search based on the validation error. The forecasts of the two TRNNs are both many-to-one.
In speciﬁc, with the data from the last month which are the explanatory data Xt, Xt + 1, Xt + 2, and Xt + 3, the forecast of the next week Yt + 3 is obtained as the one-step-ahead forecast. Before the BP in the training process of the model in the prediction task, the weights are initialised. In the initialisation, the value of each element in each weight matrix is randomly drawn from the standard normal distribution. In addition, since


the loss function in TNN includes the regularisation term λ and the sparsity penalty terms β and ρ, the grid search mentioned above is conducted. For TLSTM, the proper λ's are 0.0050, 0.0031 and 0.0031 for Case I, Case II and Case III, respectively, with β = 0, so that ρ is effectively not imposed. For TGRU, the λ's are 0.0050, 0.0361 and 0.0361 with ρ = 0 for the three cases, respectively. The optimisation method selected to train TRNNs is stochastic gradient descent (SGD) with mini-batches. For TRNNs to capture the monthly relation and the weekly relation among the countries, the length of each batch of time series is 4. The numbers of iterations and evaluations are 1000 and 2000, respectively.

Performance Evaluation Criteria The evaluation criterion for the performance of TRNNs and their benchmark models is the prediction accuracy. For TRNNs, it is equivalent to the test error defined as

$$\ell = \frac{1}{143}\sum_{t=401}^{543} \ell(t) = \frac{1}{143}\sum_{t=401}^{543} \ell\big(Y_t, \hat{O}_t\big) = \frac{1}{143}\sum_{t=401}^{543} \big\| Y_t - \hat{O}_t \big\|_F^2 \tag{9.67}$$

where $\hat{O}_t$ is the output from the many-to-one TRNNs given $X_{t-3}$, $X_{t-2}$, $X_{t-1}$, and $X_t$. Thus, it is actually the forecast $\hat{Y}_t$.

Empirical Results and Findings The results of forecasting the international relations in the ICEWS data with TRNNs, LSTM and GRU are shown in Table 9.1, with the forecast accuracy assessed by the test error. From the table we can observe that the more dependency information is input to the model, the more accurate the forecast is for both TLSTM and TGRU. With more data, as in Case II and Case III, TLSTM has better forecast accuracy than TGRU and the benchmark models, since it has more parameters than TGRU when the other conditions are kept constant. Nevertheless, both TRNNs perform satisfactorily in forecasting, since their test errors are small relative to the input data, most of which range from −1 to 1, and relative to the errors of their benchmark models.

Table 9.1 Test error for TLSTM and TGRU for prediction

            Classic LSTM   Classic GRU   TLSTM    TGRU
Case I      1.2032         1.0017        0.0713   0.0416
Case II     0.8096         0.9353        0.0304   0.0395
Case III    0.6759         0.7802        0.0296   0.0343

The best-performing values, with the least forecast error, are 0.0416 (TGRU) for Case I, 0.0304 (TLSTM) for Case II, and 0.0296 (TLSTM) for Case III.
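The many-to-one windowing and the test error of Eq. (9.67) can be sketched as follows. The helper names `make_windows` and `frobenius_test_error` are hypothetical, the data are a small random stand-in, and a naive persistence forecast replaces the trained TRNN, so the resulting number only illustrates the computation.

```python
import numpy as np

def make_windows(xs, ys, length=4):
    """Pair each run of `length` consecutive inputs with the target at the last step."""
    windows = np.stack([xs[t - length + 1 : t + 1] for t in range(length - 1, len(xs))])
    targets = ys[length - 1 :]
    return windows, targets

def frobenius_test_error(y_true, y_pred):
    """Mean over time steps of the squared Frobenius norm of the error tensors."""
    diff = (y_true - y_pred).reshape(len(y_true), -1)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

rng = np.random.default_rng(0)
series = rng.standard_normal((11, 3, 3, 2))  # toy stand-in for the weekly tensors (assumption)
xs, ys = series[:-1], series[1:]             # Y_t = X_{t+1}
windows, targets = make_windows(xs, ys, length=4)
persistence = windows[:, -1]                 # naive forecast: repeat the last input
error = frobenius_test_error(targets, persistence)
```

In the study itself the forecast would come from the trained TLSTM or TGRU applied to each window of four weeks; the persistence forecast is used here only to keep the sketch self-contained.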


9.10.2 Empirical Study with MSCOCO Data

We apply our TLSTM and TGRU models to the image captioning task. In this experiment, we used the MSCOCO dataset [21], which consists of 164,042 pairs of images and English sentences describing these images. The numbers of training, validation, and test samples are set to 82,783, 40,504, and 40,775, respectively. Our experiments are based on a deep model similar to the one described in [21]. The only difference is that we adopt ResNet101 [22] for image feature extraction and replace its original last pooling layer, fully connected layer, and softmax function with a 1 × 1 × 512 convolutional layer and our proposed TLSTM for the sentence generation. The detailed framework structure can be found in Fig. 9.10. We treat this image translation task in a probabilistic framework, in which the entire training pipeline is designed to maximise the following objective:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta) \tag{9.68}$$

where θ are the parameters to be trained, I is the training image, and S is the correct transcription of I. As the sentence S can be of any form (i.e. of variable length), it is common to apply the chain rule to model the joint probability over $S_0, \dots, S_T$:

Fig. 9.10 The entire framework for image captioning task. The ResNet image features and the embedded words $W_e S_0, \dots, W_e S_{T-1}$ are fed to a chain of TensorialLSTM units, which output $\log p(S_1), \dots, \log p(S_T)$ for the caption words (e.g. "An airplane ... sky.")


$$\log p(S \mid I; \theta) = \sum_{t=0}^{T} \log p(S_t \mid I, S_0, \dots, S_{t-1}) \tag{9.69}$$

where T is the length of S. Our TLSTM model is trained to predict each word of the sentence after it has seen the image as well as all preceding words, as defined by $\log p(S_t \mid I, S_0, \dots, S_{t-1})$. More specifically, given an input image I and its corresponding caption $S = (S_0, \dots, S_T)$, the recurrent connections are calculated by

$$X_t = \mathrm{CNN}(I) \tag{9.70}$$

$$X_t = W_e S_t, \quad t \in \{0, \dots, T\} \tag{9.71}$$

$$p_{t+1} = \mathrm{TLSTM}(X_t), \quad t \in \{0, \dots, T\} \tag{9.72}$$

where $W_e$ is the word embedding weight and $S_t$ is a one-hot vector. The loss function is the sum of the negative log likelihood of the correct word at each step:

$$J(I, S) = -\sum_{t=1}^{T} \log p_t(S_t) \tag{9.73}$$

Note that Eq. (9.73) is minimised with respect to all the parameters of our framework, including the ResNet101 model, the proposed tensorial model, and the word embeddings $W_e$. Similarly, we also test our proposed TGRU model in this experiment by replacing only the TLSTM part with TGRU. The corresponding equation for the prediction of the next word in the sentence, $p_{t+1}$, is defined as

$$p_{t+1} = \mathrm{TGRU}(X_t), \quad t \in \{0, \dots, T\} \tag{9.74}$$
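As a small worked illustration of the loss in Eq. (9.73), the sketch below computes the summed negative log likelihood of one caption from per-step scores. It assumes, for illustration only, that the recurrent unit emits one vector of unnormalised scores (logits) over the vocabulary per step; the names `log_softmax` and `caption_nll` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def caption_nll(step_logits, word_ids):
    """J(I, S) = -sum_t log p_t(S_t) for one caption.

    step_logits: array of shape (T, V), one row of vocabulary scores per step.
    word_ids:    integer array of shape (T,), index of the correct word per step.
    """
    log_probs = log_softmax(step_logits)
    picked = log_probs[np.arange(len(word_ids)), word_ids]
    return float(-picked.sum())

# With uniform scores over a vocabulary of 4 words, every step costs log(4).
uniform_loss = caption_nll(np.zeros((3, 4)), np.array([0, 1, 2]))
```

Minimising this quantity over all training pairs is equivalent to maximising the log likelihood in Eq. (9.68), since the chain rule in Eq. (9.69) turns the sentence probability into a sum of per-word terms.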

The ResNet101 model is pre-trained on the ImageNet dataset. We then fine-tune its last two convolutional blocks with a learning rate of 0.0001 and train our proposed TLSTM and TGRU models using SGD with an initial learning rate of 0.01 and a weight decay of 0.95 every 2000 iterations. Each tensorial unit shares the same weights. We used 512 dimensions for the embeddings and 1024 dimensions for the size of the LSTM and GRU memory. We implement our algorithm in Python 2.7 and conduct our experiments on a server with an i7 7700K CPU, 32 GB of memory, and two NVIDIA 1080 GPUs with 8 GB of memory each.

Quantitative Analysis To evaluate our proposed model, we follow the method in [21] and use three different criteria. We first applied BLEU-4, which uses a modified form of precision to compare a generated sentence against multiple reference sentences. In addition, we also report our scores on METEOR and CIDEr, which test for alignment with ground truths and human consensus, respectively, and therefore capture improvements to caption quality beyond the BLEU metrics. We compared our proposed tensorial models with two state-of-the-art models: Show and Tell [21] and Bayesian RNN [23]. We also reference three benchmarks to evaluate the performance of our tensorial models. The first benchmark is a random generation of words from the vocabulary until the end-of-sentence token is emitted. The second benchmark is a nearest neighbour approach, which compares image vectors and returns the caption of the closest image. The last benchmark is the human captioning results originally reported in [21]. The corresponding results are shown in Table 9.2. Both tensorial models score higher than the human captions on all three metrics. In addition, they improve on the two state-of-the-art models in every evaluation metric.

Table 9.2 Comparison of different models on MSCOCO image captioning task

Method               BLEU-4   METEOR   CIDEr
Show and Tell        28.8     23.2     89.8
Bayesian RNN         30.2     24.0     96.0
TLSTM                30.91    25.28    97.73
TGRU                 30.86    25.33    98.13
Random               4.6      9.0      5.1
Nearest Neighbour    9.9      15.7     36.5
Human                21.7     25.2     85.5

Fig. 9.11 Examples of evaluation results based on human rating

Qualitative Analysis Figure 9.11 provides a sample of generated captions on test images. We classified all the outputs into three groups: Correct, Partially Correct,

and Wrong. Captions with a minor error, or captions that are technically correct but miss the point of the image, are labelled as Partially Correct. For instance, some captions miscount the objects or contain fine-grained errors but still represent the information relatively clearly, as shown in Fig. 9.11. In the 'Wrong' category, the captions fail to express the major content of the image, i.e., they fail to recognise the key objects or misunderstand the scene. Examples of the image captions can be found in Fig. 9.12. In these examples, our tensorial models outperform the others. For instance, in Fig. 9.12a our proposed tensorial models are able to recognise the major content "young girl" and "eating", whereas the others only recognise a "young girl" but fail to recognise "eating". For images with a complex background, such as Fig. 9.12b, the benchmark models fail to express the major content, whereas the proposed tensorial models are able to detect "man", "horse", and the verb "ride". In Fig. 9.12c, our proposed tensorial models not only recognise the major objects in the image but also learn the most suitable verb to link the major objects in a logical way. In Fig. 9.12d, our proposed tensorial models are able to recognise the teddy bear, whereas the comparison model can only recognise this teddy bear as a toy.

Fig. 9.12 Example caption results of our proposed tensorial models and comparison models

9.10.3 Simulation Study

This section presents the simulation study to assess the performance of TRNNs, the classic LSTM, the classic GRU and SRU in a regression task with spatial and temporal relations in the multimodal time series data. Since TLSTM and TGRU, as TRNNs, are newly proposed by us, we compare them with the classic vector LSTM [4], the classic vector GRU [5] and SRU [18] on the convergence speed of the training cost and on the forecast accuracy as expressed by the test error. For all the models, robustness to noise in the data is also considered.

Data Simulation Process In order to evaluate the performance of TRNNs, LSTM, GRU, and SRU on the analysis of multimodal time series data, the simulated data should be multi-dimensional and contain spatial inter-relationships, with noise added to test the robustness of the models. In addition, we assume that the input data are normally distributed, which is preferred by neural network models. With all the above considerations and the motivation from [20], the data simulation process for the first cluster of models is designed as follows. Firstly, based on the assumption on the input data, $X_0 \in \mathbb{R}^{25 \times 25 \times 4}$ is generated as a 3D tensor from the standard normal distribution. Subsequently, in order to obtain spatial dependency within the data, inspired by [20], reciprocity and transitivity are added to the data, where reciprocity means the general symmetry of the data on each frontal slice $k$ of the 3D tensor $X_0$, and transitivity means that if the elements $X_{i_1, i_3, k}$ and $X_{i_2, i_3, k}$ are non-zero, the element $X_{i_1, i_2, k}$ will be non-zero as well. Therefore, for reciprocity, four more frontal slices are added to the original 3D tensor $X_0$ along mode 3, where $X_{i_1, i_2, k} = X_{i_2, i_1, k-4}$ for $k = 5, \dots, 8$.
For transitivity, four frontal slices are also added along mode 3 to the 3D tensor, where

$$X_{i_1, i_2, k} = \sum_{i_3} \big(X_{i_1, i_3, k-8} + X_{i_3, i_1, k-8}\big)\big(X_{i_2, i_3, k-8} + X_{i_3, i_2, k-8}\big), \quad k = 9, \dots, 12.$$

In consequence, the initial sample $X_0$ has size 25 × 25 × 12. Subsequently, weighted pooling is conducted over the slices $k$, $k-4$ and $k-8$ ($k = 9, \dots, 12$) with the corresponding weights 0.4, 0.3 and 0.3, which are simulated from a uniform distribution and required to sum to 1. The initial $X_0$ is now again a 25 × 25 × 4 tensor and contains the reciprocity and transitivity spatial dependency on all slices. For the data in the later steps from $t = 1$, the generation process follows an autoregressive model as described below, where the noises added are shown in Table 9.3.


Table 9.3 The distribution of noises

Distribution                 Mean        Variance    Degree of Freedom
Chi-Squared Distribution     1           2           1
Student's T-Distribution     Undefined   Undefined   1
Normal Distribution          0           I           –

Table 9.4 Simulation study: the parameters found from grid search

                λ        ρ   β   H1
Classic LSTM    0.0100   0   0   1000 × 1
Classic GRU     0.0100   0   0   1000 × 1
SRU             0.0100   0   0   1000 × 1
TLSTM           0.0100   0   0   15 × 15 × 8
TGRU            0.0100   0   0   15 × 15 × 8

1. $X_1$ (chi2): $X_1 = X_0 \times_1 U_1 \times_2 U_2 \times_3 U_3 + B + \varepsilon_1$, $\varepsilon_1 \sim \chi^2(1)$;
2. $X_1$ (t): $X_1 = X_0 \times_1 U_1 \times_2 U_2 \times_3 U_3 + B + \varepsilon_1$, $\varepsilon_1 \sim t(1)$;
3. $X_1$ (normal): $X_1 = X_0 \times_1 U_1 \times_2 U_2 \times_3 U_3 + B + \varepsilon_1$, $\varepsilon_1 \sim N(0, I)$;

where $U_1 \in \mathbb{R}^{25 \times 25}$, $U_2 \in \mathbb{R}^{25 \times 25}$, $U_3 \in \mathbb{R}^{4 \times 4}$, and $B \in \mathbb{R}^{25 \times 25 \times 4}$ are simulated from the standard normal distribution. Then $X_t$ are simulated with the same method, where $X_t = X_{t-1} \times_1 U_1 \times_2 U_2 \times_3 U_3 + \varepsilon_t$ and $\varepsilon_t$ is drawn from the three distributions in Table 9.3, where $I$ is a tensor of the same size as the input data whose diagonal elements are all 1 and whose off-diagonal elements are all 0. In order to avoid exponential explosion, the length of the dataset is kept relatively small, with $t = 1, \dots, 49$. The dataset therefore contains both spatial and temporal relations. The obtained dataset $\{X_t\}_{t=1}^{49}$ is also separated into two sets, $\{X_t\}_{t=1}^{49}$ and $\{Y_t\}_{t=1}^{49}$, where $Y_t = X_{t+1}$. For the designed task, $\{X_t, Y_t\}_{t=1}^{40}$ is used for training and $\{X_t, Y_t\}_{t=41}^{49}$ for testing. With these data with different noises, the traditional neural network, the classic LSTM, the classic GRU, TLSTM, and TGRU conduct the forecast task on the test set, and their convergence speed in the training process is compared, in order to assess the performance of TLSTM and TGRU, including their robustness.

Parameter Generating Method for Models For all the models in both groups, the parameter values should be optimal in order to obtain the highest accuracy. Except for TNN-St, all the models have regularisation terms and penalty terms in the training cost function, in addition to the Euclidean distance between the true value and the output from the neural networks. Therefore, in order to determine the parameters λ, ρ, and β, a grid search is conducted to select the values that give the minimum test error. With the same method, the size of each hidden layer and the number of layers are also obtained for all the neural network models. Table 9.4 presents the parameters of each model, where H1 denotes the size of the first hidden layer.
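The autoregressive data simulation described above can be sketched with mode-n products written as a single `einsum`. This is an illustration under stated assumptions, not the authors' code: the names `mode_products` and `simulate_series` are hypothetical, the initial tensor is a plain standard normal draw (the reciprocity, transitivity and pooling steps are omitted), and only the Chi-squared noise case is run.

```python
import numpy as np

def mode_products(x, u1, u2, u3):
    """Compute X x_1 U1 x_2 U2 x_3 U3 for a 3D tensor X."""
    return np.einsum("abc,ia,jb,kc->ijk", x, u1, u2, u3, optimize=True)

def simulate_series(x0, u1, u2, u3, b, steps, noise, rng):
    """X_1 = X_0 x U + B + eps_1, then X_t = X_{t-1} x U + eps_t for t = 2, ..., steps."""
    x = mode_products(x0, u1, u2, u3) + b + noise(rng, x0.shape)
    series = [x]
    for _ in range(steps - 1):
        x = mode_products(x, u1, u2, u3) + noise(rng, x.shape)
        series.append(x)
    return np.stack(series)

# The three noise settings of Table 9.3, written with NumPy's Generator API.
noises = {
    "chi2": lambda rng, shape: rng.chisquare(1, size=shape),
    "t": lambda rng, shape: rng.standard_t(1, size=shape),
    "normal": lambda rng, shape: rng.standard_normal(shape),
}

rng = np.random.default_rng(0)
x0 = rng.standard_normal((25, 25, 4))  # simplified initial tensor (assumption)
u1 = rng.standard_normal((25, 25))
u2 = rng.standard_normal((25, 25))
u3 = rng.standard_normal((4, 4))
b = rng.standard_normal((25, 25, 4))
series = simulate_series(x0, u1, u2, u3, b, steps=49, noise=noises["chi2"], rng=rng)
```

Because the same factor matrices are applied at every step, the magnitude of the entries grows quickly with the step index, which is why the series is kept short.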


Performance Evaluation Criteria As mentioned in Sect. 9.1, the demand in recent years is for a model that is fast in training on multi-dimensional, large-volume data and accurate in inference and forecast. Therefore, the models should be assessed on the convergence speed of the training loss and on the forecast accuracy. The convergence speed is based on the iteration before which the training cost drops dramatically to a small number and after which it becomes generally stable. The forecast accuracy is based on the test error, which is computed by the following equation:

$$\ell = \frac{1}{41}\sum_{t=41}^{48} \big\| Y_t - f(X_t, W) \big\|_F^2 = \frac{1}{41}\sum_{t=41}^{48} \big\| Y_t - \hat{O}_t \big\|_F^2 \tag{9.75}$$

where $W$ is the set of all the weights $U$ in the model and $f(X_t, W)$ is the output of the models which estimate the weights by treating the data as uncorrelated in time. $\hat{O}_t$ is the output of the recurrent neural networks for forecast tasks. Note that if the models vectorise the multi-dimensional input data for processing, $Y_t$ should be substituted by $y_t$ with

$$y_t = \mathrm{Vec}(Y_t) \tag{9.76}$$

where $\mathrm{Vec}(\cdot)$ flattens the tensorial data input into a long vector along mode 1. Similarly, $\hat{O}_t$ is substituted with $\hat{o}_t$ for vector outputs from the classic LSTM and the classic GRU. Note that the accuracy of the models in group one, which are for inference and process the data as if they were temporally independent, is still assessed with the test error, owing to the nature of the model estimation and because this is common practice in machine learning.

Simulation Results and Analysis Before obtaining and analysing the simulation results, it is necessary to state the presumed performance of the applied RNN models. The classic LSTM, the classic GRU and SRU vectorise the multi-dimensional data to process and analyse them, whereas TLSTM and TGRU directly process the tensorial data, so that the spatial and temporal information are both preserved and analysed. Therefore, TLSTM and TGRU are expected to have better overall performance than the classic LSTM, the classic GRU and SRU. With the simulated data, the actual performance of the models is demonstrated as follows. Note that in the experiments, the models are trained for 1000 iterations in each run, and there are 10 runs. The convergence speed is based on the average of the training losses. The confidence interval of the mean training loss is also presented. If this confidence interval is narrow, the model is considered to be satisfactorily stable.

Performance of Models Designed for Forecast To assess the performance of TLSTM and TGRU, which are proposed by us to capture both temporal and spatial relations in the multimodal time series data, we compare these two newly proposed models with the other models: the classic LSTM, the classic GRU and SRU. As before, three cases are analysed, where the noises in the data follow the

Chi-squared distribution, the Student's t distribution, and the normal distribution, respectively. The convergence speed of the training loss and the test error for forecast accuracy are presented to evaluate the performance. In all the graphs of the training loss convergence in Figs. 9.13, 9.14, and 9.15, (a) is for the classic LSTM [4], (b) is for the classic GRU [5], and (c) is for SRU [18]. TLSTM and TGRU are presented in (d) and (e), respectively. The blue line is the mean training loss over the 10 runs with 1000 iterations each. The red dotted lines are the lower and upper bounds of the 95% confidence interval of the mean training loss. The graphs show only the iterations before and around the point where the training losses start to converge to a minimum value.

Fig. 9.13 Convergence of the training loss (noise in the Chi-squared distribution). Each panel (a)–(e) plots the training loss (vertical axis, 0 to 0.3) against the iteration (horizontal axis: 0 to 1000 in (a) and (b), 0 to 35 in (c), 0 to 30 in (d) and (e))
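The mean training loss and its 95% confidence band over repeated runs can be computed along the following lines. The source does not state how the interval was obtained, so the normal approximation below (mean ± 1.96 standard errors) is an assumption, as are the name `mean_with_ci` and the synthetic loss curves.

```python
import numpy as np

def mean_with_ci(losses, z=1.96):
    """Mean training loss per iteration with a normal-approximation confidence band.

    losses: array of shape (runs, iterations).
    Returns (mean, lower, upper), each of shape (iterations,).
    """
    runs = losses.shape[0]
    mean = losses.mean(axis=0)
    sem = losses.std(axis=0, ddof=1) / np.sqrt(runs)  # standard error of the mean
    return mean, mean - z * sem, mean + z * sem

rng = np.random.default_rng(0)
iterations = np.arange(1000)
# Synthetic stand-in: 10 runs of an exponentially decaying loss with small positive jitter.
losses = 0.25 * np.exp(-iterations / 5.0) + 0.01 * rng.random((10, 1000))
mean, lower, upper = mean_with_ci(losses)
```

With only 10 runs, a Student's t quantile (about 2.26 for 9 degrees of freedom) would give a slightly wider band than the 1.96 used here.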

Fig. 9.14 Convergence of the training loss (noise in Student's t distribution). Each panel (a)–(e) plots the training loss (vertical axis, 0 to 0.3) against the iteration (horizontal axis: 0 to 400 in (a) and (b), 0 to 35 in (c), 0 to 30 in (d) and (e))

As for the training loss convergence speed, we first present and analyse the case where the noise in the data is Chi-squared distributed. From Fig. 9.13, it can be observed that the confidence intervals of the two classic RNNs are both much wider than those of the TRNNs and SRU. In terms of the convergence speed, the training losses of the two classic RNNs do not generally converge even at the 1000th iteration. The fluctuations of the mean training losses for the classic RNNs are evident as well. However, for SRU, TLSTM, and TGRU, the mean training losses start to converge around the 15th iteration, where TGRU has a smoother convergence pattern than TLSTM. Therefore, in terms of the training loss convergence, SRU, TLSTM, and TGRU outperform the classic LSTM and GRU.

Fig. 9.15 Convergence of the training loss (noise in normal distribution). Each panel (a)–(e) plots the training loss (vertical axis, 0 to 0.3) against the iteration (horizontal axis: 0 to 1000 in (a) and (b), 0 to 35 in (c), 0 to 30 in (d) and (e))

Subsequently, from Fig. 9.14, when the noise is from the heavy-tailed Student's t distribution, the overall performance of SRU, TLSTM, and TGRU is still more satisfactory than that of the classic LSTM and GRU. The overall widths of the confidence intervals of the mean training loss for LSTM, GRU, TLSTM, and TGRU are generally the same, and the confidence interval for SRU is slightly narrower than the others. However, the mean training losses of the two classic RNNs start to decrease to a stable value only at approximately the 300th iteration, and there are slightly more fluctuations in the mean training loss of SRU. For TLSTM and TGRU, by contrast, this happens at around the 15th iteration. Hence, the performance of SRU and the newly proposed TRNNs is more favourable than that of the two classic RNNs with regard to the training loss convergence.

Finally, we consider the case where the noise is normally distributed. Figure 9.15 demonstrates that SRU, TLSTM, and TGRU have evidently narrower confidence intervals of the mean training loss than the classic LSTM and GRU. The iteration at which the training loss decays to a roughly minimum value is around 15 for SRU, TLSTM, and TGRU, while the classic LSTM and GRU still have not started to converge at the 1000th iteration. Accordingly, SRU, TLSTM, and TGRU outperform the classic LSTM and GRU in this case.

Table 9.5 Forecast error

Test error           Classic LSTM    Classic GRU     SRU             TLSTM           TGRU
Chi-Squared noise    0.0303          0.0316          4.4456 × 10⁻⁵   2.3706 × 10⁻⁵   3.1434 × 10⁻⁵
Student's T noise    5.3224 × 10⁻⁶   9.3369 × 10⁻⁶   1.1297 × 10⁻⁷   6.1100 × 10⁻⁸   8.0198 × 10⁻⁸
Gaussian noise       0.0342          0.0387          2.4632 × 10⁻⁵   1.3045 × 10⁻⁵   1.7403 × 10⁻⁵

The best-performing values, with the least forecast error, are those of TLSTM in every row.

For the accuracy of the prediction, as demonstrated in Table 9.5, TLSTM has the lowest test error of all the models in all three cases, and TGRU ranks second, with marginally larger test errors than TLSTM. SRU performs worse only than TLSTM and TGRU. It is notable that, for the noises in the Chi-squared distribution and the normal distribution, the test errors of TLSTM and TGRU are about three orders of magnitude smaller than those of the two classic RNNs.

In summary, the actual performance of the newly proposed TLSTM and TGRU in this chapter is consistent with the presumed performance. With regard to the convergence of the training loss, SRU, TLSTM, and TGRU converge faster and have an overall narrower confidence interval of the mean training loss than the classic LSTM and GRU. Therefore, the training process of our newly proposed methods is more stable and efficient than that of the existing classic models and similar to that of the state-of-the-art existing model. It can also be concluded that the computational efficiency of the newly proposed methods is higher than that of the classic models and no worse than that of the state-of-the-art model. In terms of the accuracy of the prediction, TLSTM and TGRU outperform the classic LSTM, the classic GRU and SRU in all three noise cases, with much smaller test errors than the classic models.
Thus, the newly proposed TLSTM and TGRU are generally robust to skewed or heavy-tailed noise.

9.11 Conclusion

In this chapter, the representative RNN models and the classic autoregressive model for multimodal time series data analysis are presented and discussed. These models are the most basic RNNs (Elman RNN and Jordan RNN), the classic LSTM, the classic GRU, the autoregressive model, NARX, ESN, SRU, and our newly proposed models, TRNNs (TLSTM and TGRU). Except for TRNNs, which directly process the multimodal time series data, the models in this chapter vectorise the multi-dimensional data to analyse them and thus lose the structural spatial information. Three experiments are conducted on LSTM, GRU, SRU, and TRNNs with international relations data, image data, and simulated data with noise in three different cases: high skewness, heavy tail, and Gaussian. The results of the experiments reveal that the loss of the structural spatial information can cause slow training and/or low forecast accuracy, as TRNNs outperform all the other models in general and succeed in capturing the spatial and temporal information in multimodal time series data. Thus, it is reasonable to conclude that for multimodal time series data processing and analysis, it is preferable to have the multimodal mapping in the layers of RNNs. Since the performance of SRU is generally more favourable than that of the other models except TLSTM and TGRU, one possible direction for future work is to develop a tensorial SRU to obtain an even faster training process and accurate forecasts.

References

1. Bai, M., Zhang, B., Gao, J.: Tensorial Recurrent Neural Networks for Longitudinal Data Analysis (2017). http://arxiv.org/abs/1708.00185
2. Elman, J.L.: Finding structure in time. Cogn. Sci. 14(2), 179–211 (1990)
3. Kolda, T.G.: Multilinear Operators for Higher-Order Decompositions. Technical report, Sandia National Laboratories (2006)
4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
5. Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder–decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Association for Computational Linguistics, Doha, Qatar (2014). http://www.aclweb.org/anthology/D14-1179
6. Goudarzi, A., Banda, P., Lakin, M.R., Teuscher, C., Stefanovic, D.: A comparative study of reservoir computing for temporal signal processing. arXiv preprint arXiv:1401.2224 (2014)
7. Jordan, M.I.: Serial order: a parallel distributed processing approach. Adv. Psychol. 121, 471–495 (1997)
8. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
9. Zhang, S., Wu, Y., Che, T., Lin, Z., Memisevic, R., Salakhutdinov, R.R., Bengio, Y.: Architectural complexity measures of recurrent neural networks. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 1822–1830. Curran Associates (2016). http://papers.nips.cc/paper/6303-architectural-complexity-measures-of-recurrent-neural-networks.pdf
10. Bengio, Y., Simard, P., Frasconi, P.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
11. Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München (1991)
12. Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J.: Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In: Kremer, S.C., Kolen, J.F. (eds.) A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press (2001)
13. Gers, F.A., Schmidhuber, E.: LSTM recurrent networks learn simple context-free and context-sensitive languages. Trans. Neural Netw. 12(6), 1333–1340 (2001). https://doi.org/10.1109/72.963769
14. Hochreiter, S., Heusel, M., Obermayer, K.: Fast model-based protein homology detection without alignment. Bioinformatics 23(14), 1728–1736 (2007)
15. Chen, K., Zhou, Y., Dai, F.: A LSTM-based method for stock returns prediction: a case study of China stock market. In: 2015 IEEE International Conference on Big Data (Big Data), pp. 2823–2824 (2015)
16. Bianchi, F.M., Maiorino, E., Kampffmeyer, M.C., Rizzi, A., Jenssen, R.: Recurrent Neural Networks for Short-Term Load Forecasting: An Overview and Comparative Analysis. SpringerBriefs in Computer Science. Springer (2017). https://books.google.com.au/books?id=wu09DwAAQBAJ
17. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling (2014)
18. Lei, T., Zhang, Y.: Training RNNs as Fast as CNNs. arXiv preprint arXiv:1709.02755 (2017)
19. Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000)
20. Hoff, P.D.: Multilinear tensor regression for longitudinal relational data. Ann. Appl. Stat. 9(3), 1169–1193 (2015)
21. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 652–663 (2017)
22. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
23. Fortunato, M., Blundell, C., Vinyals, O.: Bayesian Recurrent Neural Networks. arXiv preprint arXiv:1704.02798 (2017)

Chapter 10
Scalable Multimodal Factorization for Learning from Big Data

Quan Do and Wei Liu

Abstract In this chapter, we provide readers with knowledge about tensor factorization and the joint analysis of several correlated tensors. The increasing availability of multiple modalities, captured in correlated tensors, provides greater opportunities to analyze a complete picture of all the data patterns. Given large-scale datasets, existing distributed methods for the joint analysis of multi-dimensional data generated from multiple sources decompose them on several computing nodes following the MapReduce paradigm. We introduce a Scalable Multimodal Factorization (SMF) algorithm to analyze correlated Big multimodal data. It has two key features to enable Big multimodal data analysis. Firstly, the SMF design, based on Apache Spark, enables it to have the smallest communication cost. Secondly, its optimized solver converges faster. As a result, SMF’s performance is extremely efﬁcient as the data increases. Conﬁrmed by our experiments with one billion known entries, SMF outperforms the currently fastest coupled matrix tensor factorization and tensor factorization by 17.8 and 3.8 times, respectively. We show that SMF achieves this speed with the highest accuracy.

10.1 Introduction

Recent technological advances in data acquisition have brought new opportunities as well as new challenges [17] to research communities. Many new acquisition methods and sensors enable researchers to acquire multiple modes of information about the real world. This multimodal data can be naturally and efﬁciently represented by a multi-way structure, called a tensor, which can be analyzed to extract the underlying meaning of the observed data. The increasing availability of multiple modalities, captured in correlated tensors, provides greater opportunities to analyze a complete picture of all the data patterns.

Q. Do (*) · W. Liu
Advanced Analytics Institute, University of Technology Sydney, Chippendale, NSW, Australia
© Springer Nature Switzerland AG 2019
K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_10


The joint analysis of multimodal tensor data generated from different sources would provide a deeper understanding of the data's underlying structure [6, 8]. However, processing this huge amount of correlated data incurs a very heavy cost in terms of computation, communication, and storage. Traditional methods that operate on a local machine, such as coupled matrix tensor factorization (CMTF) [3], are either intractably slow or run out of memory. They are slow because they iteratively compute factors on the full coupled tensors many times; they run out of memory because the full coupled tensors cannot be loaded into a typical machine's local memory.

Both computationally efficient methods and scalable methods have been proposed to speed up the factorization of multimodal data. Concurrent processing on CPU cores [20, 21] or on massively parallel GPU architectures [27] computes faster, but it does not solve the problem of insufficient local memory to store the large amount of observed data. MapReduce distributed models [5, 13, 14, 23] overcame the memory problem by keeping the big files in a distributed file system. They also improved computational speed by having many computing nodes process the data in parallel.

Computing in parallel allows factors to be updated faster, yet the factorization faces a higher data communication cost if it is not well designed. One key weakness of MapReduce algorithms is that whenever a node needs data to process, the data is transferred from an isolated distributed file system to that node [22]. The iterative nature of factorization requires data and factors to be distributed over and over again, incurring a huge communication overhead. If the tensor size is doubled, the algorithm's performance is 2*T times worse (where T is the number of iterations). This limits the scalability of MapReduce-based methods.
In this chapter, we describe a more scalable multimodal factorization (SMF) model that improves on the performance of MapReduce-based factorization algorithms as the observed data becomes bigger. The aforementioned deficiencies of MapReduce-based algorithms can be overcome by minimizing the data transmission between computing nodes and choosing a fast-converging optimization method. The chapter begins with a background on tensor factorization (TF) in Sect. 10.2, where fundamental definitions and notations are introduced. Next, we review the existing work on multimodal factorization for Big data in Sect. 10.3. We then describe SMF in Sect. 10.4 in two parts: the first explains the observations behind processing by blocks and caching data on computing nodes and provides a theoretical analysis of the optimization process; the second shows how SMF can be scaled up to an unlimited input of multimodal data. Minimal communication cost and the capability to scale up are essential features of any method dealing with multimodal Big data. In Sect. 10.5, we demonstrate how SMF works by performing several tests with real-world multimodal data to evaluate its scalability, its convergence speed, its accuracy, and its performance with different optimization methods. Although research into scalable learning from multimodal Big data has developed quickly, the ideas explained here have consistently been used as the building blocks to achieve the scalability of developed algorithms.

10.2 Preliminary Concepts

This section provides a brief introduction of the core deﬁnitions and the preliminary concepts relating to tensors, tensor factorization, and coupled matrix tensor factorization.

10.2.1 Tensors and Our Notations

Multimodal data can be naturally represented as tensors, which are multidimensional arrays [7, 25]. They are often specified by their number of modes (a.k.a. orders or ways). Specifically, a mode-1 tensor is a vector and a mode-2 tensor is a matrix. A tensor of mode 3 or higher is often simply called a tensor. We denote tensors by boldface script letters, e.g., $\mathcal{X}$. A boldface script letter with indices in its subscript denotes an entry of a tensor. For example, $\mathcal{X}_{i_1, i_2, \ldots, i_N}$ is the $(i_1, i_2, \ldots, i_N)$th entry of $\mathcal{X}$. Table 10.1 lists all the symbols used in this chapter.

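Table 10.1 Symbols and their description

Symbol                                                      Description
$x$, $\mathbf{x}$, $\mathbf{X}$, $\mathcal{X}$              A scalar, a vector, a matrix, and a tensor
$\|\mathcal{X}\|$                                           Frobenius norm of $\mathcal{X}$
$\mathbf{X}^T$                                              Transpose of $\mathbf{X}$
$\mathbf{X}^{-1}$                                           Inverse or pseudo inverse of $\mathbf{X}$
$L$                                                         Loss function
$N$                                                         Mode of a tensor
$M$                                                         Number of machines
$R$                                                         Rank of decomposition
$K$                                                         Number of tensors
$T$                                                         Number of iterations
$I_1 \times I_2 \times \cdots \times I_N$                   Dimension of N-mode tensor $\mathcal{X}$
$|\Omega|$, $\mathcal{X}_{i_1, i_2, \ldots, i_N}$           Observed data size of $\mathcal{X}$ and its entries
$\mathcal{X}^{(n)}$                                         Mode nth of $\mathcal{X}$
$\mathcal{X}^{(n)}_{i_n}$                                   Slice $i_n$ of $\mathcal{X}^{(n)}$: all entries $\mathcal{X}_{*, \ldots, *, i_n, *, \ldots, *}$
$\mathbf{U}^{(n)}$                                          nth mode factor of $\mathcal{X}$
$\mathbf{u}^{(n)}_{i_n}$                                    $i_n$th row of factor $\mathbf{U}^{(n)}$
$\mathbf{V}^{(2)}$                                          2nd mode factor of $\mathbf{Y}$
$\mathbf{v}^{(2)}_{j_2}$                                    $j_2$th row of factor $\mathbf{V}^{(2)}$: all entries $\mathbf{V}^{(2)}_{*, j_2}$
$\mathbf{U}_1, \mathbf{U}_2, \ldots, \mathbf{U}_K$          Factors of tensors $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_K$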


10.2.2 Matrix Factorization

Matrix factorization (MF) decomposes a full matrix into much lower dimensional factor matrices. When the given matrix is incomplete, computing its exact decomposition is an intractable task. Thus, a more efficient approach is to approximate an incomplete matrix $\mathbf{Y} \in \mathbb{R}^{I_1 \times I_2}$ by the product $\mathbf{V}^{(1)} \mathbf{V}^{(2)T}$ of $\mathbf{V}^{(1)} \in \mathbb{R}^{I_1 \times R}$ and $\mathbf{V}^{(2)} \in \mathbb{R}^{I_2 \times R}$, where $R$ is the rank of the factorization. The optimal solution can be found by formulating it as a least squares error (LSE) optimization problem. Let $\mathbf{Y}_{i_1, i_2}$ be the known entries; the LSE model minimizes

$$L = \left\| \mathbf{V}^{(1)} \mathbf{V}^{(2)T} - \mathbf{Y} \right\|^2 = \sum_{i_1, i_2}^{I_1, I_2} \left( \sum_{r=1}^{R} \mathbf{V}^{(1)}_{i_1, r} \mathbf{V}^{(2)}_{i_2, r} - \mathbf{Y}_{i_1, i_2} \right)^2$$
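The LSE loss sums squared errors over the known entries only. The following is a minimal Python sketch of that computation, assuming each factor is stored as a list of rows (so that row $i_2$ of the second factor holds $\mathbf{V}^{(2)}_{i_2, r}$) and the known entries are kept in a dictionary. The names `mf_loss`, `V1`, `V2`, and `Y_OBSERVED` and the toy values are illustrative assumptions, not part of the chapter.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def mf_loss(v1: Matrix, v2: Matrix, observed: Dict[Tuple[int, int], float]) -> float:
    """Sum of squared errors between the rank-R model and the known entries of Y."""
    loss = 0.0
    for (i1, i2), y in observed.items():
        # Predicted entry: inner product of row i1 of V1 and row i2 of V2.
        pred = sum(a * b for a, b in zip(v1[i1], v2[i2]))
        loss += (pred - y) ** 2
    return loss


# Hypothetical rank-2 factors of a 2 x 3 matrix with three known entries.
V1 = [[1.0, 0.0], [0.0, 2.0]]
V2 = [[1.0, 1.0], [2.0, 0.0], [0.0, 1.0]]
Y_OBSERVED = {(0, 0): 1.0, (0, 1): 2.0, (1, 2): 1.0}

print(mf_loss(V1, V2, Y_OBSERVED))
```

Missing entries never enter the sum, which is why the cost of evaluating the loss grows with the number of observations rather than with $I_1 \times I_2$.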

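10.2.3 Tensor Factorization

Tensor factorization (TF) is an extension of MF to multidimensional data [4]. Under CANDECOMP/PARAFAC (CP) decomposition [12], TF expresses an N-order tensor as a sum of a finite number of low-rank factors, formulated as

$$L = \left\| \left\langle \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)} \right\rangle - \mathcal{X} \right\|^2$$

where $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is an N-mode tensor, its N rank-R factors are $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times R}$, $\forall n \in [1, N]$, and

$$\left\langle \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)} \right\rangle_{i_1, i_2, \ldots, i_N} = \sum_{r=1}^{R} \prod_{n=1}^{N} \mathbf{U}^{(n)}_{i_n, r}$$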
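The entry-wise CP formula can be evaluated directly. The sketch below, in Python, computes one reconstructed entry from N factor matrices stored as lists of rows. The function name `cp_entry` and the small factors are illustrative assumptions chosen so the arithmetic is easy to check by hand.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def cp_entry(factors: Sequence[Matrix], index: Sequence[int]) -> float:
    """Entry (i1, ..., iN) of the rank-R CP model built from N factor matrices."""
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        # Product over the modes of U^(n)[i_n][r].
        for factor, i_n in zip(factors, index):
            term *= factor[i_n][r]
        total += term
    return total


# Hypothetical rank-2 factors of a 2 x 2 x 1 tensor.
U1 = [[1.0, 2.0], [0.0, 1.0]]
U2 = [[1.0, 1.0], [2.0, 0.0]]
U3 = [[3.0, 1.0]]

print(cp_entry([U1, U2, U3], (0, 0, 0)))  # 1*1*3 + 2*1*1
```

With N = 2 the same function reduces to the matrix case of Sect. 10.2.2.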

10.2.4 Coupled Matrix Tensor Factorization

In addition to a main tensor, there is often other information. This additional data is usually transformed into a matrix that has one mode in common with the main tensor. Acar et al. [3] proposed a joint analysis of both datasets to improve the accuracy of this coupled matrix tensor decomposition. The authors performed this joint analysis with a coupled loss function:

$$L = \left\| \left\langle \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)} \right\rangle - \mathcal{X} \right\|^2 + \left\| \left\langle \mathbf{U}^{(1)}, \mathbf{V}^{(2)} \right\rangle - \mathbf{Y} \right\|^2$$

where $\mathcal{X}$ and $\mathbf{Y}$ are coupled in their first mode, $\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}$ are the N factors of $\mathcal{X}$, and $\mathbf{U}^{(1)}$ and $\mathbf{V}^{(2)}$ are the two factors of $\mathbf{Y}$. Note that CMTF assumes that $\mathbf{U}^{(1)}$ is the common factor of both $\mathcal{X}$ and $\mathbf{Y}$.
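To make the role of the shared factor concrete, the following Python sketch evaluates the coupled loss on observed entries only, reusing the first factor of the tensor when reconstructing the matrix. It is a sketch under stated assumptions: the names `cp_entry` and `cmtf_loss`, the dictionary layout of the observations, and all numeric values are hypothetical.

```python
from math import prod
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]
Observed = Dict[Tuple[int, ...], float]


def cp_entry(factors: Sequence[Matrix], index: Sequence[int]) -> float:
    """One entry of the CP model: sum over r of the product of U^(n)[i_n][r]."""
    rank = len(factors[0][0])
    return sum(prod(f[i][r] for f, i in zip(factors, index)) for r in range(rank))


def cmtf_loss(u_factors: Sequence[Matrix], v2: Matrix,
              x_obs: Observed, y_obs: Observed) -> float:
    """Coupled loss: tensor error plus matrix error, sharing u_factors[0]."""
    x_err = sum((cp_entry(u_factors, idx) - val) ** 2 for idx, val in x_obs.items())
    # The matrix Y is modelled with the common factor U^(1) and its own V^(2).
    y_err = sum((cp_entry([u_factors[0], v2], idx) - val) ** 2
                for idx, val in y_obs.items())
    return x_err + y_err


U = [
    [[1.0, 2.0], [0.0, 1.0]],  # U^(1), shared with Y
    [[1.0, 1.0], [2.0, 0.0]],  # U^(2)
    [[3.0, 1.0]],              # U^(3)
]
V2 = [[1.0, 0.0], [1.0, 1.0]]
X_OBS = {(0, 0, 0): 5.0, (0, 1, 0): 4.0}
Y_OBS = {(0, 1): 3.0, (1, 1): 0.0}

print(cmtf_loss(U, V2, X_OBS, Y_OBS))
```

Because `u_factors[0]` appears in both terms, any change to a row of the common factor affects the fit of both datasets at once.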

10.3 Literature Review

The effectiveness of matrix factorization (MF) was demonstrated notably in the Netflix Prize competition [16]. It decomposed an incomplete movie rating dataset in the form of a big N-user-by-M-movie matrix into two lower rank matrices, called factors. These factors were then used to reconstruct the full matrix and predict missing ratings with high accuracy. The success of MF motivated the research community to extend it to tensor factorization (TF) [15] to deal with multi-mode, high-dimensional, and big multimodal data. With the evolution of MF, researchers have attempted to solve three key problems: (1) how to integrate coupled information into the factorization process, (2) how to optimize factors, and (3) how to complete this computation in a reasonable time given these huge datasets. The following sections discuss these three issues.

10.3.1 Joint Analysis of Coupled Data

As new data acquisition methods are developed or new sensors are introduced over time, new modes of data can be collected. To deal with this multimodal analysis of joint datasets from heterogeneous sources, researchers once again extended TF to decompose two tensors or a tensor and a matrix with one correlated dimension. For example, the MovieLens dataset [11] includes ratings from users on movies over a period of time. This information can be represented as a three-dimensional tensor X of users by movies by weekdays whose entries are ratings. Furthermore, MovieLens also captures the users' identity. This additional information forms a matrix Y of users by user profiles. The first dimension of X is thus correlated with the first dimension of Y. Figure 10.1 visualizes this relationship. In this case, X is said to be coupled with Y in its first mode. The joint analysis of this multimodal data helps to deepen our understanding of the underlying patterns and to improve the accuracy of tensor decomposition.

Developing this idea, researchers have made several achievements. Early work [24] introduced collective matrix factorization (CMF) to take advantage of the correlations between different coupled matrices and factorize them simultaneously. CMF techniques have been successfully applied to capture the underlying complex structure of data [10]. Acar et al. [3] later expanded CMF to coupled matrix tensor factorization (CMTF) to factorize coupled heterogeneous datasets by modeling them as a higher-order tensor and a matrix in a coupled loss function. Researchers also proved the possibility of using these low-rank factors to recover missing entries. Papalexakis [20] illustrated the possibility of utilizing CMTF to predict brain activity from decomposed latent variables.

Fig. 10.1 Joint factorization of correlated multimodal data. (a) Correlation among different aspects of a dataset: X is a tensor of ratings made by users for movies on weekdays. Matrix Y represents user information. Movie rating tensor X is, therefore, coupled with user information matrix Y in the 'user' mode. (b) X is factorized as a sum of outer products of low-rank vectors $\mathbf{u}^{(1)}_i$, $\mathbf{u}^{(2)}_i$ and $\mathbf{u}^{(3)}_i$; matrix Y is decomposed as a sum of outer products of $\mathbf{u}^{(1)}_i$ and $\mathbf{v}^{(2)}_i$ (vector $\mathbf{u}^{(1)}_i$ is common between them)

10.3.2 Factorization Methodologies

A large number of different methodologies have been proposed to optimize TF and CMTF. The most popular is alternating least squares (ALS). In a nutshell, ALS solves the least squares problem for one factor while fixing the other ones. ALS does this iteratively, alternating between the fixed factors and the one it optimizes until the algorithm converges. Shin et al. [23] used ALS but updated a subset of a factor's columns at a time. ALS-based algorithms are computationally efficient but may converge slowly with sparse data [3].

Gradient-based optimization (GD), such as stochastic gradient descent, is an alternative to ALS. Starting with the initial factors' values, GD refines them by iterating the stochastic difference equation. Distributed GD for MF and TF was proposed to scale it up to a distributed environment [5, 10]. On one hand, GD is simple to implement. On the other hand, choosing good optimization parameters such as the learning rate is not straightforward [18]. A learning rate is usually decided based on experiments. Another approach is to conduct a backtracking search for the maximum decrease from the data. CMTF-OPT [3] used nonlinear conjugate gradient (NCG) with a line search to find an optimal decrease direction. However, a backtracking search is computationally heavy and is therefore not suitable for very Big data.
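The sensitivity of gradient-based methods to the learning rate can be seen on a single observed entry. The Python sketch below performs one plain stochastic gradient step for the MF loss of Sect. 10.2.2. It is a generic illustration, not the update rule of any of the cited systems, and the names `sgd_step` and `squared_error` as well as the toy values are assumptions.

```python
from typing import List, Tuple


def squared_error(v1_row: List[float], v2_row: List[float], y: float) -> float:
    """Squared error of one predicted entry against its observed value."""
    pred = sum(a * b for a, b in zip(v1_row, v2_row))
    return (pred - y) ** 2


def sgd_step(v1_row: List[float], v2_row: List[float], y: float,
             lr: float) -> Tuple[List[float], List[float]]:
    """One gradient step on (v1_row . v2_row - y)^2 for both rows."""
    err = sum(a * b for a, b in zip(v1_row, v2_row)) - y
    # d/dv1 = 2 * err * v2 and d/dv2 = 2 * err * v1, both at the old values.
    new_v1 = [a - lr * 2.0 * err * b for a, b in zip(v1_row, v2_row)]
    new_v2 = [b - lr * 2.0 * err * a for a, b in zip(v1_row, v2_row)]
    return new_v1, new_v2


ROW_1, ROW_2, TARGET = [1.0, 1.0], [1.0, 1.0], 1.0

small = sgd_step(ROW_1, ROW_2, TARGET, lr=0.1)
large = sgd_step(ROW_1, ROW_2, TARGET, lr=1.0)
print(squared_error(*small, TARGET), squared_error(*large, TARGET))
```

In this toy case the small step reduces the error, while the large step overshoots and leaves the squared error unchanged, which is the kind of behavior that makes the learning rate hard to choose without experiments.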

10.3.3 Distributed Factorization

ALS and GD have proved their effectiveness in optimizing TF and CMTF, and they work well for small data. As applications of TF and CMTF usually deal with many gigabytes of data, researchers have focused on developing distributed algorithms. The nature of TF and CMTF requires the same computation to be done on different sets of data. Consequently, several levels of data parallelism have been proposed.

Data parallelism in a multiprocessor system divides large tasks into many identical small subtasks; each subtask is performed on one processor. Turbo-SMT [20] followed this direction by sampling sparse coupled tensors into several tiny coupled tensors, concurrently decomposing them into factors using the Matlab Parallel Computing Toolbox and then merging the resulting factors. Another approach is GPUTensor [27], which utilized the multiprocessors of a GPU for factor computation. Even though these methods improved the factorization speed significantly, they only performed tasks on a single machine (albeit one with multiple processors or the many cores of a powerful GPU). Thus, if the data is too big to be loaded into the local machine, they run out of memory.

Distributed data parallelism scales better with the data size and makes use of distributed processors to perform the calculations. Additionally, Big data inputs are often stored in a distributed file system, which can theoretically store large files of any size. SCouT proposed by Soo et al. [13], FlexiFact by Beutel et al. [5], and GigaTensor by Kang et al. [14] defined factorization as MapReduce processes. If calculations are to be done on a distributed processor, the corresponding part of the data needs to be transmitted from the distributed file system to that processor. This process repeats at every iteration of the algorithm, incurring a heavy overhead. Shin et al. [23] introduced SALS to overcome this weakness of the MapReduce framework by caching data on the computing nodes' local disks. SALS reduced the communication overhead significantly. Yet this communication can be reduced even more: as the data is stored on disks, reading it into memory for each access takes time, especially for huge datasets and many iterations.

All the algorithms with different levels of data parallelism are placed in an x–y coordinate system, as shown in Fig. 10.2. Naturally, there is no algorithm in quadrant IV, as data computed on a single machine is normally located locally. Algorithms in quadrant III perform calculations within a local system with local data. As all the data is located in local memory, these algorithms will struggle as the data size increases.
Those in quadrant II are distributed algorithms in which data is centralized in a distributed file server. These scale quite well as the data increases. Nevertheless, centralized data may be problematic. As data is stored in an isolated server, it is transmitted to computing nodes for each calculation. This communication overhead is therefore one of their biggest disadvantages. SALS [23] in quadrant I overcame the heavy communication overhead by caching data on local disks.

Fig. 10.2 Distributed factorization algorithms on a data-computation coordinate. The x-axis represents the level of data distribution from the data located in a centralized memory or file server on the left to the data distributed to the computing nodes' memory on the right. The y-axis captures the level of distributed computation from algorithms processed on a local machine to those processed in a distributed cluster (bottom to top)
[Figure content: quadrant I (distributed systems, distributed data): SALS [23]; quadrant II (distributed systems, centralized data): FlexiFact [5], SCouT [13], GigaTensor [14]; quadrant III (local systems, centralized data): Turbo-SMT [20], CMTF-OPT [3], GPUTensor [27]; quadrant IV: empty]

10.4 SMF: Scalable Multimodal Factorization

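In this section, we introduce SMF for the joint analysis of several N-mode tensors with one or more modes in common. Let $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ be a mode-N tensor. $\mathcal{X}$ has at most N coupled matrices or tensors. Without loss of generality, we first explain the case where $\mathcal{X}$ and another matrix $\mathbf{Y} \in \mathbb{R}^{I_1 \times J_2}$ are coupled in their first modes. The joint analysis of more than two tensors is discussed in Sect. 10.4.5. Based on the coupled matrix tensor factorization of $\mathcal{X}$ and $\mathbf{Y}$ whose first modes are correlated as in Sect. 10.2.4, SMF decomposes $\mathcal{X}$ into $\mathbf{U}^{(1)} \in \mathbb{R}^{I_1 \times R}$, $\mathbf{U}^{(2)} \in \mathbb{R}^{I_2 \times R}$, ..., $\mathbf{U}^{(N)} \in \mathbb{R}^{I_N \times R}$ and $\mathbf{Y}$ into $\mathbf{U}^{(1)} \in \mathbb{R}^{I_1 \times R}$ and $\mathbf{V}^{(2)} \in \mathbb{R}^{J_2 \times R}$, where $\mathbf{U}^{(1)}$ is the common factor and $R$ is the decomposition rank:

$$L = \left\| \left\langle \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)} \right\rangle - \mathcal{X} \right\|^2 + \left\| \left\langle \mathbf{U}^{(1)}, \mathbf{V}^{(2)} \right\rangle - \mathbf{Y} \right\|^2 \tag{10.1}$$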

Observation 1 Approximating each row of one factor while fixing the other factors reduces the complexity of CMTF.

Let $\mathbf{U}^{(-k)} = \langle \mathbf{U}^{(1)}, \ldots, \mathbf{U}^{(k-1)}, \mathbf{U}^{(k+1)}, \ldots, \mathbf{U}^{(N)} \rangle$. Based on Observation 1, instead of finding $\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \ldots, \mathbf{U}^{(N)}$ and $\mathbf{V}^{(2)}$ that minimize the loss function (Eq. 10.1), our problem can be formulated as optimizing every single row $\mathbf{u}^{(1)}_{i_1}$ of the coupled factor

$$L = \sum_{i_2, \ldots, i_N} \left( \sum_{r=1}^{R} \mathbf{u}^{(1)}_{i_1, r} \mathbf{U}^{(-1)}_{i_2, \ldots, i_N, r} - \mathcal{X}^{(1)}_{i_1, i_2, \ldots, i_N} \right)^2 + \sum_{j_2} \left( \sum_{r=1}^{R} \mathbf{u}^{(1)}_{i_1, r} \mathbf{V}^{(2)}_{j_2, r} - \mathbf{Y}^{(1)}_{i_1, j_2} \right)^2 \tag{10.2}$$

minimizing each row $\mathbf{u}^{(n)}_{i_n}$ of a non-coupled factor $\mathbf{U}^{(n)}$ ($n > 1$)

$$L = \sum_{i_1, \ldots, i_N} \left( \sum_{r=1}^{R} \mathbf{u}^{(n)}_{i_n, r} \mathbf{U}^{(-n)}_{i_1, \ldots, i_N, r} - \mathcal{X}^{(n)}_{i_1, i_2, \ldots, i_N} \right)^2 \tag{10.3}$$

where $\mathbf{u}^{(n)}_{i_n}$ is the variable we want to find while fixing the other terms and $\mathcal{X}_{i_1, i_2, \ldots, i_N}$ are the observed entries of $\mathcal{X}$, and minimizing each row $\mathbf{v}^{(2)}_{j_2}$ of a non-coupled factor $\mathbf{V}^{(2)}$

$$L = \sum_{i_1} \left( \sum_{r=1}^{R} \mathbf{U}^{(1)}_{i_1, r} \mathbf{v}^{(2)}_{j_2, r} - \mathbf{Y}^{(2)}_{i_1, j_2} \right)^2 \tag{10.4}$$

where $\mathbf{v}^{(2)}_{j_2}$ is the variable we want to find while fixing the other terms and $\mathbf{Y}_{i_1, j_2}$ are the observed entries of $\mathbf{Y}$.

According to Eq. (10.2), Do et al. [9] show that computing a row $\mathbf{u}^{(1)}_{i_1}$ of the coupled factor while fixing the other factors requires the observed entries of $\mathcal{X}$ and those of $\mathbf{Y}$ that are located in the slices $\mathcal{X}^{(1)}_{i_1, i_2, \ldots, i_N}$ and $\mathbf{Y}^{(1)}_{i_1, j_2}$, respectively. Figure 10.3 illustrates these tensor slices for calculating each row of any factor. Similarly, Eq. (10.3) suggests a slice $\mathcal{X}^{(n)}_{i_1, i_2, \ldots, i_N}$ for updating a corresponding row $\mathbf{u}^{(n)}_{i_n}$, and Eq. (10.4) suggests $\mathbf{Y}^{(2)}_{i_1, j_2}$ for updating a corresponding row $\mathbf{v}^{(2)}_{j_2}$.

Definition 1 Two slices $\mathcal{X}^{(n)}_{i}$ and $\mathcal{X}^{(n)}_{i'}$ are independent if and only if $\forall x \in \mathcal{X}^{(n)}_{i}$, $\forall x' \in \mathcal{X}^{(n)}_{i'}$ and $i \neq i'$ then $x \neq x'$.

Fig. 10.3 Tensor slices for updating each row of U(1), U(2), U(3), and V(2) when the input tensors are a mode-3 tensor X coupled with a matrix Y in their first mode. (a) Coupled slices in the first mode of both X(1) and Y(1) required for updating a row of U(1). (b) A slice in the second mode of X(2) for updating a row of U(2). (c) A slice in the third mode of X(3) for updating a row of U(3). (d) A slice in the second mode of Y(2) for updating a row of V(2)


Observation 2 Row updates for each factor as in Eqs. (10.2), (10.3) and (10.4) require independent tensor slices; each of these non-overlapping parts can be processed in parallel.

Figure 10.3a shows that $\mathcal{X}^{(1)}_{i}$, $\mathbf{Y}^{(1)}_{i}$ for updating $\mathbf{U}^{(1)}_{i}$ and $\mathcal{X}^{(1)}_{i'}$, $\mathbf{Y}^{(1)}_{i'}$ for updating $\mathbf{U}^{(1)}_{i'}$ are non-overlapping $\forall i, i' \in [1, I_1]$ and $i \neq i'$. Consequently, all rows of U(1) are independent and can be updated concurrently. The same parallel updates apply to all rows of U(2), ..., U(N), and V(2).
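Observation 2 can be illustrated on a single machine. The Python sketch below checks that a set of first-mode slices shares no observed entry, which is one reading of Definition 1, and then processes the slices concurrently with a thread pool. The thread pool merely stands in for distributed workers, and `process_slice` is a placeholder for a row update; all names and values are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Dict, Tuple

Slice = Dict[Tuple[int, ...], float]

# Hypothetical first-mode slices of a tiny mode-3 tensor: slice id -> entries.
SLICES: Dict[int, Slice] = {
    0: {(0, 0, 1): 0.4, (0, 2, 0): 1.0},
    1: {(1, 0, 0): 0.6},
    2: {(2, 2, 1): 0.8},
}


def are_independent(slices: Dict[int, Slice]) -> bool:
    """True when no observed entry belongs to two different slices."""
    return all(not (set(a) & set(b)) for a, b in combinations(slices.values(), 2))


def process_slice(item: Tuple[int, Slice]) -> Tuple[int, float]:
    """Placeholder for a row update: it only sums the values of the slice."""
    slice_id, entries = item
    return slice_id, sum(entries.values())


def process_all(slices: Dict[int, Slice]) -> Dict[int, float]:
    """Process every slice concurrently; results are keyed by slice id."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        return dict(pool.map(process_slice, slices.items()))


RESULT = process_all(SLICES)
print(are_independent(SLICES), sorted(RESULT))
```

Since the slices share no entries, no locking is needed, and the result does not depend on the order in which workers finish.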

10.4.1 SMF on Apache Spark

This section discusses distributed SMF for large-scale datasets.

Observation 3 The most critical performance bottleneck of any distributed CMTF algorithm is transferring a large-scale dataset to computing nodes at each iteration.

As with Observation 2, optimizing rows of factors can be done in parallel on distributed nodes; each one needs a tensor slice and the other fixed factors. Existing distributed algorithms, such as FlexiFact [5], GigaTensor [14], and SCouT [13], store input tensors in a distributed file system. Computing any factor requires the corresponding data to be transferred to the processing nodes. Because of the iterative nature of the CP model, this large-scale tensor distribution repeats, causing a heavy communication overhead. SMF eliminates this huge data transmission cost by robustly caching the required data in the memory of the processing nodes.

SMF partitions input tensors and localizes them in the computing nodes' memory. It is based on Apache Spark because Spark natively supports local data caching with its resilient distributed datasets (RDD) [26]. In a nutshell, an RDD is a collection of data partitioned across computational nodes. Any transformation or operation (map, reduce, foreach, ...) on an RDD is done in parallel. As the data partition is located in the processing nodes' memory, revisiting this data many times over the algorithm's iterations does not incur any communication overhead. SMF designs RDD variables and chooses the optimization method carefully to maximize the RDD's capability.

10.4.2 Block Processing

SMF processes blocks of slices to enhance efficiency. As observed in Fig. 10.3, a coupled slice ($\mathcal{X}^{(1)}_{i_1}$, $\mathbf{Y}^{(1)}_{i_1}$) is required for updating a row $\mathbf{U}^{(1)}_{i_1}$. Slices $\mathcal{X}^{(2)}_{i_2}$, $\mathcal{X}^{(3)}_{i_3}$, and $\mathbf{Y}^{(2)}_{j_2}$ are for updating $\mathbf{U}^{(2)}_{i_2}$, $\mathbf{U}^{(3)}_{i_3}$ and $\mathbf{V}^{(2)}_{j_2}$, respectively. On one hand, it is possible to work on every single slice of the $I_1$ slices $\mathcal{X}^{(1)}_{i_1}$ and $\mathbf{Y}^{(1)}_{i_1}$, the $I_2$ slices $\mathcal{X}^{(2)}_{i_2}$, the $I_3$ slices $\mathcal{X}^{(3)}_{i_3}$ and the $J_2$ slices $\mathbf{Y}^{(2)}_{j_2}$ separately. On the other hand, dividing data into too many small parts is not a wise choice, as the time needed for job scheduling may exceed the computational time. Thus, merging several slices into non-overlapping blocks and working on them in parallel increases efficiency. An example of this grouping is presented in Fig. 10.4, as illustrated in [9].
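The grouping of slices into blocks can be sketched in a few lines of Python. The chapter does not state how slices are assigned to blocks, so the modulo rule used here is purely an assumption for illustration, as are the names `merge_into_blocks`, `SLICES`, and `BLOCKS`.

```python
from typing import Dict, List, Tuple

Entry = Tuple[Tuple[int, ...], float]
Slices = Dict[int, List[Entry]]


def merge_into_blocks(slices: Slices, num_blocks: int) -> List[Slices]:
    """Merge slices into num_blocks non-overlapping blocks.

    Assumed rule: slice i goes to block i % num_blocks. Each slice lands in
    exactly one block, so the blocks stay independent of each other.
    """
    blocks: List[Slices] = [dict() for _ in range(num_blocks)]
    for slice_id, entries in slices.items():
        blocks[slice_id % num_blocks][slice_id] = entries
    return blocks


# Six hypothetical first-mode slices with one observed entry each.
SLICES: Slices = {i: [((i, 0, 0), 1.0)] for i in range(6)}
BLOCKS = merge_into_blocks(SLICES, 2)

print([sorted(block) for block in BLOCKS])
```

With two blocks instead of six slices, a scheduler has to launch only two tasks per factor update, which is the trade-off between parallelism and scheduling overhead described above.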

10.4.3 N Copies of an N-mode Tensor Caching

SMF caches one copy of each input per mode: N copies of an N-mode tensor X and, likewise, one copy of Y for each of its modes. As observed in Fig. 10.4, blocks of CB(1) are used for updating the first mode factor U(1); blocks of B1(2) and B2(2) are used for the second mode factors U(2) and V(2), respectively; blocks of B1(3) are used for the third mode factor U(3). Thus, to totally eliminate data transmission, all of these blocks should be cached. This duplicated caching needs more memory, yet it does not require a huge memory extension, as we can add more processing nodes.

Fig. 10.4 Dividing coupled matrix and tensor into non-overlapping blocks. (a) Coupled blocks CB(1) (B1(1), B2(1)) in the first mode of both X and Y for updating U(1). (b) Blocks in the second mode of X for updating U(2) and those in the second mode of Y for V(2). (c) Blocks in the third mode for U(3). All blocks are independent and can be processed concurrently

Pseudocode for creating these copies as RDD variables in Apache Spark is given in Function createRDD(). Lines 1 and 2 create two RDDs of strings (i1, ..., iN, value) for X and Y. The entries of the RDDs are automatically partitioned across processing nodes. These strings are converted into N key-value pairs ⟨in, (i1, ..., iN, value)⟩, one for each mode, in lines 4 and 9. Lines 5 and 10 group the results into slices, as illustrated in Fig. 10.3. These slices are then merged into blocks (lines 6 and 11) to be cached in working nodes (lines 7 and 12). Coupled blocks are created by joining corresponding blocks of X(1) and Y(1) in line 13. It is worth noting that these transformations (lines 4 to 7 and lines 9 to 12) are performed concurrently on the parts of the RDDs located in each working node.
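The steps described for createRDD() can be emulated without Spark. The Python sketch below keys every observed entry by its index in each mode, groups the entries into slices, and joins the first-mode slices of X and Y. It is not the chapter's Scala implementation: plain dictionaries replace RDDs, nothing is cached or distributed, slices with observations in only one of the two inputs are kept (an assumption), and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[Tuple[int, ...], float]
Slices = Dict[int, List[Entry]]


def keyed_by_mode(entries: List[Entry], mode: int) -> Slices:
    """One copy of a tensor: entries grouped by their index in `mode` (0-based)."""
    grouped: Dict[int, List[Entry]] = defaultdict(list)
    for index, value in entries:
        grouped[index[mode]].append((index, value))
    return dict(grouped)


def create_copies(entries: List[Entry], num_modes: int) -> List[Slices]:
    """N keyed copies of an N-mode tensor, one per mode."""
    return [keyed_by_mode(entries, n) for n in range(num_modes)]


def join_coupled(x_first: Slices, y_first: Slices) -> Dict[int, Tuple[List[Entry], List[Entry]]]:
    """Pair the first-mode slices of X and Y that share a coupled index."""
    keys = sorted(set(x_first) | set(y_first))
    return {k: (x_first.get(k, []), y_first.get(k, [])) for k in keys}


X = [((0, 1, 0), 0.4), ((1, 0, 1), 0.8), ((1, 1, 1), 1.0)]
Y = [((0, 2), 1.0), ((1, 0), 1.0)]

x_copies = create_copies(X, 3)
y_copies = create_copies(Y, 2)
coupled = join_coupled(x_copies[0], y_copies[0])

print(len(x_copies), len(y_copies), sorted(coupled))
```

Each copy holds the same observations under a different key, which is why the memory needed for the data grows with the number of modes.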

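10.4.4 Optimized Solver

SMF uses a closed form solution for solving each row of the factors. This optimizer not only converges faster but also helps to achieve higher accuracy.

Theorem 1 The optimal row $\mathbf{u}^{(1)}_{i_1}$ of a coupled factor $\mathbf{U}^{(1)}$ is computed by

$$\mathbf{u}^{(1)}_{i_1} = \left( \mathbf{A}^T \mathbf{A} + \mathbf{C}^T \mathbf{C} \right)^{-1} \left( \mathbf{A}^T \mathbf{b}_{i_1} + \mathbf{C}^T \mathbf{d}_{i_1} \right) \tag{10.5}$$

where $\mathbf{b}_{i_1}$ is a column vector of all observed $\mathcal{X}^{(1)}_{i_1}$; $\mathbf{d}_{i_1}$ is a column vector of all observed $\mathbf{Y}^{(1)}_{i_1}$; $\mathbf{A}$ and $\mathbf{C}$ are all $\mathbf{U}^{(-1)}$ and $\mathbf{V}^{(2)}_{j_2}$ with respect to all observed $\mathbf{b}_{i_1}$ and $\mathbf{d}_{i_1}$, respectively.

Proof Equation (10.2) can be written for each row of $\mathbf{U}^{(1)}$ as

$$L = \sum_{i_1} \left\| \mathbf{A} \mathbf{u}^{(1)}_{i_1} - \mathbf{b}_{i_1} \right\|^2 + \sum_{i_1} \left\| \mathbf{C} \mathbf{u}^{(1)}_{i_1} - \mathbf{d}_{i_1} \right\|^2$$

Let $\mathbf{x}$ be the optimal $\mathbf{u}^{(1)}_{i_1}$; then $\mathbf{x}$ can be derived by setting the derivative of $L$ with respect to $\mathbf{x}$ to zero. Hence,

$$L = (\mathbf{A}\mathbf{x} - \mathbf{b}_{i_1})^T (\mathbf{A}\mathbf{x} - \mathbf{b}_{i_1}) + (\mathbf{C}\mathbf{x} - \mathbf{d}_{i_1})^T (\mathbf{C}\mathbf{x} - \mathbf{d}_{i_1})$$
$$= \mathbf{x}^T \mathbf{A}^T \mathbf{A} \mathbf{x} - 2 \mathbf{b}_{i_1}^T \mathbf{A} \mathbf{x} + \mathbf{b}_{i_1}^T \mathbf{b}_{i_1} + \mathbf{x}^T \mathbf{C}^T \mathbf{C} \mathbf{x} - 2 \mathbf{d}_{i_1}^T \mathbf{C} \mathbf{x} + \mathbf{d}_{i_1}^T \mathbf{d}_{i_1}$$
$$\frac{\partial L}{\partial \mathbf{x}} = 2 \mathbf{A}^T \mathbf{A} \mathbf{x} - 2 \mathbf{A}^T \mathbf{b}_{i_1} + 2 \mathbf{C}^T \mathbf{C} \mathbf{x} - 2 \mathbf{C}^T \mathbf{d}_{i_1} = 0$$
$$\Leftrightarrow \left( \mathbf{A}^T \mathbf{A} + \mathbf{C}^T \mathbf{C} \right) \mathbf{x} = \mathbf{A}^T \mathbf{b}_{i_1} + \mathbf{C}^T \mathbf{d}_{i_1}$$
$$\Leftrightarrow \mathbf{x} = \left( \mathbf{A}^T \mathbf{A} + \mathbf{C}^T \mathbf{C} \right)^{-1} \left( \mathbf{A}^T \mathbf{b}_{i_1} + \mathbf{C}^T \mathbf{d}_{i_1} \right)$$

Theorem 2 The optimal row $\mathbf{u}^{(n)}_{i_n}$ of a non-coupled factor $\mathbf{U}^{(n)}$ is computed by

$$\mathbf{u}^{(n)}_{i_n} = \left( \mathbf{A}^T \mathbf{A} \right)^{-1} \mathbf{A}^T \mathbf{b}_{i_n} \tag{10.6}$$

where $\mathbf{b}_{i_n}$ is a column vector of all observed $\mathcal{X}^{(n)}_{i_n}$; $\mathbf{A}$ is all $\mathbf{U}^{(-n)}$ with respect to all observed $\mathbf{b}_{i_n}$.

Proof Similar to the proof of Theorem 1, $\mathbf{u}^{(n)}_{i_n}$ minimizes Eq. (10.3) when the derivative with respect to it goes to 0:

$$\frac{\partial L}{\partial \mathbf{u}^{(n)}_{i_n}} = 2 \mathbf{A}^T \mathbf{A} \mathbf{u}^{(n)}_{i_n} - 2 \mathbf{A}^T \mathbf{b}_{i_n} = 0 \Leftrightarrow \mathbf{u}^{(n)}_{i_n} = \left( \mathbf{A}^T \mathbf{A} \right)^{-1} \mathbf{A}^T \mathbf{b}_{i_n}$$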

Performing pseudo inversion in Eqs. (10.5) and (10.6) is expensive. Nevertheless, as $\mathbf{A}^T\mathbf{A} + \mathbf{C}^T\mathbf{C}$ and $\mathbf{A}^T\mathbf{A}$ are small square matrices in $\mathbb{R}^{R \times R}$, a more efficient operation is to use Cholesky decomposition for computing the pseudo inversion and solving for $\mathbf{u}^{(n)}_{i_n}$, as in Algorithm 1. At each iteration, SMF first broadcasts all the newly updated factors to all processing nodes. Then each factor is computed. While SMF updates each factor of each tensor sequentially, each row of a factor is computed in parallel by either updateFactor() (lines 10 and 12) or updateCoupledFactor() (line 7). These two functions are processed concurrently by different computing nodes with their cached data blocks. These steps are iterated to update the factors until the algorithm converges.
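The closed-form row updates of Eqs. (10.5) and (10.6) amount to solving a small R × R linear system. The Python sketch below forms the normal equations and solves them with a hand-written Cholesky factorization followed by forward and backward substitution. It assumes that the system matrix is positive definite and that each row of A (or C) already holds the fixed-factor values for one observed entry; the function names and the toy data are illustrative and not taken from the chapter's implementation.

```python
from math import sqrt
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def gram(rows: Matrix) -> Matrix:
    """A^T A for a matrix given as a list of rows."""
    r = len(rows[0])
    return [[sum(row[p] * row[q] for row in rows) for q in range(r)] for p in range(r)]


def at_b(rows: Matrix, b: Vector) -> Vector:
    """A^T b for a matrix given as a list of rows."""
    r = len(rows[0])
    return [sum(row[p] * y for row, y in zip(rows, b)) for p in range(r)]


def cholesky_solve(g: Matrix, rhs: Vector) -> Vector:
    """Solve g x = rhs for a symmetric positive definite g via g = L L^T."""
    n = len(g)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = g[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = sqrt(s) if i == j else s / low[j][j]
    z = [0.0] * n  # forward substitution: L z = rhs
    for i in range(n):
        z[i] = (rhs[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i]
    x = [0.0] * n  # backward substitution: L^T x = z
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def update_row(a: Matrix, b: Vector) -> Vector:
    """Non-coupled row update, Eq. (10.6)."""
    return cholesky_solve(gram(a), at_b(a, b))


def update_coupled_row(a: Matrix, b: Vector, c: Matrix, d: Vector) -> Vector:
    """Coupled row update, Eq. (10.5)."""
    ga, gc = gram(a), gram(c)
    g = [[ga[p][q] + gc[p][q] for q in range(len(ga))] for p in range(len(ga))]
    rhs = [p + q for p, q in zip(at_b(a, b), at_b(c, d))]
    return cholesky_solve(g, rhs)


# Toy data generated from the row [1, 2], so both updates should recover it.
A, B = [[1.0, 0.0], [1.0, 1.0]], [1.0, 3.0]
C, D = [[0.0, 1.0], [2.0, 1.0]], [2.0, 4.0]
print(update_coupled_row(A, B, C, D), update_row(A, B))
```

Only the R × R Gram matrix is factorized, so the cost of the solve depends on the rank and not on the number of observations in the slice.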


Algorithm 1: SMF with Data Parallelism
Input: cached CB(1), B1(2), ..., B1(N), B2(2), ε
Output: U(1), ..., U(N), V(2)
 1  Initialize L by a small number
 2  Randomly initialize all factors
 3  repeat
 4      Broadcast all factors
 5      PreL = L
 6      // coupled Factor
 7      U(1) ← updateCoupledFactor(CB(1), 1)
 8      // non-coupled Factor U
 9      foreach mode n ∈ [2, N] do
10          U(n) ← updateFactor(B1(n), n)
11      // non-coupled Factor V
12      V(2) ← updateFactor(B2(2), 2)
13      Compute L following Eq. (10.1)
14  until ((PreL − L) / PreL < ε)

Theorem 3 The computational complexity of Algorithm 1 is

$$O\left( T \sum_{k=1}^{K} \sum_{n=1}^{N_k} \left( \frac{|\Omega_k|}{M} (N_k + R) R + \frac{I_{k_n}}{M} R^3 \right) \right)$$

Proof An N-mode tensor requires finding N factors. A factor is updated in either lines 1–4 of the function updateFactor() or lines 1–6 of updateCoupledFactor(). Lines 1–4 prepare $\mathbf{A}$, compute $\mathbf{A}^T\mathbf{A}$ and $\mathbf{A}^T\mathbf{b}_i$, and perform Cholesky decompositions. Lines 1–6 double the preparation of $\mathbf{A}$ and the computations of $\mathbf{A}^T\mathbf{A}$ and $\mathbf{A}^T\mathbf{b}_i$. Computing $\mathbf{A}$ requires $\frac{|\Omega|}{M}(N-1)R$ operations, while $\mathbf{A}^T\mathbf{A}$ and $\mathbf{A}^T\mathbf{b}_i$ take $\frac{|\Omega|}{M}(R-1)R$ each. The Cholesky decomposition of $\frac{I_n}{M}$ matrices in $\mathbb{R}^{R \times R}$ is $O\left(\frac{I_n}{M} R^3\right)$. Updating a factor requires $O\left(\frac{|\Omega|}{M}(N+R)R + \frac{I_n}{M}R^3\right)$; all factors of K tensors take $O\left(\sum_{k=1}^{K}\sum_{n=1}^{N_k}\left(\frac{|\Omega_k|}{M}(N_k+R)R + \frac{I_{k_n}}{M}R^3\right)\right)$. These steps may iterate T times. Therefore, the computational complexity of Algorithm 1 is

$$O\left( T \sum_{k=1}^{K} \sum_{n=1}^{N_k} \left( \frac{|\Omega_k|}{M} (N_k + R) R + \frac{I_{k_n}}{M} R^3 \right) \right).$$

Function updateFactor(B, n)
Output: U(n)
1  B.map(
2      ⟨i, (i1, ..., iN, bi)⟩ ← Bi
3      A ← U(−n)
4      Compute ui(n) by (10.6)
5  )
6  Collect result and merge to U(n)

Function updateCoupledFactor(CB, n)
Output: U(n)
1  CB.map(
2      ⟨i, (i1, ..., iN, bi)⟩ ← B1i
3      A ← U(−n)
4      ⟨i, (i1, j2, di)⟩ ← B2i
5      C ← V(2)
6      Compute ui(n) by (10.5)
7  )
8  Collect result and merge to U(n)

Theorem 4 The communication complexity of Algorithm 1 is

$$O\left( T \sum_{k=1}^{K} \sum_{n=1}^{N_k} \left( I_{k_n} R^3 \right) \right).$$

Proof At each iteration, Algorithm 1 broadcasts $O\left(\sum_{k=1}^{K} N_k\right)$ factors to M machines (line 4). As the broadcast in Apache Spark is done using the BitTorrent technique, each broadcast to M machines takes $O\left(\sum_{k=1}^{K}\sum_{n=1}^{N_k}\left(I_{k_n} R^3\right)\right)$. The total of T broadcasts requires $O\left(T\sum_{k=1}^{K}\sum_{n=1}^{N_k}\left(I_{k_n} R^3\right)\right)$.

Theorem 5 The space complexity of Algorithm 1 is

$$O\left( \frac{\sum_{k=1}^{K} |\Omega_k| N_k}{M} + \sum_{k=1}^{K} \sum_{n=1}^{N_k} \left( I_{k_n} R \right) \right).$$

Proof Each computing node stores blocks of tensor data and all the factors. Firstly, $N_k$ copies of $\frac{|\Omega_k|}{M}$ observations of the kth tensor need to be stored on each node, requiring $O\left(\frac{|\Omega_k| N_k}{M}\right)$. So, K tensors take $O\left(\frac{\sum_{k=1}^{K}|\Omega_k| N_k}{M}\right)$. Secondly, storing all factors in each node requires $O\left(\sum_{k=1}^{K}\sum_{n=1}^{N_k}\left(I_{k_n} R\right)\right)$. Therefore, the space complexity is

$$O\left( \frac{\sum_{k=1}^{K} |\Omega_k| N_k}{M} + \sum_{k=1}^{K} \sum_{n=1}^{N_k} \left( I_{k_n} R \right) \right).$$
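The expressions in Theorems 4 and 5 can be evaluated for concrete sizes to see which term dominates. The Python sketch below plugs numbers into the two formulas as stated, ignoring the constants hidden by the O-notation, so the results are counts of entries and not bytes or seconds. The function names and the example sizes are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

# Each tensor is described by (number of observed entries, sizes of its modes).
Tensor = Tuple[int, Sequence[int]]


def space_per_node(tensors: List[Tensor], machines: int, rank: int) -> float:
    """Theorem 5: cached observations per node plus all factors on every node."""
    data = sum(n_obs * len(dims) for n_obs, dims in tensors) / machines
    factors = sum(sum(dims) * rank for _, dims in tensors)
    return data + factors


def communication(tensors: List[Tensor], iterations: int, rank: int) -> int:
    """Theorem 4 as stated: T times the sum over all modes of I_kn * R^3."""
    return iterations * sum(sum(dims) * rank ** 3 for _, dims in tensors)


# Hypothetical sizes: a 10 x 10 x 10 tensor with 90 observations coupled
# with a 10 x 5 matrix with 18 observations, on 9 machines with rank 2.
TENSORS: List[Tensor] = [(90, (10, 10, 10)), (18, (10, 5))]

print(space_per_node(TENSORS, machines=9, rank=2))
print(communication(TENSORS, iterations=5, rank=2))
```

Only the first term of the space expression shrinks as machines are added, since every node keeps a full copy of the factors.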

10.4.5 Scaling Up to K Tensors

The implementation of Algorithm 2 supports K N-mode tensors. In this case, K tensors have (K − 1) coupled blocks CB(1), ..., CB(K−1). The algorithm checks which mode of the main tensor is the coupled mode and applies the updateCoupledFactor() function with the corresponding coupled blocks.

Algorithm 2: SMF for K tensors where the first mode of X1 is coupled with the first mode of X2, ..., and the (K − 1)th mode of X1 is coupled with the first mode of XK.
Input: cached CB(1), ..., CB(K−1), B1(2), ..., B1(N1), ..., BK(2), ..., BK(NK), ε
Output: U1(1), ..., U1(N1), ..., UK(2), ..., UK(NK)
 1  Initialize L by a small number
 2  Randomly initialize all factors
 3  repeat
 4      Broadcast all factors
 5      PreL = L
 6      foreach tensor k ∈ [1, K] do
 7          if (k is the main tensor) then
 8              foreach mode n ∈ [1, Nk] do
 9                  if (n is a coupled mode) then
10                      Uk(n) ← updateCoupledFactor(CB(n), n)
11                  else
12                      Uk(n) ← updateFactor(Bk(n), n)
13          else
14              foreach mode n ∈ [2, Nk] do
15                  Uk(n) ← updateFactor(Bk(n), n)
16      Compute L following Eq. (10.1)
17  until ((PreL − L) / PreL < ε)

10.5 Performance Evaluation

SMF was implemented in Scala and tested on Apache Spark 1.6.0¹ with the Yarn scheduler² from Hadoop 2.7.1. In our experiments, we compare the performance of SMF with existing distributed algorithms to answer the following questions: (1) How scalable is SMF with respect to the number of observations and the number of machines? (2) How fast does SMF converge? (3) What level of accuracy does SMF achieve? (4) How does the closed-form solution perform compared to the widely chosen gradient-based methods?

All the experiments were executed on a cluster of nine nodes, each having a 2.8 GHz CPU with 8 cores and 32 GB RAM. Since SALS [23] and SCouT [13] were shown to be significantly better than FlexiFact [5], we included comparisons with SALS and SCouT and omitted FlexiFact. CMTF-OPT [3] was also run on one of our nodes. We used the publicly available 22.5 M (i.e., 22.5 million observations) movie ratings with movie genre information in MovieLens [11], 100 M Netflix movie ratings [1], and 718 M song ratings coupled with song-artist-album information from the Yahoo! Music dataset [2]. All ratings range from 0.2 to 1, equivalent to 1–5 stars. When evaluated as a missing value completion (rating recommendation) problem, about 80% of the observed data was used for training and the rest for testing. The details of our datasets are summarized in Table 10.2.

To validate the scalability of SMF, we generated four synthetic datasets with different observation densities, as summarized in Table 10.2. We measured the scalability of SMF with respect to the number of observations and machines.

10.5.1 Observation Scalability

Figure 10.5 compares the observation scalability of SALS, CMTF-OPT, and SMF (in the case of TF of the main tensor X, Fig. 10.5a) and of SCouT, CMTF-OPT, and SMF (for CMTF of the tensor X and the additional matrix Y, Fig. 10.5b). As shown in Fig. 10.5a, the performance of SALS is similar to ours when the number of observations is from 1 M to 10 M. However, SALS performs worse as the observed data size becomes larger. Specifically, when the observed data increases 10× (i.e., 10 times) from 100 M to 1 B, SALS's running time per iteration slows down 10.36×, which is 151% of SMF's slowdown rate. As for CMTF, SMF significantly outperforms SCouT, being 73× faster. CMTF-OPT achieves similar performance for the 1 M dataset, but it experiences "out-of-memory" when dealing with the other larger datasets.
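The slowdown figures can be cross-checked with a little arithmetic. The Python sketch below assumes that the 151% figure means SALS's slowdown is 1.51 times SMF's slowdown over the same tenfold data increase; under that reading, SMF's own slowdown follows by division, and the ratio should match the ratio of the two speed advantages reported in the caption of Fig. 10.5 (4.17 at 1 B and 2.76 at 100 M observations). The variable and function names are illustrative.

```python
def implied_slowdown(reference_slowdown: float, ratio_percent: float) -> float:
    """Slowdown of the faster method when the reference is ratio_percent of it."""
    return reference_slowdown / (ratio_percent / 100.0)


SALS_SLOWDOWN = 10.36      # per-iteration time ratio, 1 B vs 100 M observations
RATIO_PERCENT = 151.0      # SALS slowdown as a percentage of SMF's slowdown
ADVANTAGE_AT_1B = 4.17     # SMF vs SALS at 1 B observations (Fig. 10.5 caption)
ADVANTAGE_AT_100M = 2.76   # SMF vs SALS at 100 M observations (Fig. 10.5 caption)

smf_slowdown = implied_slowdown(SALS_SLOWDOWN, RATIO_PERCENT)
advantage_growth = ADVANTAGE_AT_1B / ADVANTAGE_AT_100M

print(round(smf_slowdown, 2), round(advantage_growth, 2))
```

The growth of SMF's advantage between the two data sizes equals the ratio of the two slowdowns, so the two reported sets of numbers are consistent with each other under this interpretation.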

¹ Apache Spark, http://spark.apache.org/
² Yarn scheduler, https://hadoop.apache.org/docs/r2.7.1/hadoop-yarn/hadoop-yarn-site/FairScheduler.html

Table 10.2 Data for experiments

Dataset           Tensor  |Ω|train  |Ω|test  I1       I2         I3
MovieLens [11]    X       18 M      4.5 M    34,208   247,753    7
                  Y       649 K     –        34,208   19         –
Netflix [1]       X       80 M      20 M     480,189  17,770     2182
Yahoo! Music [2]  X       700 M     18 M     136,736  1,823,179  9442
                  Y       136 K     –        136,736  20,543     –
Synthetic (1)     X       1 M       –        100 K    100 K      100 K
                  Y       100       –        100 K    100 K      –
Synthetic (2)     X       10 M      –        100 K    100 K      100 K
                  Y       1 K       –        100 K    100 K      –
Synthetic (3)     X       100 M     –        100 K    100 K      100 K
                  Y       10 K      –        100 K    100 K      –
Synthetic (4)     X       1 B       –        100 K    100 K      100 K
                  Y       100 K     –        100 K    100 K      –

[Figure 10.5: (a) SMF, SALS and CMTF-OPT; (b) SMF, SCouT and CMTF-OPT. x-axis: Observation (million); y-axis: Time/iter (sec).]

Fig. 10.5 Observation scalability. SMF scales better as the number of known entries increases. (a) Only X is factorized. SMF is 4.17× and 2.76× faster than SALS for 1 B and 100 M observations, respectively. (b) Both X and Y are jointly analyzed. SMF consistently outperforms SCouT by a factor of over 70× in all test cases. In both cases, CMTF-OPT runs out of memory on datasets with more than 1 M observations

10.5.2 Machine Scalability

We measure how much faster each algorithm runs as more computational power is added to the cluster. The Synthetic (3) dataset with 100 M observations is used in this test (Fig. 10.6). The speedup rate is T3/TM, where T3 is the time an algorithm takes on three machines and TM is its time on M machines (in this test, M is 6 and 9). In general, SMF achieves a speedup similar to that of SALS and much higher than that of SCouT.
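As a small worked example of the speedup rate T3/TM, the sketch below uses made-up running times; the numbers and the function name are hypothetical and are not measurements from the chapter.

```python
def speedup(time_on_three, time_on_m):
    """Speedup rate T3 / TM relative to the three-machine baseline."""
    if time_on_three <= 0 or time_on_m <= 0:
        raise ValueError("running times must be positive")
    return time_on_three / time_on_m

# Hypothetical per-iteration running times in seconds, keyed by machine count.
times = {3: 900.0, 6: 500.0, 9: 360.0}
rates = {m: speedup(times[3], t) for m, t in times.items()}
```

With these illustrative numbers the rate is 1.0 on three machines by construction, and an ideal linear scale-up would give 2.0 on six machines and 3.0 on nine.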

[Figure 10.6: (a) SMF and SALS; (b) SMF and SCouT. x-axis: Machine (3, 6, 9); y-axis: Speedup (T3/TM).]

Fig. 10.6 Machine scalability with 100 M synthetic dataset. In (a) only X is factorized. In (b), both X and Y are jointly analyzed. SMF speeds up to a rate which is similar to SALS and to a much higher rate than SCouT

[Figure 10.7: (a) Time (sec) vs. Iteration; (b) RMSE vs. Iteration. Curves: SMF and SALS.]

Fig. 10.7 Factorization speed (a) and training RMSE per iteration (b) of the tensor factorization of X in MovieLens

10.5.3 Convergence Speed

This section investigates how fast SMF converges, benchmarked against SALS and SCouT on the three real-world datasets. As observed in Figs. 10.7, 10.8, and 10.9, as the tensor size grows from MovieLens (Fig. 10.7) to Netflix (Fig. 10.8) and to Yahoo! Music (Fig. 10.9), the advantage of SMF over SALS increases. SMF eliminates all data streaming from local disk to memory, which improves its efficiency significantly, especially for large-scale data. Specifically, SMF outperforms SALS by 3.8× with 700 M observations (Yahoo! Music) and by 2× with 80 M observations (Netflix). This result, combined with the fact that SMF is 4.17× faster than SALS on the 1 B synthetic dataset, strongly suggests that SMF is the fastest tensor factorization algorithm for large-scale datasets.

While Figs. 10.7, 10.8, and 10.9 show single tensor factorization results, Figs. 10.10 and 10.11 provide further empirical evidence that SMF performs fast coupled matrix tensor factorization for the joint analysis of heterogeneous datasets. In this case, only MovieLens and Yahoo! Music are used, as the Netflix dataset does not have side information. SMF surpasses SCouT, the current fastest CMTF algorithm, by 17.8× on the Yahoo! Music dataset.

[Figure 10.8: (a) Time (sec) vs. Iteration; (b) RMSE vs. Iteration. Curves: SMF and SALS.]

Fig. 10.8 Factorization speed (a) and training RMSE per iteration (b) of the tensor factorization of X in Netflix

[Figure 10.9: (a) Time (sec) vs. Iteration, annotated 3.8x; (b) RMSE vs. Iteration. Curves: SMF and SALS.]

Fig. 10.9 Factorization speed (a) and training RMSE per iteration (b) of the tensor factorization of X in Yahoo! Music

[Figure 10.10: (a) Time (sec) vs. Iteration; (b) RMSE vs. Iteration. Curves: SMF and SCouT.]

Fig. 10.10 Factorization speed (a) and training RMSE per iteration (b) of the coupled matrix tensor factorization of X and Y in MovieLens

[Figure 10.11: (a) Time (sec) vs. Iteration, annotated 17.8x; (b) RMSE vs. Iteration. Curves: SMF and SCouT.]

Fig. 10.11 Factorization speed (a) and training RMSE per iteration (b) of the coupled matrix tensor factorization of X and Y in Yahoo! Music

Table 10.3 Accuracy of each algorithm on the real-world datasets

           TF                           CMTF
Algorithm  MovieLens  Netflix  Yahoo    MovieLens  Yahoo
SALS       0.1695     0.1751   0.2396   –          –
SCouT      –          –        –        0.7110     0.7365
SMF        0.1685     0.1749   0.2352   0.1676     0.2349

Decomposed factors are used to predict missing entries. Accuracy is measured as RMSE on the test sets

10.5.4 Accuracy

In addition to having the fastest convergence speed, SMF also recovers missing entries with the highest accuracy. Table 10.3 lists the prediction results on the test sets. Note that SALS does not support CMTF and SCouT does not support TF. In this test, SALS is almost as good as SMF at recovering missing entries, while SCouT performs much worse. For SMF, the CMTF results are also slightly better than the TF results, which shows the benefit of using coupled information in the factorization process.
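The accuracy measure can be sketched as below: decomposed factors predict held-out entries, and the RMSE summarises the prediction error. This is a minimal sketch assuming a CP (PARAFAC) model, in which an entry is the sum over rank components of the product of the corresponding factor-row elements; the factor values and function names are hypothetical.

```python
import math

def predict_entry(factors, index):
    """CP prediction: sum over rank r of the product of the indexed factor rows."""
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        term = 1.0
        for mode, i in enumerate(index):
            term *= factors[mode][i][r]
        total += term
    return total

def rmse(factors, test_entries):
    """Root-mean-square error over held-out (index, value) pairs."""
    squared = [(value - predict_entry(factors, index)) ** 2
               for index, value in test_entries]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical rank-2 factors for a 2 x 2 x 2 tensor.
U1 = [[1.0, 0.0], [0.5, 0.5]]
U2 = [[0.4, 0.2], [0.6, 1.0]]
U3 = [[1.0, 1.0], [0.5, 2.0]]
test_entries = [((0, 0, 0), 0.4), ((1, 1, 1), 1.0)]
error = rmse([U1, U2, U3], test_entries)
```

Only the observed test entries enter the sum, so the cost of the evaluation grows with the number of held-out observations rather than with the full tensor size.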

10.5.5 Optimization

We benchmark different optimizers in this section. Instead of computing each row of a factor in closed form, as in line 4 of updateFactor() or line 6 of updateCoupledFactor(), we update it with two gradient-based optimizers: nonlinear conjugate gradient (called SMF-NCG) with the Moré–Thuente line search [19] and gradient descent (called SMF-GD). All the optimizers stop when ε < 10⁻⁴. These results are compared with the closed-form solution (hereafter called SMF-CF) in Fig. 10.12. The results demonstrate that the closed-form solution generally converges to the lowest RMSE faster than conventional gradient-based optimizers. The test-set results in Table 10.4 also confirm that SMF-CF achieves the highest accuracy (lowest RMSE).

[Figure 10.12: RMSE vs. Time (sec) for panels (a), (b) and (c). Curves: SMF-CF, SMF-NCG and SMF-GD.]

Fig. 10.12 Benchmark of different optimization methods for (a) MovieLens, (b) Netflix and (c) Yahoo! Music datasets. In all cases, SMF-CF quickly reaches the lowest RMSE

Table 10.4 Accuracy of predicting missing entries on real-world datasets with different optimizers. SMF's optimized solver achieves the lowest test RMSE under the same stopping condition

Optimizer  MovieLens  Netflix  Yahoo
SMF-CF     0.1672     0.1749   0.2352
SMF-NCG    0.1783     0.1904   0.2368
SMF-GD     0.1716     0.1765   0.2387
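To illustrate the difference between a closed-form row update and a gradient-based one, the sketch below solves the same small regularised least-squares problem both ways. It is not the chapter's updateFactor() routine: the objective (squared error plus an L2 penalty), the step size, the gradient-norm stopping test and all names are assumptions chosen for illustration.

```python
def normal_equations(rows, targets, lam):
    """Build G = A^T A + lam * I and b = A^T x for one factor row."""
    rank = len(rows[0])
    gram = [[lam if i == j else 0.0 for j in range(rank)] for i in range(rank)]
    rhs = [0.0] * rank
    for a, x in zip(rows, targets):
        for i in range(rank):
            rhs[i] += a[i] * x
            for j in range(rank):
                gram[i][j] += a[i] * a[j]
    return gram, rhs

def solve_closed_form(gram, rhs):
    """Solve G u = b by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(gram, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    u = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * u[j] for j in range(i + 1, n))
        u[i] = (aug[i][n] - tail) / aug[i][i]
    return u

def solve_gradient_descent(gram, rhs, step=0.05, tol=1e-4, max_iter=10000):
    """Minimise the same objective iteratively; returns (u, iterations used)."""
    n = len(rhs)
    u = [0.0] * n
    for it in range(1, max_iter + 1):
        # Gradient of sum (x - a.u)^2 + lam * |u|^2 is 2 * (G u - b).
        grad = [2.0 * (sum(gram[i][j] * u[j] for j in range(n)) - rhs[i])
                for i in range(n)]
        u = [ui - step * gi for ui, gi in zip(u, grad)]
        if max(abs(g) for g in grad) < tol:
            return u, it
    return u, max_iter

# Hypothetical data: three observations that involve this row, rank 2.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [1.0, 2.0, 3.0]
gram, rhs = normal_equations(rows, targets, lam=0.1)
u_cf = solve_closed_form(gram, rhs)
u_gd, iterations = solve_gradient_descent(gram, rhs)
```

The closed-form solver reaches the minimiser in one linear solve, whereas gradient descent needs many passes and a tuned step size to approach the same point.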

10.6

Conclusion

In this chapter, we introduced SMF, a solution for the large-scale factorization of correlated multimodal data, and demonstrated its impact on the multimodal factorization problem. SMF's data parallelism model eliminates the huge data transmission overhead. Its design and optimized solver keep the algorithm's performance stable as the data size increases. Our validation shows that SMF scales best among the compared methods with respect to the number of tensors, observations, and machines. The experiments also demonstrate SMF's effectiveness in terms of convergence speed and accuracy on real-world datasets. These advantages suggest that SMF's design principles are promising building blocks for large-scale multimodal factorization methods.

References

1. Netflix's movie ratings dataset. http://www.netflixprize.com/
2. Yahoo! research webscope's music user ratings of musical artists datasets. http://research.yahoo.com/
3. Acar, E., Kolda, T.G., Dunlavy, D.M.: All-at-once optimization for coupled matrix and tensor factorizations. arXiv:1105.3422 (2011)
4. Anandkumar, A., Ge, R., Hsu, D., Kakade, S.M., Telgarsky, M.: Tensor decompositions for learning latent variable models. J. Mach. Learn. Res. 15(1), 2773–2832 (2014)
5. Beutel, A., Kumar, A., Papalexakis, E.E., Talukdar, P.P., Faloutsos, C., Xing, E.P.: Flexifact: scalable flexible factorization of coupled tensors on hadoop. In: Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 109–117 (2014)
6. Bhargava, P., Phan, T., Zhou, J., Lee, P.: Who, what, when, and where: multi-dimensional collaborative recommendations using tensor factorization on sparse user-generated data. In: Proceedings of the 24th International Conference on World Wide Web (WWW ’15), pp. 130–140 (2015)
7. Choi, J., Kim, Y., Kim, H.-S., Choi, I.Y., Yu, H.: Tensor-Factorization-Based Phenotyping using Group Information: Case Study on the Efficacy of Statins. In: Proceedings of the 8th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics (ACM-BCB ’17), pp. 516–525 (2017)
8. Diao, Q., Qiu, M., Wu, C.-Y., Smola, A.J., Jiang, J., Wang, C.: Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS). In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’14), pp. 193–202 (2014)
9. Do, Q., Liu, W.: ASTEN: an accurate and scalable approach to Coupled Tensor Factorization. In: Proceedings of the 2016 International Joint Conference on Neural Networks (IJCNN), pp. 99–106 (2016)
10. Gemulla, R., Nijkamp, E., Haas, P.J., Sismanis, Y.: Large-scale matrix factorization with distributed stochastic gradient descent. In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge discovery and Data Mining (KDD ’11), pp. 69–77 (2011)
11. Maxwell Harper, F., Konstan, J.A.: The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5(4), 1–19 (2015)
12. Harshman, R.A.: Foundations of the PARAFAC procedure: models and conditions for an ‘Explanatory’ multi-modal factor analysis. UCLA Work. Pap. Phon. 16, 1–84 (1970)
13. Jeon, B.S., Jeon, I., Lee, S., Kang, U.: SCouT: Scalable coupled matrix-tensor factorization – algorithm and discoveries. In: Proceedings of the 2016 IEEE 32nd International Conference on Data Engineering (ICDE), pp. 811–822 (2016)
14. Kang, U., Papalexakis, E., Harpale, A., Faloutsos, C.: GigaTensor: scaling tensor analysis up by 100 times – algorithms and discoveries. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’12), pp. 316–324 (2012)
15. Kolda, T.G., Bader, B.W.: Tensor decompositions and applications. SIAM Rev. 51(3), 455–500 (2009)
16. Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer. 42(8), 30–37 (2009)
17. Lahat, D., Adali, T., Jutten, C.: Multimodal data fusion: an overview of methods, challenges, and prospects. Proc. IEEE. 103(9), 1449–1477 (2015)
18. Le, Q.V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., Ng, A.Y.: On optimization methods for deep learning. In: Proceedings of the 28th International Conference on International Conference on Machine Learning (ICML ’11), pp. 265–272 (2011)
19. Moré, J.J., Thuente, D.J.: Line search algorithms with guaranteed sufficient decrease. ACM Trans. Math. Softw. 20(3), 286–307 (1994)
20. Papalexakis, E.E., Faloutsos, C., Mitchell, T.M., Talukdar, P.P., Sidiropoulos, N.D., Murphy, B.: Turbo-smt: accelerating coupled sparse matrix-tensor factorizations by 200x. In: SIAM International Conference on Data Mining (SDM), pp. 118–126 (2014)
21. Papalexakis, E.E., Faloutsos, C., Sidiropoulos, N.D.: Parcube: Sparse parallelizable tensor decompositions. In: Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference (ECML PKDD), pp. 521–536 (2012)
22. Shi, J., Qiu, Y., Minhas, U.F., Jiao, L., Wang, C., Reinwald, B., Özcan, F.: Clash of the titans: MapReduce vs. Spark for large scale data analytics. Proc. VLDB Endow. 8(13), 2110–2121 (2015)
23. Shin, K., Kang, U.: Distributed methods for high-dimensional and large-scale tensor factorization. In: Proceedings of the 2014 IEEE International Conference on Data Mining, Shenzhen, 2014, pp. 989–994 (2014)
24. Singh, A.P., Gordon, G.J.: Relational learning via collective matrix factorization. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’08), pp. 650–658 (2008)
25. Wang, Y., Chen, R., Ghosh, J., Denny, J.C., Kho, A., Chen, Y., Malin, B.A., Sun, J.: Rubik: knowledge guided tensor factorization and completion for health data analytics. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’15), pp. 1265–1274 (2015)
26. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX conference on Hot topics in cloud computing (HotCloud ’10). USENIX Association, Berkeley, pp. 10–10 (2010)
27. Zou, B., Li, C., Tan, L., Chen, H.: GPUTENSOR: efficient tensor factorization for context-aware recommendations. Inf. Sci. 299, 159–177 (2015)

Part V

Multimodal Big Data Processing and Applications

Chapter 11

Big Multimodal Visual Data Registration for Digital Media Production

Hansung Kim and Adrian Hilton

Abstract Modern digital media production relies on various heterogeneous sources of supporting data (snapshots, LiDAR, HDR and depth images) as well as videos from cameras. Recent developments in camera and sensing technology have led to huge amounts of digital media data. Managing and processing this heterogeneous data consumes enormous resources. In this chapter, we present a multimodal visual data registration framework. A new feature description and matching method for multimodal data is introduced, considering local/semi-global geometry and colour information in the scene for more robust registration. Combined 2D/3D visualisation of the registered data allows an integrated overview of the entire dataset. The proposed framework is tested on multimodal datasets from film and broadcast production, which are made publicly available. The resulting automated registration of multimodal datasets supports more efficient creative decision making in media production, enabling data visualisation, search and verification across a wide variety of assets.

11.1

Introduction

Visual sensing technology has developed over recent decades and led to various 2D/3D multimedia data acquisition devices which are widely available in our daily lives. Recent digital media productions deal with big data captured not only from video or photography but also from various digital sensors in order to recover the full geometry of the scene or target objects. Digital film production creates final movie frames by combining data captured on the film set with additional metadata created during post-production. To design this additional metadata, a large number of assets must be captured on the set. For example, replacing something in the frame with computer-graphics-rendered objects requires at least a texture reference, information on the principal camera and lens pose, high dynamic range (HDR) lighting information and Light Detection and Ranging (LIDAR) scans of the set.

H. Kim (*) · A. Hilton, Centre for Vision Speech and Signal Processing, University of Surrey, Guildford, UK
© Springer Nature Switzerland AG 2019. K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_11

The appearance of a scene is captured using different video cameras, from mobile phone cameras to professional High-definition (HD)/4K/8K cameras. Kinect-like RGBD sensors or Time-of-flight (ToF) cameras can capture video-rate depth information, while 3D laser scanners create a dense and accurate point cloud of the scene. Spherical imaging sensors capture full 360° texture and illumination data for backplates and relighting. There can be other data sources such as video capture using drones or large collections of photos taken with digital single-lens reflex (DSLR) cameras. The result is an ocean of unstructured media data which is hard to search, arrange and manage efficiently.

In digital production, it is typical for a single film to use >1PB of storage for media footage, and the requirements are increasing year-on-year. 350TB was allocated to the footage from various capture devices for the production of John Carter of Mars (2012), and Avengers: Age of Ultron (2015) is reported to have required >1PB of storage. The types of data that are typically captured using visual sensors for film production, games, VR experience and TV production are shown in Table 11.1.

Table 11.1 Examples of data types generated in film production

Data              Device            Format       Dimension       Used in
Principal camera  4K/HD Camcorder   DPX/RAW      2D+Time         All
Witness cameras   HD Camcorder      H.264/MP4    2D+Time         Animation
Motion capture    Xsens MOVEN2      Joint Angle  3D              Animation/Rigging
Texture ref.      DSLR camera       RAW/JPG      2D              Modelling/Texturing
Spherical HDR     Spheron           EXR          2D (Spherical)  Lighting/Modelling
LIDAR scans       Leica/FARO        Point cloud  3D              Modelling/FX

While data storage is cheaper than ever, all of this data needs to be sorted, indexed and processed, which is a largely manual task that keeps many artists busy for weeks during production. Moreover, datasets exist in different domains with different formats, characteristics and sources of error. Smart tools are needed to make this processing more efficient and free up human resources. This motivates research on seamless integration of sensor data into a unified coordinate system to improve the quality and efficiency of 3D data management. Starck et al.
previously presented a multiple HD video camera system for studio production [1], which addressed the registration of multiple cameras to the world coordinate system through calibration for 3D video production of actor performance. This has been extended to outdoor capture by combining multiple HD cameras and a spherical camera [2]. Dynamic objects captured by HD video cameras and static background scenes scanned by a spherical camera were registered to the world coordinate system. In this chapter, we discuss a further extension of the capture system to allow automatic registration of the wide variety of visual data capture devices typically used in production.

An essential issue is to validate data collection at the point of capture by automatically registering multimodal visual data into a common coordinate system to verify the completeness of the data. Historically, each type of data was usually processed on its own, e.g. LIDAR data was merged and converted into mesh representations, HDR photography was prepared for use in image-based lighting further down the pipeline, and so on. Registration of assets into a single reference frame only took place later, which could lead to problems if there were errors in the raw data such as gaps in scene coverage. The work presented in this chapter takes a different approach, in that it explicitly takes advantage of the multimodal nature of the captured data. For example, reference photographs are taken not in arbitrary locations but in the same space that is covered by LIDAR data. Being able to register the reference footage in the same coordinate system as the LIDAR data has many benefits, e.g. using the image for projection mapping. It also allows detection of errors in the raw data early in the processing pipeline, when there may still be time for correction.

The task of handling 3D data is not merely a case of extending the dimensionality of existing 2D image processing. Data matching and registration is more difficult because 3D data can exist in different domains with different formats, characteristics, densities and sources of error. In this chapter, a unified 3D space (Fig. 11.1a) where 2D and 3D data are registered for efficient data management is introduced. 2D data are registered via 3D reconstruction because direct registration of 2D to 3D structure [3, 4] is difficult for general multimodal data registration. We assume that multiple 2D images are available for the same scene so that 3D structure can be recovered from the dataset. Once the reconstructed 3D structure is registered to a target 3D dataset, the original 2D data can be automatically registered because the capture positions of the 2D device are defined as calibration parameters in the reconstruction, as shown in Fig. 11.1b.

This chapter is organized as follows. Section 11.2 discusses related work on multimodal visual data registration and on feature detectors and descriptors. Section 11.3 discusses various types of input modalities (active and passive sensors). The system overview for multimodal data registration and visualisation is given in Sect. 11.4. Section 11.5 presents a detailed discussion of our proposed multimodal data registration approach. Section 11.6 discusses a public multimodal database used to validate our work. Experimental results are given in Sect. 11.7, and Sect. 11.8 gives concluding remarks.

11.2

Related Work

11.2.1 Multimodal Visual Data Registration

In general visual media processing, there has been research on 2D/3D data matching and registration via Structure-from-Motion (SfM) and feature matching for a single modality. Sattler et al. proposed a method to register a 2D query image to a reconstruction of large-scale scenes using SfM [5]. They implemented a two-way matching scheme for 2D-to-3D and 3D-to-2D using the SIFT descriptor and RANSAC-based matching. 2D-to-3D registration between two different modalities such as images/LIDAR [4, 6, 7], images/range sensor [8, 9] and spherical images/LIDAR [10, 11] has also been investigated. Stamos et al. integrated 2D-to-3D and 3D-to-3D registration techniques [12]. They registered 2D images to a dense point cloud from range scanners by reconstructing a sparse point cloud using an SfM method. 3D lines and circular feature matching were used for registration. However, little research has been carried out on 2D and 3D data registration for three or more visual modalities in general environments.

The work introduced in this chapter investigates 3D feature descriptors and their application to different domains (local, keypoint and colour domains) to evaluate and verify the influence of colour and feature geometry on multimodal data registration. A full 2D/3D multimodal data registration pipeline using multi-domain feature description and hybrid RANSAC-based registration is also introduced. Both are tested on public multimodal datasets, and an objective analysis of matching and registration performance is provided.

Fig. 11.1 Multimodal data registration. (a) Overview of multimodal data registration. (b) Example of multimodal data registration (Left: Multiple photographs and their 3D reconstruction, Right: Registration to LIDAR coordinate system)

11.2.2 Feature Detector and Descriptors

Feature detection identifies distinct points in terms of variation in the data, such as geometric or appearance changes. Various keypoint detectors have been developed and evaluated for distinctiveness and repeatability on 3D data [13, 14]. However, most of the best-performing detectors in the literature are not suitable for multimodal data because multimodal sources have different colour distributions and geometric errors according to the characteristics of the capture device. We found that classic detectors which produce a relatively large number of evenly distributed keypoints, such as the Kanade–Tomasi detector [15] and SIFT [16], are more suitable for multimodal data, which have a high outlier ratio.

Feature descriptors describe the characteristics of feature points. Restrepo and Mundy evaluated the performance of various local 3D descriptors for models reconstructed from multiple images [17]. They reconstructed urban scenes with a probabilistic volumetric modelling method and applied different descriptors for object classification to find the best descriptor. However, previous research has focused only on registration of single-modality data. Guo et al. [18, 19] provided a comprehensive evaluation of feature descriptors on various datasets from different modalities, but the test was not cross-modal: descriptors were compared only within the same modality in each dataset. We carried out a similar evaluation on cross-modal datasets [20] and concluded that the Fast Point Feature Histograms (FPFH) [21] and Signature of Histograms of Orientations (SHOT) [22] descriptors show the best performance in multimodal data registration.

Tombari et al. [23] proposed the CSHOT descriptor, which combines local shape and colour information in the conventional SHOT descriptor. Alexandre [24] proposed combining the PFH descriptor with BGR colour data and confirmed, by testing both the CSHOT and PFHRGB descriptors, that combining local geometry and colour information is helpful in 3D feature matching. However, colour information is not always reliable in multimodal 3D data because it is difficult to balance colour histograms between modalities. Appearance information cannot be trusted for non-Lambertian surfaces or repetitive patterns. Concatenation of different descriptors may lead to poor performance when the matching is dominated by one descriptor. The influence of colour information on 3D multimodal data registration has been investigated in [25]. In this chapter, we introduce a novel feature matching and registration algorithm that adaptively considers multiple feature elements.

11.3

Input Modalities

Sensor technologies for scene acquisition are classiﬁed into two categories: active sensing using laser/infra-red (IR) sensors and passive methods using normal photographic or video devices. We consider a wide range of 2D/3D and active/passive sensors in this work.

11.3.1 LIDAR Scan

LIDAR is one of the most popular depth-ranging techniques for acquiring 3D scene geometry. LIDAR measures distance by the time delay between transmission and reflection of a light pulse signal. It is the most accurate depth-ranging device, but conventional LIDAR scanners could retrieve only a point cloud set without colour or connectivity. Some recent LIDAR devices provide a coloured 3D mesh structure by mapping photos taken simultaneously during the scan. We use a FARO Focus 3D X130¹ to scan coloured 3D point clouds in this work. Multiple scans acquired from different viewpoints are merged into a complete scene structure using the software provided with the device.
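The time-delay principle can be written as a one-line range equation, d = c·Δt/2. The snippet below is an idealised sketch of that relation only; the function name and the example delay are hypothetical, and it does not model any particular scanner.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second, in vacuum

def range_from_delay(delay_seconds):
    """Distance to the reflecting surface from the round-trip pulse delay."""
    if delay_seconds < 0:
        raise ValueError("delay must be non-negative")
    # The pulse travels to the surface and back, hence the division by two.
    return SPEED_OF_LIGHT * delay_seconds / 2.0

# A hypothetical round-trip delay of 100 nanoseconds.
distance_m = range_from_delay(100e-9)
```

A delay of 100 ns therefore corresponds to a surface roughly 15 m away, which shows how fine the timing resolution must be for accurate ranging.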

11.3.2 Photographs

Digital still cameras are among the most common devices for acquiring scene information. Multiple camera calibration and image-based 3D reconstruction have been actively researched for a long time. Multiple photographs can be localised in a 3D space by registering the 3D model reconstructed from those photos, because the camera poses are estimated during the reconstruction process. Bundler [26] followed by PMVS [27] provides a 3D reconstruction with camera poses from multiple photos. RECAP² by Autodesk also provides online image-based 3D reconstruction.

11.3.3 Videos

If a single video camera is used, the same approach as in Sect. 11.3.2 can be utilised, because image frames from a moving camera can be considered a series of multi-view images. In the case of multiple wide-baseline fixed video cameras, it is difficult to obtain a 3D model for automatic registration if the camera viewpoints do not have sufficient overlap. In this case, camera poses can be estimated by wand-based calibration [28] aligned to the origin of the LIDAR sensor.

¹ FARO, http://www.faro.com/en-gb/products/construction-bim-cim/faro-focus/
² Autodesk RECAP, https://recap.autodesk.com/

11.3.4 RGBD Video

RGB+Depth (RGBD) cameras are increasingly popular for reconstructing medium-sized static indoor scenes. Though IR interference limits their validity in outdoor environments, they are still useful in indoor or shaded outdoor areas. The KinectFusion system [29] demonstrated its effectiveness in scene reconstruction from an RGBD video sequence by camera pose estimation and tracking. It has been extended to more robust methods using SLAM-based or depth map fusion approaches [30–32]. The Xtion PRO camera³ is used to acquire an RGBD video stream of the scene in our work.

11.3.5 360° Cameras

Omnidirectional spherical (360°) imaging is also tested in this research because it is commonly used to acquire texture maps or to measure lighting conditions. A 360° camera captures the full surrounding scene visible from the camera location, but it always requires post-processing to map the image in spherical coordinates to other images captured in a different coordinate system [33]. The classic way to capture the full 3D space instantaneously is to use a catadioptric omnidirectional camera using an ellipsoidal mirror combined with a CCD. However, the catadioptric camera is difficult to calibrate and has limited resolution. Recently, inexpensive off-the-shelf 360° cameras have become popular in our daily lives⁴,⁵ and various 3D reconstruction methods for 360° images have been proposed [34, 35]. We use a commercial off-the-shelf line-scan camera, Spheron,⁶ with a fisheye lens in order to capture the full environment as an accurately aligned high-resolution equi-rectangular image. We assume that the scene is captured as vertical stereo pairs to allow dense stereo reconstruction of the surrounding scene for automatic registration. In order to get the complete scene structure, multiple stereo pairs can be captured and merged into one structured scene.
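To make the spherical-coordinate mapping concrete, the sketch below converts a pixel of an equi-rectangular image into a unit viewing direction and, given a depth, into a 3D point in the camera frame. The axis convention, the half-pixel offset and the function names are assumptions for illustration; they are not the convention used by any specific camera or by the chapter.

```python
import math

def pixel_to_direction(col, row, width, height):
    """Unit viewing direction for a pixel of an equirectangular image.

    Assumed convention: longitude spans [-pi, pi) left to right, latitude
    spans [pi/2, -pi/2] top to bottom, y is up and z is the forward axis.
    """
    longitude = ((col + 0.5) / width - 0.5) * 2.0 * math.pi
    latitude = (0.5 - (row + 0.5) / height) * math.pi
    x = math.cos(latitude) * math.sin(longitude)
    y = math.sin(latitude)
    z = math.cos(latitude) * math.cos(longitude)
    return (x, y, z)

def pixel_to_point(col, row, width, height, depth):
    """3D point in the camera frame given a depth (range) along the ray."""
    return tuple(depth * c for c in pixel_to_direction(col, row, width, height))

# Hypothetical 4096 x 2048 panorama: the image centre looks straight ahead.
centre = pixel_to_direction(2047.5, 1023.5, 4096, 2048)
```

Once every pixel with a stereo-derived depth is lifted to a 3D point in this way, the spherical image can be treated as a point cloud like the other modalities.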

³ XtionPRO, https://www.asus.com/3D-Sensor/Xtion_PRO/
⁴ Ricoh Theta S, https://theta360.com/en/
⁵ Samsung Gear 360, http://www.samsung.com/global/galaxy/gear-360/
⁶ Spheron, http://spheron.com

11.3.6 Proxy Model

Simplified scene models are useful for understanding and representing the rough geometry of a scene with a small amount of data. Proxy models are used in a variety of areas such as augmented/virtual reality (AR/VR), pre-visualisation in film production, virtual maps and urban planning. They are normally generated using computer graphics, but there are some automatic algorithms for proxy model generation from images [36, 37]. SketchUp⁷ provides intuitive semi-automatic 3D model reconstruction using vanishing point alignment in images. We use an axis-aligned plane-based scene reconstruction from spherical images [38] for proxy model generation in the experiments. For feature detection and description, the planes are densely re-sampled to obtain a sufficient number of points for feature computation.

11.4

System Overview

The overall process for multimodal data registration and visualisation is illustrated in Fig. 11.2. Direct matching and registration of a 2D image to 3D structure is a difﬁcult problem due to differences in the information content between a 2D projection and the 3D scene structure. Therefore, we assume that 2D images are at least a stereo pair, video sequence or multiple images so that 3D geometric information can be extracted from the 2D images. Colour 3D point clouds are used as a common input format for 3D feature detection and matching. 3D data from 3D sensors or proxy (computer graphics) objects are directly registered, and 2D data are registered via 3D reconstruction techniques such as stereo matching or SfM. External camera parameters are extracted during 3D reconstruction so that the original camera poses are simultaneously transformed in registration.

Fig. 11.2 Pipeline for multimodal data registration and visualisation

7 SketchUp, http://www.sketchup.com/products/sketchup-pro

11

Big Multimodal Visual Data Registration for Digital Media Production

Point clouds from different modalities have different point densities, and some of them are irregularly distributed even within the same scene. For example, point clouds from LIDAR or spherical images become sparser with distance from the capture point, which biases feature detection and description. A 3D voxel grid filter, which samples points on a uniform 3D grid, is therefore applied to even out the density of the point clouds. Feature points (keypoints) are detected by combining the 3D Kanade–Tomasi detector [15] and the 3D SIFT detector [16] (details in Sect. 11.5.1). 3D features are extracted in multiple domains (local, semi-global and colour) as 2D vectors for each feature point (details in Sects. 11.5.2–11.5.4). The feature descriptors from different modalities are matched to find the optimised registration matrix that transfers the original 3D model to the target coordinate system (details in Sect. 11.5.5). The point cloud registration can then be refined over the whole point cloud using the Iterative Closest Point (ICP) algorithm [39].
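The voxel grid filtering step described above can be sketched in a few lines of NumPy. This is a minimal illustration that replaces all points in a voxel by their centroid; the function name and the centroid rule are assumptions for the example, not the implementation used in the chapter.

```python
import numpy as np

def voxel_grid_filter(points, voxel_size):
    """Return one point (the centroid) per occupied cubic voxel."""
    points = np.asarray(points, dtype=float)
    # Integer voxel index of every point along x, y and z.
    idx = np.floor(points / voxel_size).astype(np.int64)
    # Group points that share the same voxel index.
    _, inverse, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)
    # Accumulate the coordinates per voxel and divide by the point count.
    sums = np.zeros((counts.size, points.shape[1]))
    np.add.at(sums, inverse, points)
    return sums / counts[:, None]

cloud = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [1.5, 0.1, 0.1]])
filtered = voxel_grid_filter(cloud, voxel_size=1.0)
```

Dense regions collapse to one point per voxel while sparse regions are left almost unchanged, which is what evens out the density across modalities.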

11.5 Multimodal Data Registration

11.5.1 3D Feature Detector

Keypoint detection is an essential step prior to feature description and matching. Keypoints are salient points which are distinctive among their neighbourhoods in their geometry and locality. Many 3D feature detectors have been developed and evaluated [13, 14]. In those evaluations, all keypoint detectors were assessed on accurate 3D models generated by computer graphics or single-modal sensors in terms of distinctiveness and repeatability. Distinctiveness describes how well the characteristics of a point support finding correct matches, while repeatability describes how reliably the same keypoints are detected in various environments. A few 3D keypoint detectors show high performance in both distinctiveness and repeatability [14]. However, a high ranking in those evaluations does not guarantee such high performance on multimodal datasets, which potentially include different types of geometrical errors, sampling densities and distortions. The Heat Kernel Signature (HKS) detector [40] shows good repeatability and distinctiveness in single-modal evaluations but is too selective to produce a sufficient number of repeatable feature points across modalities, owing to geometrical errors resulting from incomplete 3D reconstruction with outliers. Feature detectors that produce a relatively large number of evenly distributed keypoints are preferred for robust multimodal data registration.

The 2D Kanade–Tomasi detector [15] uses an eigenvalue decomposition of the covariance matrix of the image gradients. This 2D detector is extended to 3D for 3D keypoint detection in this work. 3D surface normals calculated within the volume radius rj are used instead of 2D edge information. Eigenvalues represent the principal surface directions, and the ratios of eigenvalues are used to detect 3D corners in the point cloud.

The SIFT feature detector [16] uses a Difference-of-Gaussian (DoG) filter to select scale-space extrema and refines the results with a Hessian eigenvalue test to eliminate low-contrast points and edge points. Parameters for the 3D SIFT feature detector are defined as (minimum scale Sm, number of octaves So, number of scales Ss).

We use a combination of the 3D Kanade–Tomasi detector and the 3D SIFT feature detector in this work. The 3D versions of the Kanade–Tomasi detector and the SIFT detector implemented in the open-source Point Cloud Library⁸ are utilised.
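To make the idea of a normal-based corner measure concrete, the sketch below computes a Tomasi-style response for one neighbourhood of unit surface normals. It assumes the uncentred second-moment matrix of the normals and takes the smallest eigenvalue as the response; both choices, and the function name, are assumptions for illustration and may differ from the detector actually used.

```python
import numpy as np

def tomasi_response_3d(normals):
    """Smallest eigenvalue of the second-moment matrix of unit normals.

    Assumption: the 3x3 matrix is the mean of n n^T over the neighbourhood.
    A flat patch or a straight edge gives a response near 0, whereas a
    corner where three differently oriented surfaces meet gives a
    clearly positive response.
    """
    n = np.asarray(normals, dtype=float)
    moment = n.T @ n / len(n)
    # eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
    return float(np.linalg.eigvalsh(moment)[0])

plane = np.tile([0.0, 0.0, 1.0], (10, 1))
corner = np.array([[1, 0, 0]] * 5 + [[0, 1, 0]] * 5 + [[0, 0, 1]] * 5, dtype=float)
print(tomasi_response_3d(plane), tomasi_response_3d(corner))
```

Thresholding this response over all points, followed by non-maximum suppression, would yield corner-like keypoints.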

11.5.2 3D Feature Descriptors

Most current 3D feature descriptors use local projection onto a 2D tangent plane or extend existing 2D descriptors to 3D. There is relatively little research on full 3D descriptors. Here, we introduce classic 3D descriptors that operate directly on 3D point clouds.

Spin Images (SI) [41] Spin Images encode surface properties in a local object-oriented system. The position of points within the support radius rj from a keypoint is described by two parameters in a cylindrical system using a single point basis constructed from the oriented point. A 2D histogram of points falling within a cylindrical volume is computed by a 2D plane spinning around the cylinder axis. The SI descriptor is generally represented as a 153-dimensional vector.

3D Shape Context (SC) [42] 3D shape context is an extension of the 2D shape context descriptor. The support region for SC is a sphere centred on the keypoint, with its north pole oriented along the surface normal. A volumetric region within the support radius rj is divided into bins equally spaced in the azimuth and elevation dimensions and logarithmically spaced along the radial dimension. Each bin accumulates counts weighted by the local point density for each point. The SC descriptor is generally represented as a 1980-dimensional vector.

SHOT [22] The SHOT descriptor is built on a repeatable local Reference Frame (RF) obtained from the eigenvalue decomposition of the scatter matrix of surface points within the support radius rj. Given the local RF, a signature structure is defined with an isotropic spherical grid. For each sector of the grid, normal histograms are defined, and the overall descriptor is calculated by the juxtaposition of these histograms. In the experiments, the number of spatial bins is set to 32 as suggested in [22], and the number of bins for the angle between normal vectors to 10. The SHOT descriptor additionally requires a 9-dimensional vector for the RF. As a result, the SHOT descriptor is represented as a 329-dimensional vector (32 × 10 + 9).
8 PCL, http://pointclouds.org/

PFH [21] PFH is calculated from the relationship between the points within the support radius rj and their estimated surface normals. Given two local 3D points p and q, PFH sets three unit vectors [u, v, w] and extracts four features [α, φ, θ, d], where α is an angle to the v axis, φ an angle to the u axis, θ a rotation on the uw plane and d the distance between the two points. To create the final PFH descriptor, the set of all quadruplets is binned into a histogram, resulting in a 125-dimensional vector.

FPFH [21] FPFH is a faster version of PFH that recycles previously computed feature histograms. FPFH accumulates Simplified Point Feature Histograms (SPFH) [21]. SPFH extracts a set of tuples [α, φ, θ] from a keypoint p and its k neighbouring local points {p_i}. The FPFH histogram is computed as a weighted sum of the neighbouring SPFH values as in Eq. (11.1). The weight ω_i is determined by the distance between the points p and p_i. In the experiments, the number of bins is set to 11 for each of α, φ and θ. Therefore, one FPFH descriptor can be represented as a vector with 33 bins.

$$\mathrm{FPFH}(p) = \mathrm{SPFH}(p) + \frac{1}{k}\sum_{i=1}^{k}\frac{1}{\omega_i}\,\mathrm{SPFH}(p_i) \qquad (11.1)$$
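The weighted sum of Eq. (11.1) can be written directly in NumPy. In this sketch the SPFH histograms are assumed to be precomputed, and the weight of each neighbour is taken to be its Euclidean distance to the keypoint; the function and argument names are hypothetical.

```python
import numpy as np

def fpfh_from_spfh(spfh_p, spfh_neighbours, distances):
    """Combine a keypoint's SPFH with those of its k neighbours (Eq. 11.1).

    spfh_p          : (33,) SPFH histogram of the keypoint p
    spfh_neighbours : (k, 33) SPFH histograms of the neighbours
    distances       : (k,) weights, assumed here to be the distances
                      between p and each neighbour (must be non-zero)
    """
    spfh_p = np.asarray(spfh_p, dtype=float)
    spfh_nb = np.asarray(spfh_neighbours, dtype=float)
    w = np.asarray(distances, dtype=float)
    k = len(w)
    # Each neighbour histogram is scaled by 1 / weight, then averaged over k.
    return spfh_p + (spfh_nb / w[:, None]).sum(axis=0) / k

fpfh = fpfh_from_spfh(np.ones(33), np.ones((2, 33)), [0.5, 1.0])
```

Closer neighbours have smaller weights and therefore contribute more strongly to the final 33-bin histogram.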

A performance evaluation of these descriptors on multimodal data registration is given in our preliminary research [20]. The performance of most descriptors is acceptable for indoor datasets with stable materials, lighting conditions and background, but FPFH works slightly better in terms of accuracy and speed. For outdoor scenes with a more variable environment, SHOT and FPFH show better performance. The SHOT descriptor is good at registering dense reconstructions with high accuracy. However, it is poor at proxy model registration and sometimes behaves unstably. The FPFH descriptor fails in the registration of pseudo-symmetric structures, but it shows relatively stable performance in general scene registration.

11.5.3 Description Domains

Most 3D feature descriptors rely only on local geometric or colour features. However, these descriptors are not suitable for multimodal data registration because input sources may have a high level of geometric reconstruction error or different colour histograms. Here, we introduce the domains in which feature descriptors can be generated.

Local Domain FPFH and SHOT are local descriptors that define the relationship of the keypoint with neighbouring 3D points within a certain volume radius. Surface normal vectors calculated from 3D points within the radius rl are used to compute descriptors. The performance of the descriptor largely depends on the choice of the radius rl. It should be large enough to include a sufficient number of points in a sparse point cloud and small enough to represent the surface normal of the point without smoothing. It is selected according to the scene scale in our work.

Keypoint Domain Most feature descriptors assume that the point clouds are accurate and dense enough to represent local geometry. However, local descriptors may fail if the point cloud has noise from the characteristics of the sensing devices or geometrical errors from reconstruction. Increasing the support radius is not a good solution because it smooths surface normals and increases computational complexity. For 2D descriptors, a few methods consider clusters [43] or the spatial matching consistency of neighbouring keypoints [44] to support local matches. In this framework, the spatial distribution of 3D keypoints in a larger neighbourhood with the supporting radius rk is considered to overcome this problem. This is implemented by applying the same feature descriptor to the keypoints only, excluding all other local points, as shown in Fig. 11.3.

Fig. 11.3 Description in local and keypoint domains

Colour Domain Colour information has been shown to be useful for 3D feature matching between uni-modal 3D structures [23, 24]. Tombari et al. [23] found that colour information in the CIELab colour space is more useful in the colour SHOT descriptor than in the RGB colour space. For colour description, three-channel RGB or CIELab information can be taken as the input of the descriptor instead of surface normals. We used the CIELab colour space, which is more perceptually uniform than the RGB space, following [23].
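The difference between the local and keypoint domains comes down to which points form the support of a descriptor. The following sketch, with hypothetical names and toy coordinates, selects the support set for one keypoint in each domain: all cloud points within rl for the local domain, and only other keypoints within the larger radius rk for the keypoint domain.

```python
import numpy as np

def support_set(centre, candidates, radius):
    """Return the candidate points within `radius` of `centre`, excluding the centre itself."""
    candidates = np.asarray(candidates, dtype=float)
    d = np.linalg.norm(candidates - np.asarray(centre, dtype=float), axis=1)
    return candidates[(d > 0) & (d <= radius)]

# Toy data (illustrative values only).
cloud = np.array([[0, 0, 0], [0.1, 0, 0], [0.5, 0, 0], [2, 0, 0]], dtype=float)
keypoints = np.array([[0, 0, 0], [0.5, 0, 0], [2, 0, 0], [10, 0, 0]], dtype=float)
kp = keypoints[0]

local_support = support_set(kp, cloud, radius=0.3)         # local domain, r_l
keypoint_support = support_set(kp, keypoints, radius=3.0)  # keypoint domain, r_k
```

The same descriptor can then be computed on either support set, which is how one descriptor type yields both a local and a semi-global description.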

11.5.4 Multi-domain Feature Descriptor Under the hypothesis that joint descriptors, which combine information from different domains, can increase the overall performance, Tombari et al. [23] proposed CSHOT, a cascade combination of shape and colour description. Alexandre [24] also tested this CSHOT and PFHRGB, a cascade combination of shape and RGB for the PFH descriptor. Our preliminary research [25] also found that the combination of descriptors applied on different domains can improve the matching and registration performance for multimodal data.

In this work, FPFH is used as the base descriptor because it shows fast and stable performance in our preliminary research [20]. The FPFH descriptor is extended to multiple domains to accommodate geometry and colour information together. For the same input point cloud and keypoints, three different FPFH descriptors are computed in three domains: local, keypoint and colour. The result is represented as a 2D vector with 33 × 3 bins. FPFH in the local domain, FL, describes the characteristics of the local geometry measured with the keypoint and its neighbouring local 3D points within the radius rl. FPFH in the keypoint domain, FK, shows the spatial distribution and variation of keypoints, which represents the semi-global geometric features of the scene within the volume radius rk, which is much larger than rl. FPFH in the colour domain, FC, defines the colour characteristics of local 3D points around the keypoint within the same local volume radius rl. FC uses colour components instead of surface normal components to compute descriptors.

11.5.5 Hybrid Feature Matching and Registration

The Hybrid RANSAC registration method is proposed to find an optimal 3D transform matrix between keypoint sets. It is modified from the SAC-IA algorithm [21] by introducing a new distance measure, a weighted sum of the multi-domain FPFH descriptor distances. A block diagram of the new feature matching and registration method, which registers keypoint set P in the source model to keypoint set Q in the target model, is illustrated in Fig. 11.4. The weight of each description domain in the matching process is adaptively selected according to the distinctiveness of the descriptor. For example, a point selected from a repetitive pattern has a high probability of a wrong match even with a low matching cost. To avoid these cases, the reliability λ(p) of a point p is calculated as the ratio of the second to the first nearest-neighbour distance in Q, as shown in Eq. (11.2). D(p, q) denotes the distance between the descriptors of p and q, and pNN[i] is the i-th of p's k-nearest neighbours in Q, indexed from 0.

Fig. 11.4 Hybrid RANSAC-based feature matching and registration

$$\lambda = \frac{D\left(p,\ p_{NN[1]}\right)}{D\left(p,\ p_{NN[0]}\right)} \qquad (11.2)$$

The matching cost DT(p, q) for a source keypoint p to a target keypoint q is calculated as the weighted sum of the individual domain descriptor distances as in Eq. (11.3).

$$D_T(p, q) = \lambda_L D_L(p, q) + \lambda_K D_K(p, q) + \lambda_C D_C(p, q) \qquad (11.3)$$
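Equations (11.2) and (11.3) can be illustrated with a short Python sketch. It assumes that the descriptor distance D is the Euclidean distance between descriptor vectors and that each keypoint carries one descriptor per domain in a dictionary keyed by "L", "K" and "C"; these representational choices and all names are assumptions for the example.

```python
import numpy as np

DOMAINS = ("L", "K", "C")  # local, keypoint and colour domains

def descriptor_distance(a, b):
    # Assumption: D(p, q) is the Euclidean distance between descriptors.
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def reliability(desc_p, target_descs):
    """Eq. (11.2): ratio of the second to the first nearest-neighbour distance."""
    d = np.sort([descriptor_distance(desc_p, q) for q in target_descs])
    # Guard against division by zero for an exact descriptor match.
    return float(d[1] / max(d[0], 1e-12))

def matching_cost(p, q, lam):
    """Eq. (11.3): reliability-weighted sum of per-domain descriptor distances."""
    return sum(lam[k] * descriptor_distance(p[k], q[k]) for k in DOMAINS)

lam_example = reliability([0.0, 0.0], [[1.0, 0.0], [2.0, 0.0], [5.0, 0.0]])
```

A ratio close to 1 indicates that the two best candidates are almost equally good, as happens for points on repetitive patterns, while a larger ratio indicates a more distinctive descriptor.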

Algorithm 1 shows the registration process in detail.

Algorithm 1 Hybrid RANSAC registration
Input: Keypoint descriptor sets P = {pi} and Q = {qi}
1. Randomly select three samples S = {si} ⊂ P
2. Calculate reliability set λ(si) = {λL(si), λK(si), λC(si)} for si
3. Find matches Qs = {qs(i)} ⊂ Q with min(DT(si, qs(i)))
4. Compute a rigid 3D transform matrix T from S to Qs
5. Exclude unreferenced keypoints in T(P) and Q which have no corresponding points within a range of Rmax from the keypoints
6. Compute registration error ER for the rest of keypoints in T(P) and Q
7. If ER < Emin, then replace Emin with ER and keep T as Topt
8. Repeat steps 1–7 until it meets the termination criteria:
   (a) Reach the maximum iteration Imax
   (b) Emin < Rmin
Output: Rigid 3D transform matrix Topt
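The loop of Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the weighted matching cost DT is assumed to be precomputed as a matrix `cost` (so steps 2 and 3 reduce to a row-wise minimum), the rigid transform of step 4 is estimated with the SVD-based Kabsch method, and the registration error is the RMS nearest-neighbour distance of the keypoints kept in step 5. All function and parameter names are hypothetical.

```python
import numpy as np

def rigid_transform(src, dst):
    """Least-squares R, t with dst ~ src @ R.T + t (Kabsch method)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    # Correct a possible reflection so that R is a proper rotation.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cd - R @ cs

def hybrid_ransac(P, Q, cost, r_max, r_min, i_max, seed=0):
    """P, Q: (n, 3) and (m, 3) keypoint positions; cost: (n, m) matrix of D_T."""
    rng = np.random.default_rng(seed)
    best, e_min = None, np.inf
    for _ in range(i_max):
        s = rng.choice(len(P), size=3, replace=False)    # step 1
        qs = cost[s].argmin(axis=1)                       # steps 2-3 (precomputed D_T)
        if len(set(qs.tolist())) < 3:
            continue                                      # degenerate sample
        R, t = rigid_transform(P[s], Q[qs])               # step 4
        moved = P @ R.T + t
        d = np.linalg.norm(moved[:, None] - Q[None], axis=2).min(axis=1)
        kept = d < r_max                                  # step 5
        if kept.sum() < 3:
            continue
        err = float(np.sqrt(np.mean(d[kept] ** 2)))       # step 6
        if err < e_min:                                   # step 7
            e_min, best = err, (R, t)
        if e_min < r_min:                                 # step 8(b)
            break
    return best, e_min
```

As a usage note, the function returns `None` for the transform when no valid sample is found within `i_max` iterations, so callers should check the result before applying it.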

11.6 Public Multimodal Database

To support research in the multimodal data processing area, a big multimodal database acquired in various indoor and outdoor environments, including the datasets used in this chapter, is provided in a public repository (http://cvssp.org/impart/). This repository includes various indoor/outdoor static scenes captured with multimodal sensors, including multiple synchronised video captures for dynamic actions, and 3D reconstructions. Grey colour LIDAR scanners, 360° spherical and line-scan cameras, DSLR/compact still cameras, HD video cameras, HD 2.7K/4K GoPro cameras and RGBD Kinect cameras were used. The HD (1920 × 1080) video cameras were genlock synchronised and calibrated. The database also contains some pre-processed data to make the datasets more useful for various research purposes. Details can be found in the capture notes provided on the repository [45].

The multimodal data registration pipeline in this chapter is tested on three datasets from this database: Studio, Patio and Cathedral. The Studio set is an indoor scene with stable KinoFlo fluorescent lighting on the ceiling. Patio is an outdoor scene covering an area of around 15 m × 10 m surrounded by walls. It has a symmetric structure and repetitive geometry/texture patterns from bricks and windows. The Cathedral set is an outdoor scene covering around 30 m × 20 m. It is a large open area, and the scene was captured under direct sunlight, which resulted in strong contrast and shadows. Figure 11.5 shows examples of the multimodal datasets used in the experiments.

Fig. 11.5 Examples of multimodal datasets (Top: Studio, Middle: Patio, Bottom: Cathedral)

11.7 Experiments

In order to evaluate the general performance of the multi-domain feature descriptor and hybrid registration, they are first tested on a single-modality case, the RGB-D Scenes Dataset from the University of Washington.⁹ This dataset provides 3D colour point clouds of four indoor scenes. Each scene has three to four takes with different main objects and coverage for the same place. One take per scene was randomly chosen, and the takes were merged into a single model, as shown in Fig. 11.6a, to generate the test target scene. Other takes of each scene, shown in Fig. 11.6b, were selected as sources and registered to the target scene of Fig. 11.6a using the proposed pipeline. The different objects and coverage of the source sets can be considered as noise or errors against the target scene, which makes the test challenging. Ground-truth registrations were generated by manual 4-point matching and ICP refinement using MeshLab¹⁰ for objective evaluation.

9 Washington Dataset, http://rgbd-dataset.cs.washington.edu/index.html
10 MeshLab, http://meshlab.sourceforge.net/

Fig. 11.6 Washington RGB-D Scenes Dataset. (a) Target reference scene. (b) Test sets to be registered

In the experiments on the multimodal datasets, the LIDAR scan is used as the target reference, and all other models are registered to the LIDAR coordinate system. Table 11.2 shows the datasets used in the experiments. "Spherical-P" is a part of the spherical reconstruction, used to verify the performance of part-to-whole scene registration. 3D models are generated at real-world scale using the reconstruction methods introduced in Sect. 11.3. Autodesk RECAP is used for image-based 3D reconstruction of the Studio and Patio scenes, and Bundler [26] + PMVS [27] for the Cathedral scene. The 3D point clouds reconstructed from 2D data are illustrated in Fig. 11.7. Ground-truth registration was generated in the same manner as for the Washington dataset.

Table 11.2 Experimental datasets

Modality           Studio        Patio         Cath
LIDAR              2 scans       3 scans       72 scans
Spherical (S)      1 scan        3 scans       3 scans
Spherical-P (SP)   –             –             1 scan
Photos 1 (P1)      94 photos     70 photos     50 photos
Photos 2 (P2)      –             95 photos     14 photos
RGBD (R)           1444 frames   1950 frames   –
Proxy (PR)         –             –             1 model

Figure 11.8a illustrates the original datasets in their own coordinates (Left) and the ground-truth registration results (Right). Figure 11.8b shows the registration error maps of the ground-truth registration, visualising the Hausdorff distance to the LIDAR model, with the range of 0–3 m mapped to a blue–red colour range. It shows that even the ground-truth registration has errors against the target model in geometry because the source model includes its own reconstruction errors and differs from the target model in coverage and density. Therefore, registration performance is evaluated by measuring the RMS error to the ground-truth registration points instead of the distance to the LIDAR model.

In 3D point cloud registration, the ICP algorithm requires an initial alignment. It fails if the initial position is not close enough to the final position. Therefore, the performance of the initial registration is judged by the success or failure of the subsequent ICP registration. We found that ICP converges successfully if the initial registration is within an RMS error of 1–2 m of the ground-truth registration, depending on the scene scale.
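The evaluation measure used in the experiments, the RMS error between points placed by an estimated registration and the same points placed by the ground-truth registration, can be sketched as below. Both registrations are assumed to be given as 4 × 4 homogeneous rigid transform matrices; the function names are hypothetical.

```python
import numpy as np

def apply_transform(T, points):
    """Apply a 4x4 homogeneous rigid transform to an (n, 3) array of points."""
    pts = np.asarray(points, dtype=float)
    return pts @ T[:3, :3].T + T[:3, 3]

def registration_rmse(T_est, T_gt, points):
    """RMS distance between points under the estimated and ground-truth transforms."""
    diff = apply_transform(T_est, points) - apply_transform(T_gt, points)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))

# Example: an estimate that is off by a pure translation of (3, 4, 0).
T_gt = np.eye(4)
T_est = np.eye(4)
T_est[:3, 3] = [3.0, 4.0, 0.0]
rmse = registration_rmse(T_est, T_gt, np.random.default_rng(0).normal(size=(5, 3)))
```

Because the error is measured against the ground-truth placement of the same source points, it is not affected by reconstruction errors or coverage differences between the source model and the LIDAR model.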

Fig. 11.7 3D models for registration. (a) Studio. (b) Patio. (c) Cathedral

11.7.1 3D Feature Detector

The existing 3D feature detectors and their combinations are evaluated for multimodal registration: 3D Noble [46], 3D SIFT, 3D Tomasi, 3D Noble+SIFT and 3D Tomasi+SIFT. The combination of Tomasi and Noble is not included because both are geometry-based detectors. The volume radius rs for surface normal calculation is set to 0.5 m for the outdoor scenes and 0.2 m for the indoor scene. The scale parameters for the SIFT detector are set to [Sm, So, Ss] = [rs, 8, 10] as suggested in the original implementation. Keypoints detected by the single detectors for the spherical reconstruction of the Cathedral scene are shown in Fig. 11.9. The Noble detector detected many more keypoints than the other detectors, but those keypoints are concentrated in specific regions. The SIFT and Tomasi detectors detected similar numbers of feature points, but the result of Tomasi is more evenly spread over the scene. The registration results with the detected keypoints in Table 11.3 clearly show the influence of the feature detector on registration performance. In feature description and matching, we used the FPFH descriptor with the parameter set [rl, Rmin, Rmax, Imax] = [0.8 (outdoor)/0.3 (indoor), 0.2, 0.8, 8000], considering the scale of the scenes.

Fig. 11.8 Ground-truth registration. (a) Registration (Top: Studio, Middle: Patio, Bottom: Cathedral). (b) Error map of Spherical model (Left: Studio, Middle: Patio, Right: Cathedral)

Fig. 11.9 Feature detection result (Cath-S). (a) Noble (9729 points). (b) SIFT (2115 points). (c) Tomasi (2461 points)

Table 11.3 Registration results with different feature detectors (N+S: Noble+SIFT, T+S: Tomasi+SIFT, S: Success, F: Failure)

Data set    Noble    SIFT     Tomasi   N+S      T+S
Studio-P    1.99     1.10     0.42     1.21     1.11
Studio-S    5.00     4.58     3.25     1.03     4.21
Studio-R    1.90     1.44     0.45     2.92     0.28
Patio-P1    10.41    15.96    1.34     1.71     1.44
Patio-P2    9.22     10.31    1.84     7.22     4.99
Patio-S     1.13     1.20     12.67    1.52     2.59
Patio-R     10.40    18.97    10.45    11.13    10.44
Cath-P1     1.69     1.66     0.61     1.24     0.59
Cath-P2     26.67    20.44    10.94    26.31    0.32
Cath-S      17.79    3.25     1.26     1.85     1.73
Cath-SP     13.45    13.42    1.63     1.06     0.69
Cath-PR     16.19    1.53     18.26    3.79     0.89
No. Suc.    4        5        7        7        8
A.RMSE      1.68     1.39     1.08     1.38     0.88

Parameters are ﬁxed for all multimodal datasets, and different parameters are used only for the Washington datasets because their scale is not known. In Table 11.3, ﬁgures emphasized in italics show failed cases in initial registration for ICP and bold ones show the best cases in each dataset. No.Suc. means the number of models succeeded in initial registration for ICP, and A.RMSE is the average RMS error of the successful registrations. The Noble detector shows the worst performance in the single detector test despite the largest number of feature points because the keypoints concentrated in speciﬁc areas could not contribute to efﬁcient matching and registration. The Tomasi detector shows the best performance among the single detectors with the highest number of successful registrations and the lowest RMS registration error. The combinations of geometric and colour detectors show better results as expected. Tomasi+SIFT detector shows good registration performance even with a normal FPFH descriptor. However, it fails with the Patio set due to repetitive geometry and texture in the scene. This Tomasi+SIFT detector is used for multidomain feature description and Hybrid matching in the next experiment.

11.7.2 Feature Matching and Registration

The proposed 3D feature descriptors are computed for the keypoints detected by the Tomasi+SIFT detector, and their registration performance is compared with that of conventional descriptors. The multi-domain FPFH descriptor with Hybrid RANSAC registration (denoted as FHYB) is evaluated against normal FPFH (F), SHOT (S), and cascade combinations of FPFH descriptors in different domains (FLK, FLC and FLKC). We use the same parameter set as in Sect. 11.7.1 for the multimodal datasets and [rl, rk, Rmin, Rmax, Imax] = [0.2, 1.0, 0.05, 1.0, 5000] for the Washington datasets.

The best matching pairs of all detected keypoints to the target reference are calculated and compared with the ground-truth feature matching pairs to evaluate matching performance. Ground-truth feature matching pairs are defined by the closest keypoints of the target reference within the range rgt of the source keypoints transformed by the ground-truth registration. rgt was set to 0.03 for the Washington dataset (whose scale is unknown) and 5 cm for the multimodal dataset. Precision values are computed as suggested in [19]:

$$\mathrm{Precision} = \frac{\text{Number of correct matches}}{\text{Number of matches}} \qquad (11.4)$$
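The precision of Eq. (11.4), together with the ground-truth pair definition given above, can be sketched as follows. A match (i, j) is counted as correct when target keypoint j is the closest target keypoint to the ground-truth-transformed source keypoint i and lies within rgt of it. The ground-truth registration is assumed to be a 4 × 4 matrix, and all names are hypothetical.

```python
import numpy as np

def matching_precision(src_keypoints, tgt_keypoints, matches, T_gt, r_gt):
    """Fraction of matches (i, j) that agree with the ground-truth pairs."""
    src = np.asarray(src_keypoints, dtype=float) @ T_gt[:3, :3].T + T_gt[:3, 3]
    tgt = np.asarray(tgt_keypoints, dtype=float)
    correct = 0
    for i, j in matches:
        d = np.linalg.norm(tgt - src[i], axis=1)
        nearest = int(d.argmin())
        # Correct only if j is the closest target keypoint and it is within r_gt.
        if nearest == j and d[nearest] <= r_gt:
            correct += 1
    return correct / len(matches)

src = [[0, 0, 0], [1, 0, 0]]
tgt = [[0, 0, 0.01], [1, 0, 0], [5, 5, 5]]
precision = matching_precision(src, tgt, [(0, 0), (1, 2)], np.eye(4), r_gt=0.05)
```

In this toy case the first match agrees with the ground truth and the second does not, so half of the matches are correct.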

Single-Modal Dataset The matching precision and registration results for the Washington RGB-D Scenes Dataset according to the descriptors are shown in Table 11.4. We show only precision values, not recall, in this experiment because the outlier ratio is more important in RANSAC-based registration. Avg. in the last row is the average of the precision values for the precision columns and the average RMS registration error of the successful registrations for the registration columns. In this experiment, the combination of features from various domains shows a higher precision rate. In particular, it shows better results in both matching and registration when the colour information is involved. The hybrid RANSAC registration shows competitive performance against the other cascade combination methods but is not advantageous considering its computational complexity.

Multimodal Dataset The feature matching and initial registration results for the multimodal dataset are shown in Table 11.5. The precision rates for the multimodal set are lower than those for the single-modal set shown in Table 11.4 due to the different characteristics and errors of cross-modalities. It is observed that the proposed FHYB shows higher precision than the other descriptors. Figure 11.10 illustrates examples of feature matching using the conventional SHOT and FPFH local descriptors and the proposed multi-domain hybrid matching. The best 20 keypoint matches for the Patio set and 200 matches for the Cathedral set are visualised. The proposed method shows more consistent matching to the correct position, while the matching results of the other local descriptors are scattered over the scene.

In Table 11.5, the Studio set shows better performance than the Patio and Cathedral sets, and the colour information in particular improves the performance of feature matching because the Studio set was captured in stable lighting conditions. However, it shows a poor result with the spherical reconstruction because the Studio-S model has large self-occlusion areas in the geometry.

In the Patio scene, repetitive texture and geometry such as bricks and window frames cause relatively low feature matching rates. In the registration results, it is observed that some sets are misregistered by a 180° rotation, as shown in Fig. 11.11a. The keypoint description (FK), which considers the feature distribution over a larger area, achieves better performance than local descriptors in the presence of repetitive local geometry and appearance.

Table 11.4 Matching and registration results for single-modal dataset (Washington dataset)

Precision of feature matching (%)
Dataset     F       S       FLK     FLC     FLKC    FHYB
Scene1-T2   9.56    13.02   19.74   28.07   29.82   28.51
Scene2-T2   14.47   15.85   9.56    13.24   17.65   19.12
Scene3-T2   8.60    12.20   16.87   29.52   26.81   30.42
Scene4-T2   13.55   7.96    9.68    9.85    10.75   9.68
Avg.        11.55   12.26   13.96   20.17   21.26   21.93

Registration error (RMSE)
Dataset     F       S       FLK     FLC     FLKC    FHYB
Scene1-T2   0.10    0.08    0.06    0.03    0.06    0.06
Scene2-T2   0.09    0.06    0.14    0.05    0.09    0.08
Scene3-T2   0.21    0.04    0.03    0.03    0.03    0.05
Scene4-T2   0.09    0.17    0.07    0.06    0.04    0.05
No. Suc.    3       3       3       4       4       4
Avg.        0.06    0.06    0.05    0.04    0.05    0.06

Table 11.5 Matching and registration results for multimodal dataset

Precision of feature matching (%)
Dataset       F       S       FLK     FLC     FLKC    FHYB
Studio-P      5.26    1.85    4.39    5.26    6.43    6.14
Studio-S      1.15    1.05    1.52    4.52    4.20    6.72
Studio-R      3.58    5.05    3.82    6.11    5.34    6.87
Studio Avg.   3.33    2.65    3.24    5.30    5.33    6.58
Patio-P1      1.14    0.95    1.33    0.38    1.05    1.25
Patio-P2      1.69    0.92    1.49    0.89    1.12    1.37
Patio-S       0.38    0.25    0.31    0.85    1.52    1.95
Patio-R       0.52    0.52    7.52    1.04    6.45    8.06
Patio Avg.    0.93    0.66    2.66    0.79    2.54    3.16
Cath-P1       1.72    1.74    2.43    1.63    2.07    2.18
Cath-P2       4.25    4.15    5.11    3.22    3.30    4.40
Cath-S        2.58    2.29    1.99    2.44    2.71    2.60
Cath-SP       2.24    1.17    5.01    1.02    4.53    5.49
Cath-PR       1.18    0.10    0.53    0.88    1.28    1.14
Cath Avg.     2.39    1.89    3.02    1.84    2.78    3.16

Registration error (RMSE)
Dataset       F       S       FLK     FLC     FLKC    FHYB
Studio-P      1.11    0.28    0.35    0.17    0.11    0.15
Studio-S      4.21    2.15    2.75    0.50    0.38    0.45
Studio-R      0.28    0.12    2.85    0.21    0.09    0.12
Studio Avg.   1.87    0.85    1.98    0.29    0.19    0.24
Patio-P1      1.44    0.82    0.68    9.85    0.51    0.47
Patio-P2      4.99    0.55    0.36    0.87    1.41    0.91
Patio-S       2.59    12.80   0.98    1.03    0.68    1.02
Patio-R       10.44   10.71   0.89    10.54   0.43    0.27
Patio Avg.    4.87    6.22    0.73    5.57    0.76    0.66
Cath-P1       0.59    1.08    0.30    1.93    1.46    0.38
Cath-P2       0.32    0.45    0.27    36.36   36.83   0.22
Cath-S        1.73    1.06    1.39    1.58    1.33    1.12
Cath-SP       0.69    8.35    0.48    12.75   0.25    0.43
Cath-PR       0.89    19.32   8.58    1.21    2.47    0.94
Cath Avg.     0.84    6.05    2.21    10.77   8.47    0.62

Fig. 11.10 Matched features (Top: SHOT, Middle: FPFH, Bottom: Proposed). (a) Patio-R to LIDAR. (b) Cath-SP to LIDAR

Fig. 11.11 Failure cases in registration. (a) Patio-S with FPFH. (b) Cath-P2 with FPFHLC

The appearance information is less reliable in the Cathedral scene models because brightness and colour balance change with the capture device, capture direction and time in the outdoor environment. Figure 11.11b shows that the left part of the building has been mapped to the right part in the LIDAR model. This happens with descriptors whose local and colour components dominate the matching over the semi-global geometric component. The hybrid matching and registration resolves this bias. However, the colour information is useful in the case of the proxy model (Cath-PR), whose geometrical features have very low distinctiveness. The SHOT descriptor also shows poor results in feature matching due to the failure of local RF definition. The combinations of descriptors show slightly better performance than the single descriptors, but they sometimes perform worse, as seen in the cases of Patio-P1 with FLC, Cath-P2 with FLK, Cath-SP with FLC and Cath-PR with FLK, because the features from different domains compete with each other without considering their reliabilities. The proposed hybrid method FHYB successfully registered all datasets with high precision of feature matching and low registration error.

11.8 Conclusion

In this chapter, we have introduced a framework for multimodal digital media data management, which allows various input modalities to be registered into a unified 3D space. A multi-domain feature descriptor extended from existing feature descriptors and a hybrid RANSAC-based registration technique were introduced. The pipeline was evaluated on datasets from the multimodal database acquired with various modalities, including active and passive sensors. The proposed framework shows two times higher precision of feature matching and more stable registration performance than conventional 3D feature descriptors. Future work aims to apply the current framework to large-scale spatio-temporal scene data in order to produce a coherent view of the world. This will include synchronisation and registration of multimodal data streams captured by diverse collections of devices, from consumer-level to professional, under uncontrolled and unpredictable environments. Another direction will be the registration of non-visual modalities such as audio and text. Feature description and matching methods for such cross-modalities are still open problems.

References 1. Starck, J., Maki, A., Nobuhara, S., Hilton, A., Matsuyama, T.: The multiple-camera 3-d production studio. IEEE Trans. Circuits Syst. Video Technol. 19(6), 856–869 (2009) 2. Kim, H., Guillemaut, J.-Y., Takai, T., Sarim, M., Hilton, A.: Outdoor dynamic 3d scene reconstruction. IEEE Trans. Circuits Syst. Video Technol. 22(11), 1611–1622 (2012) 3. Namin, S.T., Najaﬁ, M., Salzmann, M., Petersson, L.: Cutting edge: Soft correspondences in multimodal scene parsing. In: Proceedings of ICCV (2015) 4. Brown, M., Windridge, D., Guillemaut, J.-Y.: Globally optimal 2d-3d registration from points or lines without correspondences. In: Proceedings of ICCV (2015) 5. Sattler, T., Leibe, B., Kobbelt, L.: Improving image-based localization by active correspondence search. In: Proceedings of ECCV (2012) 6. Mastin, J.K., Fisher, J.: Automatic registration of lidar and optical images of urban scenes. In: Proceedings of CVPR, pp. 2639–2646 (2009) 7. Budge, S., Badamikar, N., Xie, X.: Automatic registration of fused lidar/digital imagery (texel images) for three-dimensional image creation. Opt. Eng. 54(3), 031105 (2015) 8. Wang, A., Lu, J., Cai, J., Cham, T.-J., Wang, G.: Large-margin multimodal deep learning for rgb-d object recognition. IEEE Trans. Multimed. 17(11), 1887–1898 (2015) 9. Zhao, Y., Wang, Y., Tsai, Y.: 2d-image to 3d-range registration in urban environments via scene categorization and combination of similarity measurements. In: Proceedings of ICRA (2016) 10. Wang, R., Ferrie, F., Macfarlane, J.: Automatic registration of mobile lidar and spherical panoramas. In: Proceedings of CVPR, pp. 33–40 (2012)

296

H. Kim and A. Hilton

11. Chen, L. Cao, H.X., Zhuo, X.: Registration of vehicle based panoramic image and lidar point cloud. In: Proceedings of SPIE, vol. 8919 (2013) 12. Stamos, L., Liu, C., Chen, G., Wolberg, G.Y., Zokai, S.: Integrating automated range registration with multiview geometry for the photorealistic modeling of large-scale scenes. Int. J. Comput. Vis. 78(2–3), 237–260 (2008) 13. Dutagaci, H., Cheung, C.P., Godil, A.: Evaluation of 3d interest point detection techniques via human-generated ground truth. Vis. Comput. 28(9), 901–917 (2012) 14. Tombari, F., Salti, S., Di Stefano, L.: Performance evaluation of 3d keypoint detectors. Int. J. Comput. Vis. 102, 198–220 (2013) 15. Tomasi, Kanade, T.: Detection and tracking of point features. Pattern Recognit. 37, 165–168 (2004) 16. Lowe, G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004) 17. Restrepo, M., Mundy, J.: An evaluation of local shape descriptors in probabilistic volumetric scenes. In: Proceedings of BMVC, pp. 46.1–46.11 (2012) 18. Guo, Y., Bennamoun, M., Sohel, F., Lu, M., Wan, J.: 3d object recognition in cluttered scenes with local surface features: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 36(11), 2270–2287 (2014) 19. Guo, Y., Bennamoun, M., Sohel, F., Lu, M., Wan, J., Kwok, N.M.: A comprehensive performance evaluation of 3d local feature descriptors. Int. J. Comput. Vis. 116(1), 66–89 (2016) 20. Kim, H., Hilton, A.: Evaluation of 3d feature descriptors for multimodal data registration. In: Proceedings of 3DV, pp. 119–126 (2013) 21. Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (fpfh) for 3d registration. In: Proceedings of ICRA, pp. 3212–3217 (2009) 22. Tombari, S.S., Di Stefano, L.: Unique signatures of histograms for local surface description. In: Proceedings ECCV, pp. 356–369 (2010) 23. Tombari, S.S., Stefano, L.D.: A combined texture-shape descriptor for enhanced 3d feature matching. In: Proceedings of ICIP, pp. 809–812 (2011) 24. 
Alexandre, L.A.: 3d descriptors for object and category recognition: a comparative evaluation. In: Proceedings of Workshop on Color-Depth Camera Fusion in Robotics at IROS (2012) 25. Kim, Hilton, A.: Inﬂuence of colour and feature geometry on multimodal 3d point clouds data registration. In: Proceedings of 3DV, pp. 4321–4328 (2014) 26. Snavely, N., Seitz, S., Szeliski, R.: Photo tourism: exploring photo collections in 3d. In: Proceedings of ACM SIGGRAPH, pp. 835–846 (2006) 27. Furukawa, Y., Ponce, J.: Accurate, dense, and robust multiview stereopsis. IEEE Trans. Pattern Anal. Mach. Intell. 32(8), 1362–1376 (2010) 28. Mitchelson, Hilton, A.: Wand-based multiple camera studio calibration. CVSSP Technical Report, vol. VSSP-TR-2/2003 (2003) 29. Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A.: Kinectfusion: real-time dense surface mapping and tracking. In: Proceedings of IEEE ISMAR (2011) 30. Whelan, T., Leutenegger, S., Salas-Moreno, R.F., Glocker, B., Davison, A.J.: Elasticfusion: dense slam without a pose graph. In: Proceedings of RSS (2015) 31. Kähler, V. Prisacariu, A., Murray, D.W.: Real-time large-scale dense 3d reconstruction with loop closure. In: ECCV 2016, pp. 500–516 (2016) 32. Hunt, M., Prisacariu, V., Golodetz, S., Torr, P.: Probabilistic object reconstruction with online loop closure. In: Proceedings of 3DV (2017) 33. Im, S., Ha, H., Rameau, F., Jeon, H.-G., Choe, G., Kweon, I.S.: All-around depth from small motion with a spherical panoramic camera. In: Proceedings of ECCV (2016) 34. Schoenbein, Geiger, A.: Omnidirectional 3d reconstruction in augmented manhattan worlds. In: Proceedings of IROS, pp. 716–723 (2014) 35. Barazzetti, M.P., Roncoroni, F.: 3d modelling with the samsung gear 360, pp. 85–90 (2017)

11

Big Multimodal Visual Data Registration for Digital Media Production

297

36. Gupta, A., Efros, A.A., Hebert, M.: Blocks world revisited: image understanding using qualitative geometry and mechanics. In: Proceedings of ECCV (2010) 37. Xiao, J., Fang, T., Zhao, P., Lhuillier, M., Quan, L.: Image-based street-side city modeling. In: Proceedings of SIGGRAPH ASIA (2009) 38. Kim, Hilton, A.: Planar urban scene reconstruction from spherical images using facade alignment. In: Proceedings of IVMSP (2013) 39. Besl, P., McKay, N.: A method for registration of 3-d shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14(2), 239–256 (1992) 40. Sun, M.O., Guibas, L.: A concise and provably informative multi-scale signature based on heat diffusion. In: Proceedings of SGP, pp. 1383–1392 (2009) 41. Johnson, A., Hebert, M.: Using spin images for efﬁcient object recognition in cluttered 3d scenes. IEEE Trans. Pattern Anal. Mach. Intell. 21(5), 433–449 (1999) 42. Frome, D. Huber, R., Kolluri, T.B., Malik, J.: Recognizing objects in range data using regional point descriptors. In: Proceedings of ECCV (2004) 43. Estrada, F., Fua, P., Lepetit, V., Susstrunk, S.: Appearance-based keypoint clustering. In: Proceedings of CVPR, pp. 1279–1286 (2009) 44. Sattler, T., Leibe, B., and Kobbelt, L.: Scramsac: improving ransac’s efﬁciency with a spatial consistency ﬁlter. In: Proceedings of ICCV, pp. 2090–2097 (2009) 45. Kim, H., Hilton, A.: Impart multimodal/multi-view datasets. https://doi.org/10.15126/ surreydata.00807707. Available: http://cvssp.org/impart/ 46. Filipe, S., Alexandre, L.A.: A comparative evaluation of 3d keypoint detectors in a RGB-D object dataset. In: Proceedings of VISAPP, pp. 476–483 (2014)

Chapter 12

A Hybrid Fuzzy Football Scenes Classification System for Big Video Data

Song Wei and Hani Hagras

Abstract In this chapter, we introduce a novel system based on Hybrid Interval Type-2 Fuzzy Logic Classification Systems (IT2FLCS) that can deal with a large training set of complicated video sequences to extract the main scenes in a football match. Football video scenes present added challenges because specific objects and events, such as the audience and coaches, have highly similar features, and because the scenes are made up of a series of quickly changing, dynamic frames with small inter-frame variations. In addition, there is an added difficulty associated with the need for light-weight video classification systems that can work in real time with the massive data sizes associated with video analysis applications. The proposed fuzzy-based system achieves relatively high classification accuracy with a small number of rules, thus increasing the system interpretability.

S. Wei (*) · H. Hagras, The Computational Intelligence Centre, School of Computer Science and Electronic Engineering, Colchester, UK. e-mail: [email protected]; [email protected]
© Springer Nature Switzerland AG 2019. K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_12

12.1 Introduction

We have witnessed a huge growth in video data over the last decade as one component of Big data, assembled from various domains including sports, movies, and security. This rapid growth has led to a speedy increase in the volume of video data to be stored, managed, and analyzed. Meanwhile, data is growing at a 40% compound annual rate, reaching 45 ZB (1 ZB = $10^9$ TB), which is three times the current data size [1]. With overwhelming amounts of Big data in videos, there have been several research efforts to automate target object detection, video scene classification, and event recognition [2]. In the last decade, there has been a major focus on linguistic video summarization, driven by its massive commercial potential and broad application prospects in security, surveillance, communications, sports, entertainment, etc. Football videos have the largest audience in the world, and hence there is a need to develop tools that can automatically classify the video scenes.

Video scenes can be regarded as continuous sequences of images, but the classification problem is much more complicated than single image classification due to the dynamic nature of a video sequence and the associated changes in light conditions, background, camera angle, occlusions, indistinguishable scene features, etc. However, this problem can be addressed through new techniques in Big data. Several studies on football videos have been reported based on object and movement detection. For instance, a framework was demonstrated for football video analysis and summarization using object-based features like color histograms and target object segments to recognize players and the ball in football videos [3]. In [4], an approach was presented to classify replay scenes using real-time subtitle text in combination with local image features and text semantics. However, these techniques are based on supplied subtitle text and local segmental image features, so they are not applicable to all kinds of sports videos. In [5], research was presented using local image features with lines and boundaries to detect team movements and predict whether a team is in "attack" or "defense." A support vector machine (SVM)-based classification system was produced to classify significant video shots in football match videos [6]. However, this work processes single images only and did not focus on continuous video frames [6]. An approach based on decision trees was used to recognize scenes in sport match videos with the aim of classifying basketball scenes by using color, texture, and motion directions [7]. However, this approach employed too many basketball-unique features, making it difficult to apply to video scene classification in other sports videos.
A novel approach combining a type-1 fuzzy logic system and a Hidden Markov Model (HMM) to recognize and annotate events in football video was presented by Hosseini [8]. This fuzzy rule-based reasoning framework detects the specific object from the scenes which can be regarded as the representative target event in football videos.

Research has been presented on almost every aspect of Big data processing and applications, covering both technical and non-technical challenges. In the next few years, Big data will bring creativity and innovation into traditional industries and change the original techniques for better efficiency, more security, and higher adaptability [9]. One prominent example is the analysis of tumors: Big data researchers believe that analyzing the data of the thousands of tumors that have come before will reveal general patterns that can improve screening and diagnosis and inform treatment [10]. In statistics, Big data is leading to a revolution because data can be collected with universal or near-universal population coverage instead of relatively small-sample surveys [11]; thus it is possible to provide more data-based applications that would bring huge changes to the world. The discovery of knowledge from Big data calls for the support of suitable techniques and technologies.

The Fuzzy Logic Classification System (FLCS) employs fuzzy sets (FS) and a rule base and provides a white-box approach that can handle the uncertainties associated with football videos. This chapter presents a system capable of detecting and classifying scenes in football videos, mainly using an interval type-2 FLCS (T2FLCS) scene classification system to process the Big data volumes of football video. Traditional fuzzy logic classification systems can result in huge rule bases due to the curse of dimensionality associated with fuzzy systems. We continue our previous works reported in [12, 13] and focus on processing new types of videos and bigger datasets from football matches in various countries. The parameters of the presented Interval Type-2 Fuzzy Logic Classification System (IT2FLCS), namely the rules and the fuzzy set parameters, are optimized by the Big Bang–Big Crunch (BB–BC) algorithm. The presented system can deal with a large training set of complicated video sequences to extract the main scenes of a football match. The proposed fuzzy-based system achieves relatively high classification accuracy with a small number of rules, thus increasing the system interpretability. Section 12.2 provides a brief overview of type-2 FLCS, fuzzy rule generation, and the BB–BC optimization algorithm. Section 12.3 presents the proposed type-2 fuzzy logic video scenes classification system for football videos. Section 12.4 describes the optimization of the system, and Section 12.5 presents the experiments and results, followed by the conclusions and future work.

12.2 Background Knowledge

In this section, we present an overview of Interval Type-2 Fuzzy Sets (IT2FS) and the Big Bang–Big Crunch (BB–BC) algorithm. We also explain the basic concepts of the Fuzzy Classification System (FCS), which can employ type-1 (T1) or type-2 (T2) fuzzy sets.

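12.2.1 Type-2 Fuzzy Sets

Type-2 fuzzy sets (T2FSs) were introduced in [14] as an extension of the concept of an ordinary fuzzy set (T1FS). A type-2 fuzzy set, denoted $\tilde{A}$, is characterized by a type-2 membership function $\mu_{\tilde{A}}(x, \mu)$, where $x \in X$ and $\mu \in J_x \subseteq [0, 1]$, i.e. [14],

$$\tilde{A} = \left\{ \left( (x, \mu),\ \mu_{\tilde{A}}(x, \mu) \right) \mid \forall x \in X,\ \forall \mu \in J_x \subseteq [0, 1] \right\} \tag{12.1}$$

in which $0 \le \mu_{\tilde{A}}(x, \mu) \le 1$. $\tilde{A}$ can also be expressed as:

$$\tilde{A} = \int_{x \in X} \int_{\mu \in J_x} \mu_{\tilde{A}}(x, \mu) / (x, \mu), \qquad J_x \subseteq [0, 1] \tag{12.2}$$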


where $\iint$ denotes union over all admissible $x$ and $\mu$. $J_x$ is called the primary membership of $x$, where $J_x \subseteq [0, 1]$ for $\forall x \in X$ [14]. The uncertainty in the primary memberships of a type-2 fuzzy set consists of a bounded region that is called the Footprint of Uncertainty (FOU) [15], which is the aggregation of all primary memberships [15]. According to [18], the upper membership function is associated with the upper bound of the footprint of uncertainty $FOU(\tilde{A})$ of a type-2 membership function. The lower membership function is associated with the lower bound of $FOU(\tilde{A})$. The upper and lower Membership Functions (MFs) of $\mu_{\tilde{A}}(x)$ can be represented as $\bar{\mu}_{\tilde{A}}(x)$ and $\underline{\mu}_{\tilde{A}}(x)$, so that $\mu_{\tilde{A}}(x)$ can be expressed as:

$$\mu_{\tilde{A}}(x) = \int_{\mu \in \left[ \underline{\mu}_{\tilde{A}}(x),\ \bar{\mu}_{\tilde{A}}(x) \right]} 1 / \mu \tag{12.3}$$

In the interval type-2 fuzzy sets (shown in Fig. 12.1a), all the third-dimension values are equal to one [15] (shown in Fig. 12.1b).

Fig. 12.1 (a) Type-2 fuzzy sets with Gaussian Membership Function, showing the Footprint of Uncertainty (FOU), the Upper Membership Function (UMF), and the Lower Membership Function (LMF); (b) 3D view of interval type-2 Gaussian Membership Function

12.2.2 Fuzzy Logic Classification System Rule Generation

The fuzzy logic classification system employs the concepts of confidence and support for each generated fuzzy rule [16]. A fuzzy rule in a fuzzy classification system is written as [16]:

$$\text{Rule } R_q: \text{ IF } x_1 \text{ is } A_{q1} \text{ and } \ldots \text{ and } x_n \text{ is } A_{qn} \text{ THEN class } C_q \tag{12.4}$$

The firing strength of rule $q$ for a crisp input $x_p$ is denoted $w_q(x_p)$ and can be written as:

$$w_q(x_p) = \min\left\{ \mu_{A_{q1}}(x_p), \ldots, \mu_{A_{qn}}(x_p) \right\} \tag{12.5}$$

where $\mu_{A_{q1}}(x_p)$ represents the membership value of the crisp input $x_p$ in the fuzzy set $A_{q1}$ for type-1 fuzzy sets, and $n$ is the number of inputs in each rule. For type-2 fuzzy sets, $\bar{w}_q(x_p)$ and $\underline{w}_q(x_p)$ represent the upper and lower firing strengths of rule $q$ for a crisp input $x_p$, which can be written as:

$$\bar{w}_q(x_p) = \min\left\{ \bar{\mu}_{A_{q1}}(x_p), \ldots, \bar{\mu}_{A_{qn}}(x_p) \right\} \tag{12.6}$$

$$\underline{w}_q(x_p) = \min\left\{ \underline{\mu}_{A_{q1}}(x_p), \ldots, \underline{\mu}_{A_{qn}}(x_p) \right\} \tag{12.7}$$

where $\bar{\mu}_{A_{q1}}(x_p)$ and $\underline{\mu}_{A_{q1}}(x_p)$ represent the upper and lower membership values of the crisp input $x_p$ in the fuzzy set $A_{q1}$, respectively, and $n$ is the number of inputs in each rule.
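The firing strengths in Eqs. (12.5)–(12.7) reduce to a minimum over the antecedent membership values. The following is a minimal sketch of that step only; the function names and the example membership values are hypothetical and are not taken from the chapter.

```python
from typing import Sequence, Tuple


def firing_strength(memberships: Sequence[float]) -> float:
    """Type-1 firing strength, Eq. (12.5): minimum antecedent membership."""
    return min(memberships)


def interval_firing_strength(
    upper: Sequence[float], lower: Sequence[float]
) -> Tuple[float, float]:
    """Upper and lower firing strengths, Eqs. (12.6) and (12.7)."""
    return min(upper), min(lower)


# Hypothetical membership values of one input pattern in a rule
# with three antecedents.
w = firing_strength([0.8, 0.3, 0.6])
w_upper, w_lower = interval_firing_strength([0.9, 0.5, 0.7], [0.7, 0.2, 0.5])
```

A rule fires weakly as soon as any single antecedent matches poorly, which is the usual consequence of choosing the minimum as the conjunction operator.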


In order to train the system and learn from the data, we introduce the concepts of confidence and support. The confidence of a type-1 fuzzy rule is written as follows [15]:

$$c(A_q \Rightarrow C_q) = \frac{\sum_{x_p \in \text{Class } C_q} w_q(x_p)}{\sum_{p=1}^{m} w_q(x_p)} \tag{12.8}$$

where $m$ is the number of training patterns $x_p$ over which the denominator is summed. The confidence can be viewed as measuring the validity of $A_q \Rightarrow C_q$. It can also be viewed as a numerical approximation of the conditional probability. On the other hand, the support of $A_q \Rightarrow C_q$ for type-1 fuzzy sets is written as follows [16]:

$$s(A_q \Rightarrow C_q) = \sum_{x_p \in \text{Class } C_q} w_q(x_p) \tag{12.9}$$

The support can be viewed as measuring the coverage of training patterns by $A_q \Rightarrow C_q$. Confidence and support are also employed with type-2 fuzzy sets, where the upper and lower confidence of a fuzzy rule are written as:

$$\bar{c}(A_q \Rightarrow C_q) = \frac{\sum_{x_p \in \text{Class } C_q} \bar{w}_q(x_p)}{\sum_{p=1}^{m} \bar{w}_q(x_p)} \tag{12.10}$$

$$\underline{c}(A_q \Rightarrow C_q) = \frac{\sum_{x_p \in \text{Class } C_q} \underline{w}_q(x_p)}{\sum_{p=1}^{m} \underline{w}_q(x_p)} \tag{12.11}$$

where $m$ is again the number of training patterns. The upper and lower confidence can be viewed as measuring the validity of $A_q \Rightarrow C_q$ and as a numerical approximation of the conditional probability. On the other hand, the upper and lower support of $A_q \Rightarrow C_q$ are written as:

$$\bar{s}(A_q \Rightarrow C_q) = \sum_{x_p \in \text{Class } C_q} \bar{w}_q(x_p) \tag{12.12}$$

$$\underline{s}(A_q \Rightarrow C_q) = \sum_{x_p \in \text{Class } C_q} \underline{w}_q(x_p) \tag{12.13}$$

The upper and lower support can be viewed as measuring the coverage of training patterns by $A_q \Rightarrow C_q$. The confidence and support are employed by the type-1 and type-2 fuzzy logic classification systems during the testing phase. The system learns during the training phase to classify the incoming data. For the type-1 fuzzy logic classification system, the crisp inputs are fuzzified and the firing strength of each rule is computed as $w_q(V_U^i)$; the strength of each class is given by:

$$O_{Class_h}(V_U^i) = \sum_q w_q(V_U^i) * c(A_q \Rightarrow C_q) * s(A_q \Rightarrow C_q) \tag{12.14}$$

where the output classification is the class with the highest class strength. In the type-2 fuzzy logic classification system, the crisp inputs are fuzzified, and then we compute the upper firing strength $\bar{w}_q(V_U^i)$ and lower firing strength $\underline{w}_q(V_U^i)$ of each fired rule. The strength of each class is given by:

$$\bar{O}_{Class_h}(V_U^i) = \sum_q \bar{w}_q(V_U^i) * \bar{c}(A_q \Rightarrow C_q) * \bar{s}(A_q \Rightarrow C_q) \tag{12.15}$$

$$\underline{O}_{Class_h}(V_U^i) = \sum_q \underline{w}_q(V_U^i) * \underline{c}(A_q \Rightarrow C_q) * \underline{s}(A_q \Rightarrow C_q) \tag{12.16}$$

$$O_{Class_h}(V_U^i) = \frac{\bar{O}_{Class_h}(V_U^i) + \underline{O}_{Class_h}(V_U^i)}{2} \tag{12.17}$$

The class with the highest strength is the winner and is returned as the output classification.
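The training-time statistics of Eqs. (12.8)–(12.13) and the class strengths of Eqs. (12.14)–(12.17) can be sketched together as follows. This is an illustrative sketch under two assumptions that the chapter does not state explicitly: the sum over $q$ is taken over the fired rules whose consequent is the class in question, and a rule that never fires gets a confidence of zero. All names and numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def confidence(strengths: Sequence[float], labels: Sequence[str],
               rule_class: str) -> float:
    """Eq. (12.8): in-class firing strength over total firing strength."""
    total = sum(strengths)
    in_class = sum(w for w, c in zip(strengths, labels) if c == rule_class)
    # Assumption: a rule that never fires has zero confidence.
    return in_class / total if total > 0 else 0.0


def support(strengths: Sequence[float], labels: Sequence[str],
            rule_class: str) -> float:
    """Eq. (12.9): summed firing strength over in-class patterns."""
    return sum(w for w, c in zip(strengths, labels) if c == rule_class)


# One fired rule: (consequent class, firing strength, confidence, support).
FiredRule = Tuple[str, float, float, float]


def class_strengths(fired: List[FiredRule]) -> Dict[str, float]:
    """Eq. (12.14): accumulate w * c * s per consequent class."""
    out: Dict[str, float] = {}
    for cls, w, c, s in fired:
        out[cls] = out.get(cls, 0.0) + w * c * s
    return out


def type2_class_strengths(upper: List[FiredRule],
                          lower: List[FiredRule]) -> Dict[str, float]:
    """Eqs. (12.15)-(12.17): average of upper and lower class strengths."""
    up, lo = class_strengths(upper), class_strengths(lower)
    return {cls: (up.get(cls, 0.0) + lo.get(cls, 0.0)) / 2.0
            for cls in set(up) | set(lo)}


# Training side: firing strengths of one rule over three labelled patterns.
# Passing upper or lower strengths gives Eqs. (12.10)-(12.13) instead.
train_strengths = [0.5, 0.25, 0.25]
train_labels = ["Center", "People", "Center"]
c = confidence(train_strengths, train_labels, "Center")
s = support(train_strengths, train_labels, "Center")

# Testing side: two fired rules with upper and lower values.
upper = [("Center", 0.5, 0.5, 2.0), ("People", 0.25, 0.5, 1.0)]
lower = [("Center", 0.25, 0.5, 2.0), ("People", 0.25, 0.5, 1.0)]
strengths = type2_class_strengths(upper, lower)
predicted = max(strengths, key=strengths.get)
```

In this toy case the "Center" rule dominates on both bounds, so the averaged strength selects "Center" as the output class.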

12.2.3 The Big Bang–Big Crunch Optimization Algorithm

The BB–BC optimization is a heuristic population-based evolutionary approach presented by Erol and Eksin [17]. Fast convergence, ease of implementation, and low computational cost are the advantages of BB–BC. The theory of BB–BC is inspired by the Big Bang theory in physics and its two eponymous phases. In the BB phase, the candidate solutions are randomly distributed over the search space in a uniform manner, while in the BC phase the candidate solutions are drawn into a single representative point via a center of mass or minimal cost approach [17, 18]. The procedure of the BB–BC is as follows [18]:

Step 1 (Big Bang phase): An initial generation of $N$ candidates is generated randomly in the search space, similar to other evolutionary search algorithms.

Step 2: The cost function values of all the candidate solutions are computed.

Step 3 (Big Crunch phase): This phase comes as a convergence operator. Either the best-fit individual or the center of mass is chosen as the center point. The center of mass is calculated as:

$$x_c = \frac{\sum_{i=1}^{N} \frac{1}{f_i} x_i}{\sum_{i=1}^{N} \frac{1}{f_i}} \tag{12.18}$$

where $x_c$ is the position of the center of mass, $x_i$ is the position of the $i$th candidate, $f_i$ is the cost function value of the $i$th candidate, and $N$ is the population size.

Step 4: New candidates are calculated around the new point calculated in Step 3 by adding or subtracting a random number whose value decreases as the iterations elapse, which can be formalized as:

$$x^{new} = x_c + \frac{\gamma \rho (x_{max} - x_{min})}{k} \tag{12.19}$$

where $\gamma$ is a random number, $\rho$ is a parameter limiting the search space, $x_{min}$ and $x_{max}$ are the lower and upper limits, and $k$ is the iteration step.

Step 5: Return to Step 2 until the stopping criteria have been met.
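Steps 1–5 can be sketched for a one-dimensional search space as follows. This is a minimal illustration rather than the tuned optimizer used in the chapter: the function name, the default population size and iteration count, drawing $\gamma$ from a standard normal distribution, clipping candidates to the search bounds, and the small constant that avoids division by zero are all assumptions made for the example.

```python
import random
from typing import Callable


def bb_bc_minimize(cost: Callable[[float], float], x_min: float, x_max: float,
                   n_candidates: int = 30, n_iterations: int = 50,
                   rho: float = 1.0, seed: int = 0) -> float:
    """Minimize a positive 1-D cost function with a basic BB-BC loop."""
    rng = random.Random(seed)
    # Step 1 (Big Bang): uniform random initial population.
    population = [rng.uniform(x_min, x_max) for _ in range(n_candidates)]
    best = min(population, key=cost)
    for k in range(1, n_iterations + 1):
        # Step 2: cost of every candidate.
        costs = [cost(x) for x in population]
        # Step 3 (Big Crunch): center of mass weighted by 1/f_i, Eq. (12.18).
        # Assumption: a tiny constant guards against a zero cost.
        weights = [1.0 / (f + 1e-12) for f in costs]
        x_c = sum(w * x for w, x in zip(weights, population)) / sum(weights)
        best = min(best, x_c, key=cost)
        # Step 4: new candidates around x_c, spread shrinking as 1/k, Eq. (12.19).
        population = []
        for _ in range(n_candidates):
            gamma = rng.gauss(0.0, 1.0)  # assumption: standard normal gamma
            x_new = x_c + gamma * rho * (x_max - x_min) / k
            population.append(min(max(x_new, x_min), x_max))  # clip to bounds
        best = min([best] + population, key=cost)
        # Step 5: loop until the iteration budget is exhausted.
    return best


# Hypothetical cost function with its minimum at x = 3.
solution = bb_bc_minimize(lambda x: (x - 3.0) ** 2, -10.0, 10.0)
```

Because the spread of new candidates shrinks with the iteration counter, early iterations explore the whole interval while later ones refine the neighborhood of the center of mass.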

12.3 The Proposed Type-2 Fuzzy Logic Scenes Classification System for Football Video in Future Big Data

The type-2 fuzzy logic scenes classification system is built in three phases, which are shown in Fig. 12.2 and presented in the following subsections. First, a T1FLCS is built using the training data. Second, we upgrade this T1FLCS to a T2FLCS using T2FSs and upgrade the associated confidence and support to type-2. Third, the fuzzy sets and parameters are optimized with the BB–BC algorithm so that the system achieves better accuracy, lower computational cost, and a higher ability to handle uncertainties.

Fig. 12.2 The overview of the proposed type-2 fuzzy logic classiﬁcation system


12.3.1 Video Data and Feature Extraction

The scenes classification system uses the video scenes data for system training and testing. This data is selected from the original football match videos, which mainly consist of three significant scene types (center scenes, players' close-up scenes, and people scenes) plus some scenes from irregular camera angles. The center scenes represent the scenes in the center field with an overview from a long camera shot, which comprise most football video scenes. Figure 12.3a shows a representative frame from the center scenes. Figure 12.3b represents the people scenes, which include the coaches, audience, and people with no direct relationship to the football match. Figure 12.3c shows the players' close-up scenes, where we see a zoom shot of one or several players during the match.

Fig. 12.3 (a) Center field scene; (b) Players' close-up scene; (c) People Scene

To get the inputs the system needs, we first process the video frames and their histograms. A histogram is calculated for each video frame and represents its color distribution in graphic form. The histograms are not fed directly as inputs to the classification system; we first compute their differences and some mathematical characteristics before feeding these to the fuzzy classification system. First, we compute $\mathrm{dcs}$, the Chi-Square distance between color histograms, written as:

$$\mathrm{dcs}(H_a, H_b) = \sum_{i=0}^{I} \frac{(h_{ai} - h_{bi})^2}{h_{ai}} \tag{12.20}$$

where $H_a$ and $H_b$ are the histograms of two different frames in the same color channel. Generally, $H_a$ is the histogram of a standard scene image that represents the scene, and $H_b$ is the histogram of one frame from the training video. $I$ is the range of pixel levels (generally 256), and $i$ is the bin index. Thus, $h_{ai}$ and $h_{bi}$ are the $i$-th values of the two image color histograms $H_a$ and $H_b$, respectively. $\mathrm{dco}$ represents the correlation between the two histograms $H_a$ and $H_b$. It can be written as:

$$\mathrm{dco}(H_a, H_b) = \frac{\sum_{i=0}^{I} (h_{ai} - \bar{H}_a)(h_{bi} - \bar{H}_b)}{\sqrt{\sum_{i=0}^{I} (h_{ai} - \bar{H}_a)^2 \sum_{i=0}^{I} (h_{bi} - \bar{H}_b)^2}} \tag{12.21}$$

$\bar{H}_x$ represents the average color level of the histogram of image $x$; it can be written as:

$$\bar{H}_x = \frac{1}{N} \sum_{i=0}^{I} h_{xi} \tag{12.22}$$

where $N$ is the total number of histogram bins and $h_{xi}$ is the $i$-th value. $\bar{H}_a$ and $\bar{H}_b$ are the average color levels of the image histograms $H_a$ and $H_b$. $\mathrm{dbd}$ represents the Bhattacharyya distance, which is widely used in statistics to measure the difference between two discrete or continuous probability distributions, and can be written as:

$$\mathrm{dbd}(H_a, H_b) = \sqrt{1 - \frac{1}{\sqrt{\bar{H}_a \bar{H}_b N^2}} \sum_{i=0}^{I} \sqrt{h_{ai} h_{bi}}} \tag{12.23}$$

where $i$, $I$, $N$, $\bar{H}_a$, $\bar{H}_b$, $h_{ai}$, and $h_{bi}$ are the same as in Eqs. (12.20)–(12.22).


$\mathrm{din}$ represents the intersection, which denotes the similarity between the two images. $\mathrm{din}$ can be written as:

$$\mathrm{din}(H_a, H_b) = \sum_{i=0}^{I} \min(h_{ai}, h_{bi}) \tag{12.24}$$

To train and test the classification system on sports video scenes, the inputs of the system include the Chi-Square distance (dcs), correlation (dco), intersection (din), and Bhattacharyya distance (dbd) for each of the R, G, and B channels. Thus, the primary input vector $V_T^i$ of the scene classification system at the $i$-th frame can be written as:

$$V_T^i = \left( dcoB_i, dcoG_i, dcoR_i, dcsB_i, dcsG_i, dcsR_i, dinB_i, dinG_i, dinR_i, dbdB_i, dbdG_i, dbdR_i;\ O^i \right) \tag{12.25}$$

where $O^i$ is the class label of the scene at frame $i$.
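The four histogram measures of Eqs. (12.20)–(12.24) and the 12-element input part of Eq. (12.25) can be sketched in plain Python as follows. The sketch assumes that histograms are already available as equal-length lists of bin counts; skipping bins where $h_{ai} = 0$ in the Chi-Square distance and returning zero correlation for flat histograms are assumptions made to avoid division by zero, and all function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def chi_square(ha: Sequence[float], hb: Sequence[float]) -> float:
    """Eq. (12.20). Assumption: bins with ha == 0 are skipped."""
    return sum((a - b) ** 2 / a for a, b in zip(ha, hb) if a != 0)


def correlation(ha: Sequence[float], hb: Sequence[float]) -> float:
    """Eq. (12.21), using the bin averages of Eq. (12.22)."""
    mean_a, mean_b = sum(ha) / len(ha), sum(hb) / len(hb)
    num = sum((a - mean_a) * (b - mean_b) for a, b in zip(ha, hb))
    den = math.sqrt(sum((a - mean_a) ** 2 for a in ha)
                    * sum((b - mean_b) ** 2 for b in hb))
    return num / den if den > 0 else 0.0  # assumption for flat histograms


def intersection(ha: Sequence[float], hb: Sequence[float]) -> float:
    """Eq. (12.24)."""
    return sum(min(a, b) for a, b in zip(ha, hb))


def bhattacharyya(ha: Sequence[float], hb: Sequence[float]) -> float:
    """Eq. (12.23); the inner term is clamped at zero against rounding."""
    n = len(ha)
    mean_a, mean_b = sum(ha) / n, sum(hb) / n
    overlap = sum(math.sqrt(a * b) for a, b in zip(ha, hb))
    return math.sqrt(max(0.0, 1.0 - overlap / math.sqrt(mean_a * mean_b * n * n)))


def frame_features(ref: Dict[str, Sequence[float]],
                   frame: Dict[str, Sequence[float]]) -> List[float]:
    """Input part of Eq. (12.25): dco, dcs, din, dbd for B, G, R in turn."""
    features = []
    for metric in (correlation, chi_square, intersection, bhattacharyya):
        for channel in ("B", "G", "R"):
            features.append(metric(ref[channel], frame[channel]))
    return features


# Hypothetical 4-bin histograms; real frames would use 256 bins per channel.
h = [1.0, 2.0, 3.0, 4.0]
h_rev = [4.0, 3.0, 2.0, 1.0]
same = frame_features({"B": h, "G": h, "R": h}, {"B": h, "G": h, "R": h})
```

Identical histograms give a correlation of 1 and zero Chi-Square and Bhattacharyya distances, while a reversed histogram gives a correlation of -1, which is a quick sanity check on any implementation of these measures.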

12.3.2 Type-1 Fuzzy Sets and Classification System

Fuzzy sets (FSs) and their membership functions (MFs) are the core of any fuzzy logic system. Traditional MFs are generally created from prior knowledge. In this chapter, we employ the Fuzzy C-means (FCM) clustering algorithm to calculate and output the needed MFs of the FSs. Figure 12.4a shows how we approximate the raw function generated by the FCM with a Gaussian type-1 fuzzy set. Figure 12.4b shows the specific MFs of the FS dcoB. The parameters of the type-1 fuzzy sets are obtained using fuzzy C-means clustering. The data is clustered into at most three Gaussian fuzzy sets: LOW (red), MEDIUM (green), and HIGH (yellow).

Fig. 12.4 (a) Approximating the raw function generated by FCM with a Gaussian Type-1 Fuzzy Set; (b) Type-1 fuzzy sets membership functions for dcoB


Fig. 12.5 The progress of Type-1 fuzzy logic classiﬁcation system build

According to the FCM clustering results, we have extracted the type-1 fuzzy set Gaussian membership functions (Fig. 12.4a), which can be written as [15]:

$$N(m, \sigma; x) = \exp\left( -\frac{1}{2} \left( \frac{x - m}{\sigma} \right)^2 \right) \tag{12.26}$$

where $m$ is the mean value and $\sigma$ is the standard deviation of the type-1 fuzzy set. For each type-1 fuzzy set, the Gaussian MF has a mean value $m_i^k$ and a standard deviation $\sigma_i^k$, where $k = 1, \ldots, p$ indexes the inputs ($p = 12$) and $i = 1, \ldots, 3$ indexes the fuzzy sets of each input. The type-1 fuzzy logic classification system is trained on pre-processed video data using the fuzzy rule generation approach described in [15]. The type-2 fuzzy classification system is then built as an upgrade of the type-1 fuzzy system. Hence, the first step of the system build is to create the required type-1 fuzzy sets and the type-1 fuzzy logic classification system. Figure 12.5 shows the specific progress of the type-1 system build, which extends Fig. 12.2.

12.3.3 Type-2 Fuzzy Sets Creation and T2FLCS

To obtain the interval type-2 fuzzy set from the extracted type-1 fuzzy set, we blur the standard deviation $\sigma_i^k$ of the type-1 fuzzy set equally upward and downward by a distance $d_i^k$ to generate the upper and lower standard deviations ($\bar{\sigma}_i^k$ and $\underline{\sigma}_i^k$, respectively) of the generated Gaussian type-2 fuzzy set with uncertain standard deviation (Fig. 12.6b). The mean value $m_i^k$ of the generated type-2 fuzzy set is the same as that of the corresponding type-1 fuzzy set. Thus, the upper ($\bar{\mu}_A(x)$) and lower ($\underline{\mu}_A(x)$) MFs of the generated type-2 fuzzy set can be written as follows [15]:

$$\bar{\mu}_A(x) = N(m_i^k, \bar{\sigma}_i^k; x) \tag{12.27}$$

$$\underline{\mu}_A(x) = N(m_i^k, \underline{\sigma}_i^k; x) \tag{12.28}$$

Fig. 12.6 (a) Type-1 fuzzy sets; (b) Type-2 fuzzy sets

Fig. 12.7 The progress of Type-2 fuzzy logic classification system build

The value $d_i^k$ for each interval type-2 fuzzy set is optimized via the BB–BC algorithm described in Sect. 12.2. The type-2 fuzzy logic classification system employs the type-2 fuzzy sets and a type-2 rule base learnt with the associated confidence and support described in Sect. 12.2.2. The specific progress of the system build is shown in Fig. 12.7.
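The Gaussian membership function of Eq. (12.26) and its interval type-2 extension in Eqs. (12.27) and (12.28) can be sketched as follows. The sketch assumes that blurring the standard deviation "equally" means using $\sigma + d$ for the upper MF and $\sigma - d$ for the lower MF; the function names and the numeric values are hypothetical.

```python
import math
from typing import Tuple


def gaussian_mf(x: float, m: float, sigma: float) -> float:
    """Type-1 Gaussian membership function, Eq. (12.26)."""
    return math.exp(-0.5 * ((x - m) / sigma) ** 2)


def interval_type2_mf(x: float, m: float, sigma: float,
                      d: float) -> Tuple[float, float]:
    """Upper and lower memberships, Eqs. (12.27) and (12.28).

    Assumption: the upper MF uses sigma + d and the lower MF uses sigma - d.
    """
    if not 0.0 <= d < sigma:
        raise ValueError("d must satisfy 0 <= d < sigma")
    upper = gaussian_mf(x, m, sigma + d)
    lower = gaussian_mf(x, m, sigma - d)
    return upper, lower


# Hypothetical fuzzy set with mean 0.5, standard deviation 0.1, blur 0.02.
u_center, l_center = interval_type2_mf(0.5, 0.5, 0.1, 0.02)
u_off, l_off = interval_type2_mf(0.65, 0.5, 0.1, 0.02)
```

At the mean both bounds equal one, and away from the mean the wider Gaussian lies above the narrower one, so the gap between the two values is the footprint of uncertainty at that input.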

12.4 Optimization of Type-2 Fuzzy Logic Classification System

In order to obtain a high-performance classification system that can process the video data in real time, we have to optimize our type-2 fuzzy logic classification system by reducing the size of the rule base and choosing optimized MFs.

12.4.1 Membership Functions Optimization

We encode $d_i^k$, the feature parameters of the type-2 MFs, into a form of a population to apply BB–BC. In order to construct the type-2 MFs in our scene classification system, we use $d_i^k$, where $k = 1, \ldots, p$ indexes the antecedents ($p = 12$) and $i$ indexes the fuzzy sets representing each input. The population of BB–BC is shown in Fig. 12.8, and the selected type-2 fuzzy sets are shown in Fig. 12.9.

12.4.2 Rule Base Optimization and Similarity

We also optimize the rule base of the fuzzy logic classification system in order to allow the system to maintain a high classification performance while minimizing the size of the rule base. The parameters of the rule base are encoded into a form of a population, which can be represented as shown in Fig. 12.10. As shown in Fig. 12.10, $m_j^r$ are the antecedents and $o_k^r$ are the consequents of each rule, where $j = 1, \ldots, p$ with $p = 12$ the number of antecedents; $k = 1, \ldots, q$ with $q = 3$ the number of output classes; and $r = 1, \ldots, R$, where $R$ is the number of rules to be tuned. However, the values describing the rule base are discrete integers, while the original BB–BC supports continuous values. Thus, instead of Eq. (12.19), the following equation is used in the BB–BC paradigm to round off the continuous values to the nearest discrete integer values modeling the indexes of the fuzzy sets of the antecedents or consequents:

$$D^{new} = D_c + \operatorname{round}\left[ \frac{\gamma \rho (D_{max} - D_{min})}{k} \right] \tag{12.29}$$

Fig. 12.8 The population representation for the parameters of the type-2 fuzzy sets


Fig. 12.9 (a) The type-1 fuzzy set and the membership functions of dcoG; (b) The Optimized Type-2 fuzzy set and membership functions of dcoG

Fig. 12.10 The population representation for the parameters of the rule base

where $D_c$ is the fittest individual, $\gamma$ is a random number, $\rho$ is a parameter limiting the search space, $D_{min}$ and $D_{max}$ are the lower and upper bounds, and $k$ is the iteration step [19]. As the BB–BC will result in a reduction of the rule base, there will be situations where the input vector does not fire any rule in the rule base. In this situation, we employ a similarity metric, which enables the production of a classification output from similar rules in the rule base [19]. In order to calculate the similarity in the antecedent parts between the rule generated by the input $x^{(t)}$ and each rule $R_q$, a distance function is defined as $D\left(A_i^{k*}(x_i^{(t)}), A_i^{(q)}\right)$, where $A_i^{k*}(x_i^{(t)})$ represents the fuzzy set matched by an input $x_i^{(t)}$ and $A_i^{(q)}$ represents the antecedent fuzzy set for rule $q$. For example, in our application, we have five fuzzy sets for each input: Very Low, Low, Medium, High, Very High; each label can be encoded as an integer, whereby Very Low is 1, Low is 2, and so on up to Very High, which is 5. So D(High, High) = 0, D(High, Low) = 2, and D(Medium, Very High) = 2. With this aim, we define a distance that finds the difference between the coded linguistic labels. Using this distance, the similarity between the rule created by the input $x^{(t)}$ and each rule $R_q$ is calculated as [19]:

$$S\left(x^{(t)}, R_q\right) = \frac{\sum_{i=1}^{n} \left( 1 - \frac{D\left(A_i^{k*}(x_i^{(t)}), A_i^{(q)}\right)}{V - 1} \right)}{n} \tag{12.30}$$

where $S(x^{(t)}, R_q) \in [0, 1]$, $V$ is the number of fuzzy sets, and $i = 1, \ldots, n$, where $n$ is the number of inputs, which is the number of antecedents of the rule [19].
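The label distance and the similarity of Eq. (12.30) can be sketched as follows, using the five-label encoding from the example in the text. The function names and the idea of returning the most similar rule with `max` are illustrative assumptions; the chapter does not specify how ties between equally similar rules are resolved.

```python
from typing import Dict, List, Sequence

# Integer encoding of the linguistic labels, as in the example in the text.
LABELS: Dict[str, int] = {
    "Very Low": 1, "Low": 2, "Medium": 3, "High": 4, "Very High": 5,
}


def label_distance(a: str, b: str) -> int:
    """Distance D between two coded linguistic labels."""
    return abs(LABELS[a] - LABELS[b])


def rule_similarity(input_labels: Sequence[str], rule_labels: Sequence[str],
                    n_sets: int = 5) -> float:
    """Eq. (12.30): mean of 1 - D / (V - 1) over the n antecedents."""
    n = len(rule_labels)
    return sum(1.0 - label_distance(a, b) / (n_sets - 1)
               for a, b in zip(input_labels, rule_labels)) / n


def most_similar_rule(input_labels: Sequence[str],
                      rule_base: List[Sequence[str]]) -> Sequence[str]:
    """Pick the rule with the highest similarity (first one wins on ties)."""
    return max(rule_base, key=lambda rule: rule_similarity(input_labels, rule))


# Hypothetical two-antecedent rules and an input that fires none of them.
rules = [["Low", "Very High"], ["High", "High"]]
closest = most_similar_rule(["High", "Medium"], rules)
```

Here the input (High, Medium) is at distance 2 from each antecedent of the first rule but only at distances 0 and 1 from the second, so the second rule is selected.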

12.5 Experiments and Results

We performed various experiments using data selected from over 20 football videos from British, Spanish, German, and French leagues [20]. All videos had a resolution of 1280 × 720 pixels with frame rates of up to 25 frames/second. The video data totals over 800,000 frames, which were split into three parts at random: approximately 15% of the matches were held out as out-of-sample data in order to test the flexibility and capability of the classification system on unseen video data, and the remaining 85% of the frames were randomly divided into training data (70%), used to build the parameters and rule base of the system, and testing data (15%), used to evaluate the classification accuracy of the system. We extracted features using the OpenCV library to capture histograms and compute distances. This work extends our earlier work reported in [12, 13]. In Fig. 12.11, we present some screenshots of our system operating on scene classification, showing the predicted class label for "Center" scene detection (a) and "People" scene detection (b). The experimental results are described in the tables below. Table 12.1 shows the comparison group from a Back Propagation Neural Network (BP-NN) classification system. Table 12.2 shows another comparison over the testing data and out-of-sample testing data of the proposed scenes classification system using T1FLCS. The results in Table 12.2 include the full rule base (296,028 rules) with no BB–BC tuning (bold font) as well as groups of reduced rule bases obtained with BB–BC optimization. As can be seen from Table 12.2, the proposed T1FLCS outperforms the BP-NN from Table 12.1 on most of the average and individual class accuracies.
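The three-way data split described at the start of this section can be sketched as follows. This is only an illustration of the stated proportions: the function name, the dictionary layout mapping a match identifier to its frames, and the fixed seed are assumptions, and the chapter does not describe how its own split was implemented.

```python
import random
from typing import Dict, List, Tuple


def split_dataset(frames_by_match: Dict[str, List[str]],
                  seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Return (training, testing, out_of_sample) frame lists.

    About 15% of the matches are held out whole as out-of-sample data;
    the remaining frames are shuffled and split 70:15 into training and
    testing, i.e. 70/85 and 15/85 of the remaining frames.
    """
    rng = random.Random(seed)
    matches = sorted(frames_by_match)
    rng.shuffle(matches)
    n_out = max(1, round(0.15 * len(matches)))
    out_of_sample = [f for m in matches[:n_out] for f in frames_by_match[m]]
    remaining = [f for m in matches[n_out:] for f in frames_by_match[m]]
    rng.shuffle(remaining)
    n_train = round(len(remaining) * 70 / 85)
    return remaining[:n_train], remaining[n_train:], out_of_sample


# Hypothetical dataset: 20 matches with 100 frame identifiers each.
data = {f"match{m:02d}": [f"match{m:02d}_frame{i}" for i in range(100)]
        for m in range(20)}
training, testing, out_of_sample = split_dataset(data)
```

Holding out whole matches, rather than random frames, keeps the out-of-sample frames free of near-duplicates of training frames from the same broadcast.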

12

A Hybrid Fuzzy Football Scenes Classiﬁcation System for Big Video Data


Fig. 12.11 (a) Scene classiﬁcation system in “Center Field” scene detection; (b) Scene classiﬁcation system in “People” scene detection

Table 12.1 BP-NN classification system on scenes classification

| BP-NN | Center (%) | People (%) | Close-up (%) | Average (%) |
|---|---|---|---|---|
| Training | 73.1645 | 68.8574 | 70.7103 | 71.9346 |
| Testing | 61.4736 | 63.6586 | 73.4074 | 64.1461 |
| Out-of-sample testing | 64.5819 | 47.7024 | 39.5120 | 56.6065 |

As can be seen from Tables 12.2 and 12.3, the T2FLCS outperformed the T1FLCS on average accuracy in most of the testing and out-of-range testing groups. For the full rule base, the T2FLCS achieved an average accuracy uplift of about 0.4 and 0.01 percentage points over the T1FLCS in the testing and out-of-range testing, respectively. However, the T2FLCS performs better as the rule base decreases from the full rule base to 50 rules. As can be seen, the T2FLCSs can give a very close performance with only 1000 rules (thus enabling real-time operation and maximum interpretability) as opposed to using the full rule base of 296,028 rules


S. Wei and H. Hagras

Table 12.2 Type-1 fuzzy logic classification system on scenes classification

| Type-1 | Center (%) | People (%) | Close-up (%) | Average (%) |
|---|---|---|---|---|
| Training (full rule base) | 95.3804 | 85.3642 | 74.1028 | 90.1855 |
| Testing (full rule base) | 93.7818 | 83.5201 | 70.6501 | 87.1553 |
| Testing, 1000 rules | 93.3201 | 78.2921 | 65.5671 | 84.7178 |
| Testing, 200 rules | 80.2350 | 64.9031 | 57.2034 | 72.4097 |
| Testing, 100 rules | 70.0562 | 41.3084 | 35.7032 | 59.9732 |
| Out-of-range data testing (full rule base) | 84.5601 | 56.9450 | 32.6095 | 69.6532 |
| Out-of-range data testing, 1000 rules | 82.6507 | 37.2615 | 25.4658 | 62.9436 |
| Out-of-range data testing, 200 rules | 75.7023 | 39.2953 | 27.2059 | 59.4484 |
| Out-of-range data testing, 100 rules | 58.6911 | 33.4255 | 23.4029 | 47.1337 |

Table 12.3 Type-2 fuzzy logic classification system on scenes classification

| Type-2 | Center (%) | People (%) | Close-up (%) | Average (%) |
|---|---|---|---|---|
| Training (full rule base) | 94.6135 | 87.1903 | 72.3802 | 89.7939 |
| Testing (full rule base) | 93.9607 | 85.6233 | 69.5595 | 87.5675 |
| Testing, 1000 rules | 92.6501 | 81.3656 | 69.1742 | 85.7162 |
| Testing, 200 rules | 82.2314 | 66.4139 | 58.8425 | 74.2252 |
| Testing, 100 rules | 69.3527 | 42.0644 | 38.6799 | 57.2824 |
| Out-of-range data testing (full rule base) | 83.7021 | 54.7832 | 38.5502 | 69.6641 |
| Out-of-range data testing, 1000 rules | 83.2684 | 44.5811 | 40.2099 | 67.4372 |
| Out-of-range data testing, 200 rules | 77.6121 | 43.6174 | 37.8850 | 63.3797 |
| Out-of-range data testing, 100 rules | 58.7037 | 39.2567 | 31.6022 | 49.8181 |

which cannot enable interpretability or real-time performance. With the 1000-rule base, the IT2FLC outperforms the T1FLC by about 1 percentage point on the testing data and by about 4.5 percentage points on the out-of-sample data (average accuracy in Tables 12.2 and 12.3), which supports the ability of the IT2FLC to handle the uncertainties faced and to produce resilient performance under high uncertainty levels.

12.6 Conclusion

In this chapter, we presented a video scene classification system for Big data football videos based on an optimized type-2 fuzzy logic system. In order to automatically obtain the optimized parameters of the type-2 fuzzy sets and decrease the size of the rule base of the T2FLCS (to increase the system interpretability and allow for real-time processing), we employed an optimization approach based on the BB-BC algorithm. The results of our football video classification experiments over samples from the leagues of different countries show that the proposed T2FLCS outperforms the T1FLCS in scene classification accuracy on testing data


and out-of-range data. In ongoing research, we intend to extend the proposed system with further classification systems to handle the high uncertainty levels present in more complicated classification tasks such as event detection and object summarization. We also aim to extend the video activity detection system to more functions within sports videos, so that we can move toward real-time video classification and summarization.

References

1. Aziza, B.: Predictions for Big Data. (April 2013)
2. Keazor, H., Wübbena, T.: Rewind, play, fast forward: the past, present and future of the music video. (2015)
3. Ekin, A., Tekalp, A., Mehrotra, R.: Automatic football video analysis and summarization. IEEE Trans. Image Process. 12(7), 796–807 (2003)
4. Dai, J., Duan, L., Tong, X., Xu, C.: Replay scene classification in football video using web broadcast text. In: IEEE International Conference on Multimedia and Expo (ICME 2005), July 6–8, Amsterdam, pp. 1098–1101 (2005)
5. Alipour, S., Oskouie, P., Eftekhari-Moghadam, A.-M.: Bayesian belief based tactic analysis of attack events in broadcast football video. In: Proceedings of the International Conference on Informatics, Electronics & Vision (ICIEV), Dhaka, pp. 612–617 (2012)
6. Bagheri-Khaligh, A., Raziperchikolaei, R., Moghaddam, M.E.: A new method for shot classification in football sports video based on SVM classifier. In: Proceedings of the IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI), 22–24 April, Santa Fe, NM, USA, pp. 109–112 (2012)
7. Boutell, M.R., Luo, J., Shen, X., Brown, C.: Learning multi-label scene classification. Pattern Recogn. 37(9), 1757–1771 (2004)
8. Hosseini, M.-S., Eftekhari-Moghadam, A.-M.: Fuzzy rule-based reasoning approach for event detection and annotation of broadcast football video. Appl. Soft Comput. 13(2), 846–866 (2013)
9. Wang, H., Xu, Z., Pedrycz, W.: An overview on the roles of fuzzy set techniques in big data processing: Trends, challenges and opportunities. Knowledge-Based Systems. (8 July 2016)
10. Adams, J.U.: Big hopes for big data. Nature. 527, S108–S109 (2015)
11. Blumenstock, J., Cadamuro, G., On, R.: Predicting poverty and wealth from mobile phone metadata. Science. 350, 1073–1076 (2015)
12. Song, W., Hagras, H.: A big-bang big-crunch Type-2 Fuzzy Logic based system for football video scene classification. In: Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Vancouver, BC, Canada (2016)
13. Song, W., Hagras, H.: A type-2 fuzzy logic system for event detection in football videos. In: Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), Naples, Italy (2017)
14. Hagras, H.: A hierarchical type-2 fuzzy logic control architecture for autonomous mobile robots. IEEE Trans. Fuzzy Syst. 12(4), 524–539 (2004)
15. Mendel, J.: Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions. Prentice-Hall, Upper Saddle River, NJ (2001)
16. Ishibuchi, H., Yamamoto, T.: Rule weight specification in fuzzy rule-based classification systems. IEEE Trans. Fuzzy Syst. 13(4), 428–435 (2005)
17. Erol, O.K., Eksin, I.: A new optimization method: big bang–big crunch. Adv. Eng. Softw. 37(2), 106–111 (2006)
18. Kumbasar, T., Eksin, I., Guzelkaya, M., Yesil, E.: Adaptive fuzzy model based inverse controller design using BB-BC optimization algorithm. Expert Syst. Appl. 38(10), 12356–12364 (2011)
19. Garcia-Valverde, T., Garcia-Sola, A., Hagras, H.: A fuzzy logic-based system for indoor localization using WiFi in ambient intelligent environments. IEEE Trans. Fuzzy Syst. 21(4), 702–718 (November, 2012)
20. Football Videos [online Resources]. http://www.jczqw.cc. Accessed 25/09/2017

Chapter 13

Multimodal Big Data Fusion for Traffic Congestion Prediction

Taiwo Adetiloye and Anjali Awasthi

Abstract Traffic congestion is a widely occurring phenomenon characterized by slower vehicle speeds, increased vehicular queuing and, sometimes, a complete paralysis of the traffic network. Traffic congestion prediction requires analyzing an enormous amount of traffic data across multiple modalities, including traffic cameras, GPS or location information, Twitter, and vehicular sensors. We propose a Big data fusion framework based on homogeneous and heterogeneous data for traffic congestion prediction. The homogeneous data fusion model fuses data of the same type (quantitative) estimated using machine-learning algorithms (back propagation neural network, random forest, and deep belief network) and applies an extended Kalman filter for the stochastic filtering of the non-linear noise while reducing the estimation and measurement errors. In the heterogeneous fusion model, we extend the homogeneous model by integrating it with qualitative data, i.e. traffic tweet information from the Twitter data source. The results of the extended Kalman filter and the sentiment analysis are combined using Mamdani Fuzzy Rule Inferencing for heterogeneous traffic data fusion. The proposed approaches are demonstrated through application to Genetec data.

13.1 Introduction

Lomax et al. [1] define traffic congestion as the travel time or delay incurred in excess of that in free traffic flow. According to Jain et al. [2], traffic congestion is characterized not only by massive delays but also by the enormous cost incurred through wasted fuel and lost money, particularly in cities of developing countries but also in almost all other cities around the world. Complex, non-linear characteristics with cluster formation and shockwave propagation that deviate from the laws of mechanics

T. Adetiloye (*) · A. Awasthi
Concordia Institute for Information Systems Engineering, Concordia University, Montréal, QC, Canada

© Springer Nature Switzerland AG 2019
K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_13


are widely observed in traffic. To address the problems of traffic congestion, building more infrastructure is not enough; better traffic information prediction systems that monitor and route traffic efficiently in real time are also needed. Wang et al. [3] raise concerns over serious traffic congestion causing great economic loss and environmental problems. This creates an urgent need for travel path recommendations that do not depend on a single data source, which may fail systemically when there is no backup or alternative data plan. In fact, high congestion may persist for hours when drivers miss their travel path because of faulty traffic information equipment, such as sensors that sometimes malfunction. They further argue that less reliance on a single data source is the way to address this ambiguity, which necessitates the fusion of various traffic data.

Over the last decade, traffic data fusion has increasingly been adopted for improved traffic congestion prediction due to its many advantages, which include reduced ambiguity, increased robustness, increased confidence factor, enhanced spatial and temporal coverage, and decreased cost (Anand et al. [4]; Dailey et al. [5]; Bachmann [6]). Angela Aida et al. [7] conducted an experiment in Tanzania using floating car data collected and processed by a centralized server. The information gathered is communicated to road users via several interfaces, including web, radio, television, and mobile phone, and the data fusion is performed via Mamdani Fuzzy Rule Inferencing (MFRI). Kim and Kang [8] propose an adaptive navigation system for scalable route determination using the extended Kalman filter (EKF), which had better accuracy than traditional prediction methods. Peng et al. [9] apply the Kalman filter (KF) method to fuse information from urban road sections in order to obtain speed information without GPS sampling signals. Anand et al. [4] use linear and non-linear Kalman filtering techniques for accurate estimation and prediction of traffic parameters in a data fusion process with two data sources, namely flow data from video and travel time data from GPS. Bin et al. [10] propose a GPS-integrated navigation system for multi-data fusion based on a decentralized architecture. Chu et al. [11] apply a KF model using simulated loop detector and probe vehicle data to estimate travel time. El Faouzi et al. [12] highlight Kalman filtering and other data fusion techniques utilizing Bayesian inference, Dempster–Shafer evidential reasoning, artificial neural networks, and fuzzy logic rule-based membership as widely accepted in the area.

The experimental results from the above studies are promising; however, the fusion process can suffer from aggregated latency due to increased bandwidth demands on the system, noise, and longer runtime. Our multimodal traffic data fusion framework addresses some of these concerns using a distributed traffic data fusion architecture. It uses the EKF for the homogeneous data fusion and the MFRI for good interpretability of the heterogeneous data fusion for traffic flow prediction.

The rest of the chapter is organized as follows. Section 13.2 presents the related work. Section 13.3 introduces our models for traffic data fusion. Section 13.4 discusses our results. Section 13.5 draws the conclusion and future work.

13.2 Related Work

In this section, we present related works in Big data fusion for trafﬁc congestion prediction with regard to their data sources, data fusion algorithms, and architectures.

13.2.1 Multimodal Data for Traffic Congestion

Traffic congestion data can be multimodal in nature (e.g. numeric, qualitative, images). The commonly used sources for collecting traffic congestion information include GPS, traffic simulation models, sensors, social media such as Twitter, and probe vehicles. Lwin and Naing [13] estimated traffic states from road GPS trajectories collected from mobile phones on vehicles using the Hidden Markov Model (HMM). Necula [14] studied traffic patterns on street segments based on GPS data. Necula [15] performed short-term dynamic traffic prediction on fused data from GPS and traffic simulation. Kaklij [16] performed data mining of GPS data using clustering and classification algorithms. Ito and Hiramoto [17] proposed a process simulation-based approach for the Electronic Toll Collection (ETC) system and expressway traffic problems at toll plazas. Kim and Suh [18] used VISSIM, a microscopic multimodal traffic flow simulation package, to analyze the difference between standard traffic flow inputs and the trip chain method in overcapacity conditions and the sensitivity of the model to this parameter. Metkari et al. [19] developed a simulation model for heterogeneous traffic with no lane discipline. He [20] analyzed traffic congestion based on spatiotemporal simulation. Nellore et al. [21] explored wireless sensor networks for traffic congestion evaluation and control. Aslam et al. [22] introduced a congestion-aware traffic routing system. Mai and Hranac [23] used Twitter data to examine traffic. Elsafoury [24] proposed the use of part-of-speech (POS) tagging to analyze traffic information from microblogging data sources like Twitter. Chen et al. [25] introduced road traffic congestion monitoring in social media with Hinge-Loss Markov Fields. Hofleitner et al. [26] proposed a dynamic Bayesian network to perform real-time traffic estimation with streaming data from a probe vehicle. Wang et al. [27] developed a Hidden Markov Model for urban vehicle estimation using probe data from a floating car.

13.2.2 Data Fusion Algorithms for Traffic Congestion Estimation

Data fusion involves the consolidation of various unstructured, structured, and semi-structured data. Using data fusion has the benefit of a larger "degree of freedom" within the internal state, which contributes to improved estimation on observed measurements. According to Klein [28], data fusion involves the following processes:

• Level 0: Data alignment
• Level 1: Entity assessment (e.g. signal/feature/object), i.e. tracking and object detection/recognition/identification
• Level 2: Situation assessment
• Level 3: Impact assessment
• Level 4: Process refinement (i.e. sensor management)
• Level 5: User refinement

Figure 13.1 presents a tree diagram of this data fusion framework. Data fusion often involves the use of powerful algorithms such as Bayesian networks, the Kalman filter (KF), and Dempster–Shafer theory. In Dempster–Shafer theory, each state equation or observation is considered a special case of a linear belief function, and the KF is a special case of combining linear belief functions on a join-tree or Markov tree. Additional approaches include belief filters, which add Bayes or evidential updates to the state equations (Kalman Filter [30]). Traffic data fusion involves the fusion of multimodal data from multiple information sources such as inductive loop vehicle detectors, video detectors, GPS floating cars [31, 32], traffic simulation [33], and Twitter [23, 24]. Table 13.1 summarizes commonly used approaches for traffic data fusion.

Fig. 13.1 Data fusion framework (adapted from the JDL, Data Fusion Lexicon [29]). Diagram elements: sources (sensors, databases, knowledge) and a user interface connected through an information bus to the fusion domain (Level 0 source refinement, Level 1 object refinement, Level 2 situation assessment, Level 3 threat assessment, Level 4 process refinement, Level 5 user refinement) and to a database management system with a fusion database and a support database


Table 13.1 Data fusion algorithms for traffic congestion estimation

| Approach | Author | Title | Brief summary |
|---|---|---|---|
| (Discrete-time) Kalman Filter | Yang [34] | Travel time prediction using the GPS test vehicle and Kalman Filtering techniques | A recursive, discrete-time KF is used on historic and real-time data to improve performance monitoring, evaluation, planning, and for efficient management of special-events-related traffic flow. |
| (Discrete-time) Kalman Filter | Anand et al. [35] | Data fusion-based traffic density estimation and prediction | They used KF to fuse spatial and location-based data for the estimation of traffic density to future time intervals using a time-series regression model. |
| Extended Kalman Filter | Kim and Kang [8] | Congestion avoidance algorithm using Extended Kalman Filter | They used the EKF algorithm for accurate, scalable, and adaptable traffic flow prediction for near-future congestion based on historical and real-time traffic information. The user's route preferences are improved using the adaptive traffic route conditions with scalable routing services. |
| Extended Kalman Filter | Guo et al. [36] | Kalman Filter approach to speed estimation using single loop detector measurements under congested conditions | Traffic data from single loop and dual loop stations are fused while employing EKF to relate the ratio of flow rate over occupancy and the speed. This resulted in more accurate estimation than the traditional g-factor approach. |
| Bayesian and Neural Network | Pamula and Krol [37] | The traffic flow prediction using Bayesian and neural networks | Comparative evaluation of the performance of Bayesian and neural networks on short-term traffic congestion models as well as comparing with a Bayesian dynamic model. The study showed that there is prospect in the use of artificial intelligence methods for forecasting traffic congestion and incorporating them into modules of intelligent traffic management systems. |

13.2.3 Data Fusion Architecture

Data fusion architectures can be broadly classified into centralized, decentralized, distributed, and hierarchical architectures [38]. In the centralized case, the fusion node is located in the central processor, which collects all the raw data measurements from the sources and uses them to send instructions to the respective sensors. The decentralized case consists of a network of nodes in which each node has its own central processor. The distributed architecture is an extension of the centralized architecture in which measurements from each source node are processed independently before the information is sent to the fusion node. Lastly, the hierarchical architecture combines decentralized and distributed nodes in schemes in which the data fusion process is performed at different levels of the hierarchy.

13.3 Proposed Models for Traffic Data Fusion

We propose two categories of models for traffic data fusion based on a distributed architecture:

1. Homogeneous traffic data fusion
2. Heterogeneous data fusion

13.3.1 Homogeneous Traffic Data Fusion

Figure 13.2 illustrates the homogeneous distributed data fusion for traffic flow prediction. We used traffic data obtained from the Genetec blufaxcloud travel-time system engine (GBTTSE) [39], which provides records of daily motorway traffic in Montreal. Figure 13.3 shows the map of the Montreal motorway network under consideration, with each node on the map being either a start or an end node. The traffic data is massive in size, with over 100,000 data rows when collected 24 h a day over a month. The traffic input variables consist of $t$, $t-1$, $t_{\text{hist}}$, and $t_{\text{hist}}+1$, where $t$ is the current travel time (mins) and $t_{\text{hist}}$ is the historic travel time (mins). The travel time is the time taken for a vehicle to travel between two nodes.

Fig. 13.2 Homogeneous distributed data fusion for traffic flow prediction

Fig. 13.3 Map of the Montreal motorway network

Three data mining algorithms are applied to predict traffic congestion using the historical data collected from the traffic API: DBN, RF, and NN. The DBN is a multilayered probabilistic generative model. Our input variables are assigned to its input layer, and its hidden layers are built from simple learning modules called Restricted Boltzmann Machines, where the hidden layer of each subnetwork serves as the visible layer for the next. More general details of how the DBN works can be found in (Hinton [40]; Bengio et al. [41]; Hinton et al. [42]). The RF, introduced by Breiman [43], is an ensemble of decision trees. It takes the input features and applies the rules of each randomly created decision tree to predict the outcome, combining the trees by majority vote; at each split, the best split is chosen from among a random subset of the predictors. The NN is a simplified case of the DBN (see Hecht-Nielsen [44]), in that it has a single hidden layer. To learn from the input features, it uses backpropagation with gradient descent to minimize its loss function, with a sigmoid activation function.

Fig. 13.4 Cyclic operation of the EKF combining the high-level diagram (adapted from: Welch and Bishop [46])

Next, the results from the three data mining algorithms are fused using the EKF. The EKF can be used where the process to be estimated has nonlinear characteristics. It serves as an effective tool for nonlinear state estimation, navigation systems, and GPS (Eric [45]), and it linearizes about the current mean and covariance. As illustrated in Fig. 13.4, the method involves estimating a process with the

state $x \in \mathbb{R}^n$ over a discrete state space based on a non-linear stochastic difference equation:

$$x_k = f(x_{k-1}, u_k) \tag{13.1}$$

for a non-linear function, $f$, which relates the state at the previous time step to the state at the current time step. Let us also assume that there is an observation $z$ at time step $k$ of the state $x$, with non-linear measurement $z \in \mathbb{R}^m$, that is:

$$z_k = h(x_k, v_k) \tag{13.2}$$

The non-linear function, $h$, relates the state $x_k$ to the measurement $z_k$. In practice, it is reasonable to assume that $w_k = 0$ and $v_k = 0$, since the individual values of the noise at each time step $k$ are more or less unknown. Welch and Bishop [46] point out that the "important feature of the EKF is that the Jacobian $H_k$ in the equation for the Kalman gain $K_k$ serves to correctly propagate or 'magnify' only the relevant component of the measurement information."

In relating the EKF to our traffic congestion model, we assume that $z$ represents the observed outputs at time step $k = 3$ of the state $x$, with non-linear measurement $z \in \mathbb{R}^m$, which depends on the traffic input variables. These variables are $x_{3k} = t$, $x_{2k} = t-1$, and $x_{1k} := t_{\text{hist}}$, with the optional control inputs $u_{3k} = t_{\text{hist}}+1$ and $u_{2k} = u_{1k} = 0$, where $t$ represents the current travel time (mins) and $t_{\text{hist}}$ is the historic travel time (mins). We take $w_k$ and $v_k$ to be the process and measurement noise, assumed to be random Gaussian variables that account for the anticipated noise interference, measurement approximations, and randomness. Hence, our EKF equations are defined as:

$$x_{3k} = f(x_{2k}, u_{3k}, w_k) \tag{13.3}$$

$$x_{2k} = f(x_{1k}, u_{2k}, w_k) \tag{13.4}$$

$$z_{3k} = h(x_{3k}, v_k) \tag{13.5}$$

Detailed information on the EKF can be found in Welch and Bishop [46].
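The predict and correct cycle of the EKF can be illustrated with a minimal scalar sketch that follows the general time-update and measurement-update equations described by Welch and Bishop [46], with the noise terms inside $f$ and $h$ set to zero. It is not the traffic model of this chapter: the process model, the squared measurement model, the noise variances `q` and `r`, and the function names `ekf_step` and `run_filter` are all assumptions chosen only to make the example runnable.

```python
def ekf_step(x_est, p_est, u, z, f, df_dx, h, dh_dx, q, r):
    """One predict/correct cycle of a scalar EKF with zero-mean noise."""
    # Time update: project the state and the error covariance ahead.
    x_pred = f(x_est, u)
    a = df_dx(x_est, u)  # Jacobian of f with respect to x at the last estimate
    p_pred = a * p_est * a + q

    # Measurement update: linearize h about the predicted state.
    h_jac = dh_dx(x_pred)
    gain = p_pred * h_jac / (h_jac * p_pred * h_jac + r)
    x_new = x_pred + gain * (z - h(x_pred))
    p_new = (1.0 - gain * h_jac) * p_pred
    return x_new, p_new


def run_filter(measurements, x0, p0, q, r):
    """Filter a sequence with a hypothetical model (not the chapter's model)."""
    f = lambda x, u: x + u        # assumed process model; u = 0 keeps the state constant
    df_dx = lambda x, u: 1.0
    h = lambda x: x * x           # assumed non-linear measurement model
    dh_dx = lambda x: 2.0 * x
    x, p = x0, p0
    for z in measurements:
        x, p = ekf_step(x, p, 0.0, z, f, df_dx, h, dh_dx, q, r)
    return x, p


if __name__ == "__main__":
    # Noise-free measurements z = 4 of a constant state observed through h(x) = x^2.
    estimate, covariance = run_filter([4.0] * 50, x0=1.0, p0=1.0, q=1e-4, r=0.01)
    print(estimate, covariance)
```

In the sketch, the gain weights the measurement residual by the Jacobian of $h$, which is the role the quoted passage attributes to $H_k$.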

13.3.2 Heterogeneous Traffic Data Fusion

Figure 13.5 illustrates the heterogeneous distributed data fusion for traffic flow prediction. It is an extension of the homogeneous framework and involves multiple data types (quantitative from the traffic API, qualitative from Twitter). It can be seen that the top half of Fig. 13.5 relies on homogeneous data fusion (Sect. 13.3.1). The bottom half involves the use of sentiment analysis and cluster classification to predict traffic congestion using tweet data. The results of the EKF and the sentiment analysis are then combined using Mamdani Fuzzy Rule Inferencing (MFRI) for the heterogeneous traffic flow prediction.

Fig. 13.5 Heterogeneous distributed traffic data fusion for traffic flow prediction

In fact, the heterogeneous traffic data fusion uses the same distributed architecture as the homogeneous traffic data fusion, now extended to include the Twitter data source. Traffic delay tweets are obtained in near real time as unstructured data with mainly categorical characteristics. To achieve the heterogeneous traffic data fusion, we integrate the homogeneous fused parameters with the normalized Twitter outputs $t_n$ of the sentiment analysis, where $n$ is the number of the relevant tweet instance. We define a set of fuzzy rules for the traffic delays propagated by the parameter $t_n$ using Mamdani inferencing, while relying on the existing EKF-based distributed homogeneous model. Table 13.2 presents the rules generated based on the proposed framework for traffic congestion prediction. It involves alignment of the traffic delay POS tags based on the sentiment and cluster classification, and their association with the predictive EKF estimation (see Fig. 13.6).
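A rule base of the kind listed in Table 13.2 can be sketched as a small Mamdani inference: minimum for AND, complement for NOT, clipping and maximum for aggregation, and centroid defuzzification. The chapter does not give its membership functions, so everything numerical here is an assumption: the triangular sets on a normalized [0, 1] range, the mapping of "possible congestion" to the medium output set, and the names `tri`, `memberships`, and `infer_congestion`. The sketch is therefore not expected to reproduce the values shown in Fig. 13.6.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a and c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


# Assumed fuzzy sets on the normalized range [0, 1] (not taken from the chapter).
SETS = {"low": (-0.5, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.5)}


def memberships(value):
    """Degree of membership of a normalized value in each assumed set."""
    return {name: tri(value, *abc) for name, abc in SETS.items()}


def infer_congestion(ekf_output, tweet_score, steps=100):
    """Mamdani inference over six rules patterned on Table 13.2."""
    ekf = memberships(ekf_output)
    tweet = memberships(tweet_score)

    strength = {"low": 0.0, "medium": 0.0, "high": 0.0}
    for level in ("low", "medium", "high"):
        agree = min(tweet[level], ekf[level])            # Rules 1, 3, 5
        disagree = min(tweet[level], 1.0 - ekf[level])   # Rules 2, 4, 6
        strength[level] = max(strength[level], agree)
        # "Possible congestion" is mapped to the medium output set (assumption).
        strength["medium"] = max(strength["medium"], disagree)

    # Clip each output set by its rule strength, aggregate with max,
    # and defuzzify with a centroid over a regular grid.
    num = den = 0.0
    for i in range(steps + 1):
        y = i / steps
        mu = max(min(strength[name], tri(y, *abc)) for name, abc in SETS.items())
        num += y * mu
        den += mu
    return num / den if den > 0 else 0.5
```

When the tweet evidence and the EKF output agree on a level, the estimate moves toward that level; when they disagree, the sketch falls back to the middle of the range, mirroring the "possible congestion" rules.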

13.4 Results and Discussion

13.4.1 Measurement and Estimation Errors

We used Weka [47] and MATLAB [48] to develop the mathematical functions for the homogeneous and heterogeneous data fusion. The EKF is applied to the estimated outputs of the NN, DBN, and RF algorithms obtained from Weka, which were fused during the alignment and association stages of the distributed framework. This is done in order to address the noise and measurement approximations. This involves updating the a posteriori estimate error covariance with


Table 13.2 MFRI with Twitter sentiment EKF distributed homogeneous data fusion model

| Rule | If | Then |
|---|---|---|
| Rule 1 | (tweets contain traffic delay with high severity, serious accident, emergency, etc.) and EKF_Homogeneous_Has_High_Congestion_Output | Predict_High_Congestion |
| Rule 2 | (tweets contain traffic delay with high severity, serious accident, emergency, etc.) and Not_EKF_Homogeneous_Has_High_Congestion_Output | Possible_Congestion |
| Rule 3 | (tweets contain traffic delay with traffic road work, traffic light control, etc.) and EKF_Homogeneous_Has_Medium_Congestion_Output | Predict_Medium_Congestion |
| Rule 4 | (tweets contain traffic delay with traffic road work, traffic light control, etc.) and Not_EKF_Homogeneous_Has_Medium_Congestion_Output | Possible_Congestion |
| Rule 5 | (tweets contain traffic delay with no hindrance, looking good, etc.) and EKF_Homogeneous_Has_Low_Congestion_Output | Predict_Low_Congestion |
| Rule 6 | (tweets contain traffic delay with no hindrance, looking good, etc.) and Not_EKF_Homogeneous_Has_Low_Congestion_Output | Possible_Congestion |

Fig. 13.6 Estimations from the heterogeneous data fusion model (rule view of the six rules over the range 0 to 1, shown for HomogeneousTrafficDataFusion = 0.436 and traffictweets = 0.55, giving estimate = 0.611)

Fig. 13.7 (a) Measurement error. (b) Estimation error (panel a: error covariance against number of samples, 0 to 100; panel b: error against number of samples, 0 to 100)

Table 13.3 Data mining algorithms: NN, DBN, and RF for prediction of traffic flow

| | NN | DBN | RF |
|---|---|---|---|
| Average MAE | 0.5645 | 0.2455 | 0.4968 |
| R | 0.6922 | 0.8625 | 0.5520 |

overriding corrections as measured by the residual error reduction over the data surface. By using its stochastic adaptive filtering without a priori knowledge of the internal dynamic state of the system, the EKF obtained an optimal estimate. For example, Fig. 13.7a shows the covariance of the error before filtering (measurement error) and Fig. 13.7b shows the covariance of the error after filtering (estimation error). With the EKF, we achieve a mean absolute error (MAE) of 0.1262, which is better than the best result of 0.2455 obtained by the DBN. The MAE should be as low as possible, while the closer R is to +1, the better the model. Table 13.3 presents the performance of the data mining algorithms. The MAE and R are defined by:

$$\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} \left|\hat{y}_t - y_t\right| \tag{13.6}$$

$$R = \frac{\sum_{t=1}^{n} \left(\hat{y}_t - \bar{\hat{y}}\right)\left(y_t - \bar{y}\right)}{\sqrt{\sum_{t=1}^{n} \left(\hat{y}_t - \bar{\hat{y}}\right)^2 \sum_{t=1}^{n} \left(y_t - \bar{y}\right)^2}} \tag{13.7}$$

MAE measures the closeness of the predictions to the eventual outcomes. R measures the strength and direction of the linear relationship between the predicted value $\hat{y}_t$ and the actual value $y_t$; $\bar{\hat{y}}$ and $\bar{y}$ denote their means over the $n$ samples.
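The two measures can be computed directly from paired lists of predicted and actual values. The following is a minimal sketch using the standard definitions of the mean absolute error and the Pearson correlation coefficient; the function names are hypothetical and the sample values in the demonstration are illustrative, not data from the chapter.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average absolute difference between predicted and actual values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def pearson_r(predicted, actual):
    """Pearson correlation between predicted and actual values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_a)


if __name__ == "__main__":
    predicted = [2.0, 1.9, 1.7, 1.6]   # illustrative values only
    actual = [2.1, 2.3, 1.1, 1.6]
    print(mean_absolute_error(predicted, actual), pearson_r(predicted, actual))
```

A low MAE with a high R indicates predictions that are both close to the actual values and move with them.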

13.4.2 Model Validation

The model is validated by comparing its predicted travel time (PTT) for vehicles traversing segments of the road network against the PTT estimates obtained from the GBTTSE. Table 13.4 presents the results. The accuracy of a prediction is defined as the degree of closeness of a measured or calculated value to its actual value. It is computed as 1 minus the percentage error (PE), i.e. 100% − PE when PE is expressed in percent, where

$$\mathrm{PE} = \frac{\left|\text{current TT} - \text{predicted TT}\right|}{\text{current TT}} \times 100\% \tag{13.8}$$

We see that the PTT of our model is better than that of the GBTTSE in 12 cases, while the GBTTSE is better in 11 cases. In two instances, both models performed equally well. One could state that filtering the noise using the EKF contributes to the slight improvement in performance compared with the GBTTSE. A sample of the PTT for St. Patrick-Upper Lachine (MTL A15N) is presented in Fig. 13.8. The mean travel time is the mean of the travel times between the two nodes, e.g. from St. Patrick to Upper Lachine. Figure 13.9 is a sample plot of the selected algorithms with their respective PTT ($t_{\text{pred}}$).
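The accuracy computation of Eq. (13.8) can be checked against Table 13.4 with a short sketch. The function names are hypothetical, and comparing the two accuracies after rounding to two decimals (to match the precision of the table) is an assumption of this example.

```python
def percentage_error(current_tt, predicted_tt):
    """Eq. (13.8): absolute travel time error relative to the current travel time, in percent."""
    return abs(current_tt - predicted_tt) / current_tt * 100.0


def accuracy(current_tt, predicted_tt):
    """Accuracy in percent, taken as 100% minus the percentage error."""
    return 100.0 - percentage_error(current_tt, predicted_tt)


def closer_prediction(current_tt, gbttse_tt, ekf_tt):
    """Report which prediction is more accurate at two-decimal precision."""
    acc_gbttse = round(accuracy(current_tt, gbttse_tt), 2)
    acc_ekf = round(accuracy(current_tt, ekf_tt), 2)
    if acc_gbttse == acc_ekf:
        return "tie"
    return "GBTTSE" if acc_gbttse > acc_ekf else "EKF"


if __name__ == "__main__":
    # Champlain to Atwater in Table 13.4: t = 2.15, GBTTSE 2.00, EKF data fusion 1.81.
    print(round(accuracy(2.15, 2.00), 2), round(accuracy(2.15, 1.81), 2))
    print(closer_prediction(2.15, 2.00, 1.81))
```

Applied row by row, the same comparison gives the tally of cases in which each model is closer to the current travel time.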

13.5 Conclusion

In this chapter, we proposed a multimodal Big data fusion framework for traffic congestion prediction. It involves a distributed traffic data fusion architecture with homogeneous (quantitative only) and heterogeneous (quantitative and qualitative) data. For the homogeneous data fusion, using a highway traffic dataset, we predict the travel times using data mining predictive algorithms comprising DBN, NN, and RF. We also used the EKF for its adaptive filtering capability in order to reduce the measurement noise and estimation errors. For the heterogeneous data, we integrated the homogeneous fusion information with Twitter traffic data and applied MFRI for good interpretability. Our results emphasize the improvements made in the prediction of travel times using traffic Big data of vehicles traversing various road network nodes in the city of Montreal. The model validation is done with the

Table 13.4 Model validation of results between EKF data fusion model and the GBTTSE travel time predictions for Montreal (MTL) motorway networks

MTL motorway networks: A15N, AL100

| Source node | Destination node | t (mins) | t−1 (mins) | thist (mins) | thist+1 (mins) | PTT tpred GBTTSE (mins) | GBTTSE accuracy (%) | PTT tpred EKF data fusion (mins) | EKF data fusion accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|
| Champlain | Atwater | 2.15 | 2.12 | 1.18 | 2.23 | 2.00 | 93.02 | 1.81 | 84.19 |
| St Patrick | Upper Lachine | 2.28 | 2.32 | 1.95 | 1.87 | 1.96 | 85.96 | 1.85 | 81.14 |
| Upper Lachine | St Luc | 1.05 | 1.02 | 1.45 | 1.65 | 1.85 | 23.80 | 1.66 | 41.90 |
| St Luc | Cote St Catherine | 1.62 | 1.58 | 1.47 | 1.53 | 1.61 | 99.34 | 1.78 | 90.12 |
| Cote St Catherine | Jean Talon | 1.05 | 0.62 | 0.92 | 1.03 | 1.08 | 97.22 | 0.99 | 94.29 |
| Jean Talon | Duncan | 1.95 | 1.95 | 2.03 | 1.68 | 2.08 | 93.33 | 1.98 | 98.46 |
| Duncan | Lucerne | 4.48 | 4.48 | 3.03 | 4.06 | 3.04 | 67.85 | 4.05 | 90.40 |
| Dunkirk | Sauve | 2.05 | 1.97 | 1.93 | 2.02 | 2.12 | 96.59 | 2.09 | 98.05 |
| Sauve | Salaberry | 2.17 | 2.17 | 2.14 | 2.21 | 2.13 | 98.16 | 2.40 | 89.4 |
| Salaberry | Cartier | 1.17 | 1.95 | 1.02 | 1.10 | 1.86 | 58.03 | 1.75 | 50.43 |
| IlleSoeur | FernandSeguin | 2.82 | 2.80 | 2.83 | 2.86 | 2.78 | 98.58 | 2.89 | 97.52 |
| FernandSeguin | Irlandais | 1.45 | 1.45 | 1.56 | 1.50 | 1.53 | 94.48 | 1.50 | 96.55 |
| Irlandais | Wellington | 1.18 | 1.20 | 1.06 | 1.08 | 1.08 | 91.52 | 1.11 | 94.07 |


Table 13.4 (continued)

MTL motorway network and source node labels for these rows: ReneLevW, PieIXN1, Papineau, St Denis, St Laurent, Peel, Guy, Notre-Dame, LucienAllier, RTM, PieIXN5, PieIXN4, PieIXN3, PieIXN2, Lachine, Jarry, Jean Talon, Rosemont, Sherbrooke

| Destination node | t (mins) | t−1 (mins) | thist (mins) | thist+1 (mins) | PTT tpred GBTTSE (mins) | GBTTSE accuracy (%) | PTT tpred EKF data fusion (mins) | EKF data fusion accuracy (%) |
|---|---|---|---|---|---|---|---|---|
| St Denis | 2.85 | 2.83 | 2.62 | 2.60 | 2.72 | 95.44 | 2.98 | 95.44 |
| St Laurent | 1.62 | 2.72 | 1.65 | 1.71 | 1.68 | 96.30 | 1.56 | 96.30 |
| University | 2.45 | 2.45 | 1.86 | 1.95 | 1.62 | 66.12 | 2.40 | 97.96 |
| Guy | 1.60 | 1.60 | 1.30 | 1.30 | 1.52 | 95.0 | 1.55 | 96.88 |
| Atwater | 1.73 | 1.73 | 1.52 | 1.50 | 1.64 | 94.80 | 1.59 | 91.91 |
| Sherbrooke | 3.8 | 3.8 | 3.15 | 3.12 | 3.50 | 92.11 | 2.38 | 62.63 |
| Rosemont | 3.05 | 2.81 | 2.5 | 3.02 | 2.61 | 85.57 | 3.01 | 98.69 |
| Jean Talon | 2.72 | 2.70 | 2.30 | 2.41 | 2.71 | 99.63 | 2.68 | 98.53 |
| Jarry | 1.87 | 1.84 | 1.68 | 1.70 | 1.43 | 76.47 | 1.50 | 80.21 |
| HenriB | 8.87 | 8.87 | 9.02 | 10.50 | 9.82 | 89.29 | 8.73 | 98.42 |
| LucienAllier | 22.18 | 22.18 | 21.87 | 20.56 | 21.69 | 97.79 | 20.89 | 94.18 |
| PalaisCongres | 12.03 | 15.0 | 16.68 | 18.26 | 19.00 | 42.06 | 18.63 | 45.14 |


Fig. 13.8 Example of calculated TTs from GBTTSE

Fig. 13.9 Example of calculated TTs from RF, DBN, NN, and GBTTSE

Genetec blufaxcloud travel-time system engine. The strength of the proposed work is the use of Big data fusion for traffic congestion prediction in near real-time on the basis of the travel times of road vehicles on urban motorways. The limitation is the lack of adequate system tools to seamlessly integrate data from various sources for real-time traffic information. As future work, one could consider the use of traffic images, video, and other real-time data to improve traffic congestion prediction while integrating them into our existing framework. Secondly, the study can be extended by integrating various geographical contexts, connected locations, merges, ramps, etc.

Acknowledgments The authors owe much gratitude to Genetec in Montreal, Canada, for the traffic data. They also thank the reviewers for their constructive comments and criticism.



Chapter 14

Parallel and Distributed Computing for Processing Big Image and Video Data

Praveen Kumar, Apeksha Bodade, Harshada Kumbhare, Ruchita Ashtankar, Swapnil Arsh, and Vatsal Gosar

Abstract This chapter presents two approaches for addressing the challenges of processing and analyzing Big image or video data. The first approach exploits the intrinsic data-parallel nature of common image processing techniques to process large images or datasets of images in a distributed manner on a multi-node cluster. The implementation uses Apache Hadoop's MapReduce framework and the Hadoop Image Processing Interface (HIPI), which facilitates efficient and high-throughput image processing. The chapter also describes a Parallel Image Processing Library (ParIPL), developed by the authors on this framework, which aims to significantly simplify image processing using Hadoop. The library exploits parallelism at various levels: frame level and intra-frame level. The second approach uses high-end GPUs for efficient parallel implementation of specialized applications with high performance and real-time processing requirements. A parallel implementation of a video object detection algorithm, which is the fundamental step in any surveillance-related analysis, is presented on GPU architecture along with fine-grain optimization techniques and algorithm innovation. Experimental results show significant speedups of the algorithms, resulting in real-time processing of HD and panoramic resolution videos.

14.1 Introduction

In the present era, high-resolution cameras have become inexpensive, compact and ubiquitous in smartphones and surveillance systems. The images and videos recorded with these cameras are very large, so a considerable portion of Big data is contributed by images and videos. Image processing operations are mostly compute- and memory-intensive. Techniques designed for small images in small datasets do not scale well to large images in equally sized or larger datasets, with their intensive computation and storage requirements. Thus, big image/video data poses fundamental challenges in processing and analysis, requiring novel and scalable data management and processing frameworks/techniques. The high processing needs of image-processing applications are well suited to the power of parallel and distributed computing.

To address these challenges, two approaches can be considered: the Parallel Image Processing Library (ParIPL) and GPUs for image processing. The first approach exploits the intrinsic data-parallel nature of common image processing techniques to process large images or large datasets of images in a distributed manner on a multi-node cluster. The implementation uses Apache Hadoop's MapReduce framework and the Hadoop Image Processing Interface (HIPI), which facilitates efficient and high-throughput image processing. After going through existing frameworks, one can say that there is a need for a library that implements many basic and some commonly used complex image processing and computer vision algorithms for the FloatImage image format, in order to fully utilize the capability of HIPI and Apache Hadoop in image processing. Hence ParIPL, an image processing library, was developed. It implements many basic as well as complex image processing and CV algorithms in many different image formats, including the FloatImage format supported by HIPI. The library exploits parallelism at various levels: inter-frame parallelism, exhibited by the distribution of frames over different nodes of a cluster, and intra-frame parallelism, exhibited by algorithms such as sliding window operations for filtering, detection, etc.

Video analytic processing of a large amount of data becomes the bottleneck. Video surveillance algorithms represent a class of problems that are both computationally intensive and bandwidth intensive. For example, a benchmark by a team of researchers at Intel [11] reported that major computationally expensive modules, such as background modeling and detection of foreground regions, consume 1 billion microinstructions per frame of size 720 × 576. On a sequential machine with a 3.2 GHz Intel Pentium 4 processor, it takes 0.4 s to process one frame. Obtaining the desired frame processing rates of 20–30 fps in real time for such algorithms is the major challenge faced by developers [2, 3, 11, 14]. For processing video data from several camera feeds at high-definition resolution (HD, 1920 × 1080) and beyond, it becomes imperative to investigate approaches for high performance through distributed or parallel processing.

We have also seen remarkable advances in the computing power and storage capacity of modern architectures. Multi-core architectures and GPUs provide an energy- and cost-efficient platform. There have been some efforts to leverage these co-processors to meet real-time processing needs [13, 19] (including our previous work [17, 18]). However, the potential benefits of these architectures can only be realized by extracting data-level parallelism and developing fine-grained parallelization strategies using GPUs.

This chapter is organized as follows. Section 14.2 gives background and a literature review on parallel and distributed computing for processing Big data, including efficient frameworks using MapReduce, Spark, Storm, Hadoop and GPU. The parallel-processing framework for image data using Hadoop is discussed in Sect. 14.3. Section 14.4 discusses the parallel processing of images using GPU and gives experimental results and comparisons for various image-processing operations. Some concluding remarks are given in Sect. 14.5.

P. Kumar (*) · A. Bodade · H. Kumbhare · R. Ashtankar · S. Arsh · V. Gosar, Visvesvaraya National Institute of Technology (VNIT), Nagpur, Maharashtra, India

© Springer Nature Switzerland AG 2019. K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_14

14.2 Background and Literature Review

In the era of Big data, demand for massive file processing is growing rapidly, and image data accounts for a considerable proportion of it: pictures embedded in web pages, photos released on social networks, pictures of goods on shopping websites, and so on. High-resolution cameras distributed over wide geographical areas are increasingly being used for surveillance. Many such images and videos are very large, up to hundreds of GB. Commonly, these images need to be processed for different kinds of applications, such as content-based image retrieval (CBIR), image annotation and classification, and image content recognition. Image processing is used in a wide range of domains, including medical applications (X-ray imaging, MRI scans, etc.), industrial applications, surveillance applications, remote sensing and astronomical data. Processing such large image and video data has high computational and memory requirements, often too high for a single-processor architecture. Hence, since image-processing operations are inherently parallel, it is natural to process these massive images on distributed (e.g., Hadoop) and parallel (e.g., GPU) architectures with accelerators.

There are four advantages of distributed systems over isolated computers:
1. Data sharing, which allows many users or machines to access a common database.
2. Device sharing, which allows many users or machines to share their devices.
3. Communications, that is, all the machines can communicate with each other more easily than isolated computers.
4. Flexibility, i.e., a distributed computer can spread the workload over the available machines in an effective way.
Compared with the single-node environment, by employing distributed systems we can obtain increased performance, increased reliability, and increased flexibility.
To support efficient data processing in distributed systems, there exist some representative programming models, such as MapReduce [20], Spark, and Storm, all of which have open source implementations and are suitable for different application scenarios. Some efficient data-processing frameworks for distributed systems are briefly described below.

MapReduce The MapReduce framework (Fig. 14.1) is considered an effective way to analyze Big data due to its high scalability and its ability to process non-structured or semi-structured data in a distributed manner. Hadoop [6, 15], the open source implementation of the MapReduce framework, provides a platform for users to develop and run distributed applications. The MapReduce framework usually splits the input dataset into independent chunks, which are distributed among the nodes of the cluster and processed by the map tasks in a completely parallel manner. Application developers can concentrate on solving the problem at hand rather than dealing with the details of distributed systems. Hence, the MapReduce framework [20] is preferred for distributed computing. Well-known MapReduce implementations have a Distributed File System (DFS), which allows storing a large volume of application data and provides high-throughput access to it. This reduces the cost of storing large volumes of image datasets.

Fig. 14.1 MapReduce framework (stages: Map, Shuffle, Reduce)

Spark Spark, developed at UC Berkeley, is a parallel framework suitable for iterative computations, such as machine learning and data mining. It has as its architectural foundation the resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way. Spark utilizes RDDs that can be persisted in memory across computing nodes. Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear data flow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store the reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory. Spark facilitates the implementation of both iterative algorithms, which visit their dataset multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications may be reduced by several orders of magnitude compared to a MapReduce implementation (as was common in Apache Hadoop stacks).

Storm Storm is an open source distributed real-time computation system which makes it easy to reliably process unbounded streams of data; it does for real-time processing what Hadoop does for batch processing. Storm is suitable for real-time analytics, online machine learning, continuous computation, and more.
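Returning to the MapReduce model described above, the three stages named in Fig. 14.1 (Map, Shuffle, Reduce) can be sketched in a few lines of single-process Python. This is a minimal illustration under stated assumptions, not Hadoop code: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the toy grey-level histogram job are hypothetical and chosen only to show how key-value pairs flow through the stages.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; each call may emit several (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle_phase(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups, reducer):
    """Collapse the list of values of each key into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle_phase(map_phase(records, mapper)), reducer)


# Hypothetical job: a grey-level histogram over tiny "images" given as (name, pixels).
def histogram_mapper(image):
    _, pixels = image
    return [(level, 1) for level in pixels]


def sum_reducer(key, values):
    return sum(values)


images = [("a.png", [0, 255, 255]), ("b.png", [0, 0, 128])]
histogram = run_job(images, histogram_mapper, sum_reducer)
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network; the sketch only mirrors the data flow.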


Considering the application scenarios for which these distributed systems are suited, Hadoop is the more suitable choice for processing massive image files.

GPU Most image-processing algorithms have high computational complexity and are suitable for accelerators, especially the general-purpose graphics processing unit (GPGPU, or GPU for short). In recent years, GPGPUs have been widely used in parallel processing. With the support of CUDA and other parallel programming models for GPUs, such as Brook and the Open Computing Language (OpenCL), parallel programming on GPUs has become convenient, powerful and widespread.

When processing massive image files with a distributed platform and accelerators, two issues need to be addressed. First, massive image processing is both I/O intensive and computing intensive; these two aspects should be managed concurrently through multi-threading with multi-core processors or in-node GPUs, while keeping parallel programming simple. In addition, to prevent file I/O from becoming the overall system bottleneck, data transfer between CPU and GPU should be optimized. Second, there are many image-processing algorithms and variations for different kinds of image-related applications, and new algorithms are still emerging. Most of them were implemented as prototypes when they were proposed, and some of them have GPU implementations. A system architecture that can easily integrate existing CPU/GPU image-processing algorithms can be proposed to deal with these two issues. In other words, we should use available resources as much as possible rather than write everything ourselves.

14.2.1 Image Frameworks for Big Data

Apache Hadoop is an open source platform for distributed storage and distributed processing of very large datasets on computer clusters built from commodity hardware. Hadoop services provide support for data storage, data processing, data access, data governance, security and operations. Many researchers have used Hadoop in image processing and computer vision projects in order to reduce execution time and have come up with different algorithms [6, 15]. However, Hadoop does not have an image-processing library; application developers have to implement basic image-processing functionality in Hadoop themselves at the pixel level. Recently, there have been attempts to build frameworks which allow developers to integrate well-known image-processing libraries with Apache Hadoop. Some of the technologies and image-processing frameworks for parallel computing and distributed processing of Big data are discussed next.


MapReduce Image-Processing Framework (MIPr) MIPr [16] provides support for the OpenCV library and OpenIMAJ. The main disadvantage of using OpenCV or OpenIMAJ is that images must first be converted into a format supported by the respective library. This format conversion adds overhead, and application developers have to devote time to writing the conversion functions.

Hadoop Image-Processing Interface (HIPI) Apache Hadoop is primarily used for textual data (text processing) and offers no built-in support for image processing. HIPI [4] (Fig. 14.2), an image processing library designed to be used with the Apache Hadoop MapReduce distributed programming framework, gives it that ability. Standard Hadoop MapReduce programs struggle to represent image input and output data in a useful format. For example, with current methods, to distribute a set of images over Map nodes, the user needs to pass the images as a String. Each image then needs to be decoded in each map task in order to get access to the pixel information. This technique is not only computationally inefficient but also inconvenient for the programmer, and obtaining a standard float image representation this way involves significant overhead. HIPI provides a solution for storing a large collection of images on the Hadoop Distributed File System (HDFS) and making them available for efficient distributed processing with MapReduce-style programs. Although HIPI gives Apache Hadoop the capability to do image processing, its main disadvantage is that the functionality it provides is limited and offers little in the way of image-processing operations. HIPI represents a collection of images on HDFS using the HIPI Image Bundle (HIB) class and uses the FloatImage class (image format) to represent an image in memory. Basically, FloatImage is like an array of the pixel values of an image.
The application developers have to write all the image processing and computer vision algorithms from scratch, using the mathematical formulations on the pixel values.

Fig. 14.2 Image processing using HIPI

OpenIMAJ OpenIMAJ is a set of Java libraries for image and video analysis. It provides more image processing and computer vision algorithms than HIPI. Images in OpenIMAJ are represented in an array format. Hence, the application developer has to convert the array into an image and the image back into an array in both the Map and Reduce functions. This is the main disadvantage: it increases the execution time, working against the primary purpose of speeding up image processing.

OpenCV with Hadoop HIPI allows integration with OpenCV, but this is not a straightforward task and can take considerable manual effort. Moreover, OpenCV does not support the FloatImage class. All the algorithms and functionality implemented in OpenCV use different image formats (e.g., the Mat image format). Therefore, the application developer has to write functions to convert the FloatImage format to one of the OpenCV image formats in order to use the OpenCV library. This overhead increases the execution time and reduces the speedup gained in image processing.

Spark Image-Processing Library The Spark Image-Processing Library (sipl) is a Python module for image processing via Apache Spark. It contains a base image object, image readers and writers, a base algorithm, and HDFS utilities. Though Apache Spark overcame some of the main problems in MapReduce, it consumes a lot of memory, and the issues around memory consumption are not handled in a user-friendly manner. Apache Spark therefore requires substantial resources.

High-performance computing (HPC) refers generally to a computing practice that aims to solve complex problems efficiently and quickly. The main tenet of HPC that we focus on is parallel computing, in which multiple processing units are used simultaneously to perform a computation or solve a task. Most applications are written "sequentially", meaning that the computations of the program happen one after the other. There are some tasks, however, where the order of computation does not matter, for example summing up the elements of two separate lists. It doesn't matter which you sum up first.
If you could do the two summations simultaneously, then you would theoretically get a twofold speedup of your application. This is the idea behind parallel computing. Supercomputers and computer graphics cards have thousands of computing units, which allow them to run highly parallelized code.

Image processing and image analysis are about the extraction of meaningful information from images. Images on computers are represented by a matrix of pixels; the width × height of this matrix is the resolution. Each pixel contains information about the amount of red, green and blue in the image at that point. Image processing and analysis is a task highly suited to high-performance computing (HPC) and parallel processing. For example, suppose the task is to detect a face in an image. The face will take up only a small subset of the image, so to detect it one needs to focus on that subset, or image window. Sequentially, one could iterate over the image and process one subset at a time, but one could also use parallel computing to process the subsets simultaneously and obtain speedups that are ideally linear in the number of subsets. This approach is called the sliding window technique, which is discussed in more detail later in the chapter.
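The sliding window idea can be illustrated with a small Python sketch. It is a toy under stated assumptions: the image is a list of rows of grey levels, the "detector" is simply the mean intensity of a window (a stand-in for a real face detector), and the helper names `sliding_windows`, `window_score` and `best_window` are hypothetical. Windows are independent of each other, so they are scored concurrently with the standard library's `ThreadPoolExecutor`; for CPU-bound work in CPython a process pool or a GPU would be needed to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def sliding_windows(height, width, win, stride):
    """Yield the (row, col) of the top-left corner of every full window."""
    for r in range(0, height - win + 1, stride):
        for c in range(0, width - win + 1, stride):
            yield r, c


def window_score(image, r, c, win):
    """Stand-in for a detector: mean intensity of the win x win window at (r, c)."""
    total = sum(image[i][j] for i in range(r, r + win) for j in range(c, c + win))
    return total / (win * win)


def best_window(image, win=2, stride=1, workers=4):
    """Score all windows concurrently and return the best position and its score."""
    positions = list(sliding_windows(len(image), len(image[0]), win, stride))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so scores[i] belongs to positions[i].
        scores = list(pool.map(lambda rc: window_score(image, rc[0], rc[1], win), positions))
    best = max(range(len(positions)), key=scores.__getitem__)
    return positions[best], scores[best]


image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
position, score = best_window(image)
```

Each window reads the shared image but writes nothing, which is what makes the per-window work safe to run in parallel.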

14.3 Distributed and Parallel Processing of Images on Hadoop

This section focuses on the Hadoop MapReduce implementation. It discusses the idea of utilizing inter-frame as well as intra-frame parallelism in image processing and computer vision algorithms, and our approach to implementing it in the Hadoop MapReduce paradigm.

14.3.1 Inter-Frame Parallelism

Inter-frame parallelism is achieved by distributing images over all nodes. In this case, each image is processed completely on one node only. Apache Hadoop's MapReduce programming framework is used along with HIPI. The structure of the framework is given in Fig. 14.3. Apache Hadoop forms the base of the architecture. Over it comes the HIPI library, and on top of that comes ParIPL. The application developer can focus on image processing by using the ParIPL library directly, rather than dealing with work distribution among the nodes of a cluster (which is done by Hadoop).

Fig. 14.3 Architecture of the framework

First, the input image dataset is converted into a HIB. The primary input object to a HIPI program is a HIB, a collection of images represented as a single file on HDFS. From this HIB, each image is read in FloatImage format and provided to the mappers. All the image-processing operations are done inside the map() method of the mapper class. Thereafter, Apache Hadoop does the shuffling and sorting. The processed images in FloatImage format are then given to the reducer, where they are converted into a text file. Finally, the text file is converted back into individual images. This entire process is shown in Fig. 14.4.

Fig. 14.4 Inter-frame parallelism using ParIPL
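The defining property of inter-frame parallelism, that each image is handled entirely by one node, can be sketched as follows. This is an illustration only, assuming a simple round-robin assignment; Hadoop's actual scheduling is based on input splits and is not reproduced here. The names `assign_frames`, `invert` and `process_on_nodes` are hypothetical, and the frames are tiny lists of grey levels rather than FloatImage objects.

```python
def assign_frames(frame_names, num_nodes):
    """Round-robin assignment: every frame goes to exactly one node."""
    assignment = {node: [] for node in range(num_nodes)}
    for index, name in enumerate(frame_names):
        assignment[index % num_nodes].append(name)
    return assignment


def invert(pixels):
    """Example point operation applied to a whole frame (8-bit grey levels)."""
    return [255 - p for p in pixels]


def process_on_nodes(frames, num_nodes, operation):
    """Process each frame completely on the node it was assigned to."""
    assignment = assign_frames(sorted(frames), num_nodes)
    results = {}
    for node, names in assignment.items():  # each iteration stands in for one node
        for name in names:
            results[name] = operation(frames[name])
    return assignment, results


frames = {"f0": [0, 10], "f1": [255, 5], "f2": [100, 200]}
assignment, results = process_on_nodes(frames, 2, invert)
```

Because no frame is shared between nodes, the per-node loops have no dependencies on each other and could run at the same time.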

14.3.2 Intra-Frame Parallelism

Intra-frame parallelism is achieved by distributing the parts of one image across all nodes, so that all parts of the image are processed on the nodes simultaneously. Image-processing operations are inherently parallel. The primary operations in image processing are of two types: point processing and neighborhood processing. In point processing, each pixel is passed through a mathematical transformation function to obtain the value of the corresponding output pixel. Because all the pixels in the image are transformed by the same function, independently of one another, the pixel values of the output image can be calculated in parallel. In neighborhood processing, the pixel value at location (x, y) of the output image depends on the input pixel value at (x, y) and on the pixel values in the neighborhood of (x, y), i.e., S(x, y). If the application developer supplies an algorithm that provides the neighborhood pixel values for each point, this operation can also be performed in parallel. Researchers have proposed various algorithms that exploit this intrinsic parallelism of image-processing operations [9]. Here, the sliding window technique is used to exploit it.

The initial high-resolution input image is split into n small images, where n depends on the number of nodes in the cluster. The input image is always split into an even number of small images, and the number of small images must be less than the number of nodes in the cluster. Hence, if m is the number of nodes in the cluster, n is the largest even number less than m. In mathematical form: n % 2 == 0 and n < m.

A HIB is then created from these small images. Once the HIB has been created, the computation is carried out on multiple nodes in the same way as described above for inter-frame parallelism. After the processing is done, the images are merged back (combined in the reduce step of MapReduce) to obtain the processed output image in text format. Finally, the processed output image is created from this text file. The entire process is illustrated in Fig. 14.5.

Fig. 14.5 Intra-frame parallelism using ParIPL

Pseudo Code for Intra-frame Parallelism

Breaking the image to distribute on nodes:
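The chapter's own pseudo code is not available in this text. As a rough sketch of the splitting rule only, the following Python fragment picks the largest even n below the node count m and cuts an image into n horizontal strips. The strip orientation, the function names and the error raised for very small clusters are assumptions made for illustration; the overlap rows that a sliding window needs at strip borders are deliberately left out.

```python
def number_of_parts(nodes):
    """Largest even n with n < nodes, i.e. n % 2 == 0 and n < m."""
    if nodes < 3:
        # Assumption: read literally, the rule needs m >= 3 to give n >= 2.
        raise ValueError("need at least 3 nodes for an even n >= 2 with n < m")
    n = nodes - 1
    return n if n % 2 == 0 else n - 1

def split_rows(image, parts):
    """Split a list-of-rows image into `parts` strips of near-equal height."""
    base, extra = divmod(len(image), parts)
    strips, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        strips.append(image[start:stop])
        start = stop
    return strips

def merge_rows(strips):
    """Concatenate processed strips back into one image (the merge step)."""
    return [row for strip in strips for row in strip]

image = [[r] * 3 for r in range(10)]        # 10 rows, 3 columns
strips = split_rows(image, number_of_parts(5))
restored = merge_rows(strips)
```

With five nodes the rule gives four strips, and merging the unprocessed strips returns the original image, which is a quick way to check that no rows are lost or duplicated at the cuts.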


14.3.3 Experimental Results

Figures 14.6, 14.7 and 14.8 show the results of processing image datasets of different sizes. Processing a large dataset of images with ParIPL takes longer on a Hadoop cluster with only one active node than on the same cluster with two active nodes. The computation performed by a single node is divided equally between the two nodes by assigning each an equal number of images, which reduces the processing time. However, a two-node cluster incurs a communication delay, so the processing time is not exactly half of that of the one-node cluster. The timing behavior improves as the cluster grows, since the communication delay stays the same while the processing power increases.

Similarly, processing a single large image with ParIPL takes longer on a single-node Hadoop cluster than on a two-node cluster. Here the computation is divided equally between the two nodes by partitioning the image equally, which reduces the processing time. Again, because of the communication delay, the processing time is not exactly half of that of the one-node cluster, and the timing behavior improves as the cluster grows.
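The argument above can be made concrete with a toy timing model in which the work divides evenly and a fixed communication delay is added whenever more than one node is used. The model and all numbers below are illustrative assumptions, not measurements from the chapter.

```python
def cluster_time(single_node_time, nodes, comm_delay):
    """Toy model: evenly divided work plus a constant communication delay."""
    if nodes == 1:
        return single_node_time          # assumed: no communication on one node
    return single_node_time / nodes + comm_delay

def speedup(single_node_time, nodes, comm_delay):
    return single_node_time / cluster_time(single_node_time, nodes, comm_delay)

# Hypothetical values: 80 s on one node, 5 s communication delay.
two_nodes = cluster_time(80.0, 2, 5.0)    # 45.0 s, more than half of 80 s
four_nodes = cluster_time(80.0, 4, 5.0)   # 25.0 s
```

Under these assumptions the two-node time is 45 s rather than 40 s, which matches the observation that the time is "not exactly half".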

14.4 Parallel Processing of Images Using GPU

This section discusses a high-performance GPU implementation of video object detection, which forms the core of visual computing in a number of applications, including video surveillance [1, 2]. The focus is on the algorithms used at successive stages of a moving-object detection and tracking pipeline: the Gaussian mixture model (GMM) for background modeling, morphological image operations for image noise removal, and Connected Component Labeling (CCL) for identifying the foreground objects. In each of these algorithms, the different memory types and thread configurations provided by the CUDA architecture have been exploited. One of the key contributions of this work is a novel algorithmic modification for parallelizing the CCL algorithm, where parallelism is limited by various data dependencies. Scalability was tested by processing different frame sizes on a Tesla C2070 GPU. The speedups obtained for the different algorithms yield real-time processing capacity for HD and panoramic resolution videos.

Fig. 14.6 Result of inter-frame parallelism (plot: MapReduce results for image size 320×240; time in seconds versus number of images, 250 to 2000; series: 1-node and 2-node)

Fig. 14.7 Result of inter-frame parallelism (plot: MapReduce results for image size 480×320; time in seconds versus number of images, 250 to 1000; series: 1-node and 2-node)

Fig. 14.8 Result of inter-frame parallelism (plot: MapReduce results for image size 720×480; time in seconds versus number of images, 250 to 2000; series: 1-node and 2-node)

Fig. 14.9 Overview of moving-object detection algorithms in video surveillance workload (flow diagram: Video Sequence → BG Modeling and FG Detection [Mixture of Gaussians] → Foreground Mask → Post Processing: Consolidation, Filtering etc. [Morphological operations] → Processed Mask → Blob Extraction [Connected Component Labeling] → Blob list and Analysis)

14.4.1 Moving-Object Detection

Figure 14.9 shows the different stages in video object detection, which are briefly outlined as follows.

Background Modeling and Detection of Foreground Regions

The pixel-level Gaussian mixture background model has been used in a wide variety of systems because it efficiently models multi-modal background distributions (such as waving trees, ocean waves, and light reflection) and adapts to changes in the background (such as gradual light change). It models the intensity of every pixel by a mixture of K Gaussian distributions and hence becomes computationally very expensive for large image sizes and large values of K. At the same time, the algorithm has a high degree of data parallelism, as it involves independent operations for every pixel. This combination of compute intensity and available parallelism makes GMM a suitable candidate for parallelization on multi-core processors.

Post-Processing Using Binary Morphology

The morphological operations “opening” and “closing” are applied to clean up spurious responses, detach touching objects, and fill in holes in single objects. Opening (erosion followed by dilation) removes small spurious flux responses, and closing (dilation followed by erosion) merges broken responses. There is a high degree of parallelism in this step, and it is computationally expensive because these operators have to be applied in several passes over the whole image.

Connected Component Labeling

The foreground regions in the binary image must be uniquely labeled in order to characterize the object pixels underlying each blob. Since there is a spatial dependency at every pixel, this step is not straightforward to parallelize. Although the underlying algorithm is simple in structure, the computational load increases with image size and the number of objects: the equivalence arrays become very large, and so does the processing time [14].

14.4.2 Implementation of Gaussian Mixture Model on GPU

A GMM is a statistical model that assumes that the data originate from a weighted sum of several Gaussian distributions. Stauffer and Grimson [3, 22] presented an adaptive GMM method to model a dynamic background in image sequences. If K Gaussian distributions are used to describe the history of a pixel, the observation of the given pixel will be in one of the K states at any one time [21]. The details of the foreground labeling procedure using GMM can be found in the papers cited above.

GMM offers pixel-level data parallelism that can be easily exploited on the CUDA architecture. The GPU consists of many cores that allow independent thread scheduling and execution, which suits independent per-pixel computation. An image of size m × n therefore requires m × n threads, organized in blocks of an appropriate size running on multiple cores. In addition, the GPU architecture provides shared memory, which is much faster than the local and global memory spaces. In fact, for all threads of a warp, accessing shared memory is as fast as accessing a register as long as there are no bank conflicts between the threads. To avoid too many global memory accesses, we used shared memory to store the arrays of the various Gaussian parameters. Each block has its own shared memory, which is accessible (read/write) to all its threads simultaneously; this greatly speeds up the computation in each thread because memory access time is significantly reduced. In our approach, we set K (the number of Gaussians) to four, which not only results in effective coalescing but also reduces bank conflicts.

Our approach to GMM also involves streaming, i.e., we process the input frame using two streams. As a result, the memory copies of one stream (half the image) overlap in time with the kernel execution of the other stream. By kernel execution of a stream, we mean the application of the GMM approach discussed above to half the pixels of a frame at a time. This is similar to the popular double-buffering mechanism shown in Fig. 14.10.

Fig. 14.10 Streaming (double buffering) mechanism on GPU memory to overlap communication with computation
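To make the per-pixel work concrete, the following Python sketch updates the mixture of a single grayscale pixel and reports whether the new observation is foreground. It is a simplified variant of the adaptive mixture update of [3], assuming a learning factor equal to alpha for the mean and variance, and it is not the CUDA kernel described above. All parameter values (alpha, the 2.5-sigma match test, the 0.7 background threshold, the initial and minimum variance) are illustrative assumptions.

```python
import math

def make_model(k=4, init_var=900.0):
    """One pixel's mixture: k components stored as [weight, mean, variance]."""
    return [[1.0 / k, 0.0, init_var] for _ in range(k)]

def update_pixel(model, x, alpha=0.05, match_sigmas=2.5, bg_threshold=0.7,
                 init_var=900.0, init_weight=0.05, min_var=4.0):
    """Update the mixture in place; return True if x is classified as foreground."""
    # Rank components by weight / sigma, most background-like first.
    model.sort(key=lambda c: c[0] / math.sqrt(c[2]), reverse=True)
    # Background = smallest prefix whose cumulative weight exceeds the threshold.
    cumulative, n_background = 0.0, 0
    for weight, _, _ in model:
        cumulative += weight
        n_background += 1
        if cumulative > bg_threshold:
            break
    # First component within match_sigmas standard deviations of x.
    matched = None
    for i, (_, mean, var) in enumerate(model):
        if abs(x - mean) <= match_sigmas * math.sqrt(var):
            matched = i
            break
    if matched is None:
        # No match: replace the lowest-ranked component with a new one at x.
        model[-1] = [init_weight, float(x), init_var]
    else:
        for i, comp in enumerate(model):
            comp[0] = (1 - alpha) * comp[0] + (alpha if i == matched else 0.0)
        comp = model[matched]
        comp[1] = (1 - alpha) * comp[1] + alpha * x
        comp[2] = max(min_var, (1 - alpha) * comp[2] + alpha * (x - comp[1]) ** 2)
    total = sum(comp[0] for comp in model)
    for comp in model:
        comp[0] /= total                      # keep the weights normalised
    return matched is None or matched >= n_background

model = make_model()                           # K = 4, as in the text
history = [update_pixel(model, 100) for _ in range(200)]
```

Each call touches only the state of one pixel, so one GPU thread per pixel can run this update without synchronising with its neighbours, which is the data parallelism the section describes.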

14.4.3 Implementation of Morphological Image Operations on GPU

After the foreground pixels have been identified, some noise elements (such as salt-and-pepper noise) creep into the foreground image. They need to be removed so that the relevant objects can be found by the connected component labeling method. This is achieved by the morphological image operation of erosion followed by dilation [8]. Each pixel in the output image is based on a comparison of the corresponding pixel in the input image with its neighbors, depending on the structuring element. Figure 14.11 illustrates the effect of dilation and erosion on a sample image A and structuring element B.

Fig. 14.11 Binary morphology examples of dilation and erosion operation. Dark squares denote "on" pixels

As the texture cache is optimized for two-dimensional spatial locality, we use two-dimensional texture memory to hold the input image; this has an advantage over reading pixels from global memory when coalescing is not possible. In addition, the problem of out-of-bound memory references at the edge pixels is avoided by the cudaAddressModeClamp addressing mode of the texture memory, in which out-of-range texture coordinates are clamped to the valid range. Thus, there is no need to check for out-of-bound memory references with conditional statements, which prevents the warps from becoming divergent and adding a significant overhead.

As shown in Fig. 14.12, a single thread is used to process two pixels. A half warp (16 threads) has a bandwidth of 32 bytes/cycle, and hence 16 threads, each processing 2 pixels (2 bytes), use the full bandwidth while writing back the noise-free image. This halves the total number of threads, reducing the execution time significantly. A straightforward convolution was done with one thread running on two neighboring pixels.

Fig. 14.12 Our approach for erosion and dilation on GPU
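The effect of erosion, dilation and clamped border addressing can be illustrated on the CPU with a short Python sketch. It assumes a square structuring element and emulates clamp addressing by clamping the neighbour coordinates to the image bounds; it is not the texture-memory kernel described above, and the helper names are hypothetical.

```python
def _clamp(value, low, high):
    return max(low, min(value, high))

def _apply(image, size, combine):
    """Combine the size x size neighbourhood of every pixel. Coordinates outside
    the image are clamped to the nearest edge pixel, so no bounds branch is needed."""
    height, width = len(image), len(image[0])
    r = size // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[_clamp(y + dy, 0, height - 1)][_clamp(x + dx, 0, width - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            row.append(combine(window))
        out.append(row)
    return out

def erode(image, size=3):
    """A pixel stays on only if its whole neighbourhood is on."""
    return _apply(image, size, min)

def dilate(image, size=3):
    """A pixel turns on if any pixel in its neighbourhood is on."""
    return _apply(image, size, max)

def opening(image, size=3):
    """Erosion followed by dilation: removes specks smaller than the element."""
    return dilate(erode(image, size), size)

# A 3 x 3 object plus one isolated noise pixel at (5, 5).
image = [[0] * 7 for _ in range(7)]
for y in range(1, 4):
    for x in range(1, 4):
        image[y][x] = 1
image[5][5] = 1
cleaned = opening(image)
```

In this example the isolated pixel disappears while the 3 × 3 object is restored to its original shape. Because of the clamping, an image that is entirely "on" is left unchanged by erosion, whereas zero padding would eat away its border.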

14.4.4 Implementation of Connected Component Labeling on GPU

The connected component labeling algorithm works on a black-and-white (binary) input image to identify the various objects in the frame by checking pixel connectivity [7, 12]. The image is scanned pixel by pixel (from top to bottom and left to right) to identify connected pixel regions, i.e., regions of adjacent pixels that share the same set of intensity values, and temporary labels are assigned. The connectivity can be either 4-neighbor or 8-neighbor (8-connectivity in our case). The labels are then grouped into equivalence classes according to the object they belong to. Figure 14.13a shows a typical labeling window that scans the binary image from left to right and top to bottom. The label B denotes the input binary data, while L denotes the labeled data of a neighboring pixel. In eight-connected labeling, four labeled neighbors must be taken into account when assigning a label to the input binary pixel. Figure 14.13b shows the cases in which label equivalences are generated. In case 2, the pixel L4 is shaded to signify that its value does not matter in this case. The second step resolves the equivalence classes of the intermediate labels generated in the first scan.

We explore two algorithms for our parallel implementation. The first uses Union-Find operations, which resolve equivalence class labels using a set of trees, often implemented as an array data structure, with additional operations to keep the trees balanced or shallow [5]. The second is a fundamental algorithm from graph theory, the Floyd–Warshall (F–W) algorithm, which expresses the equivalence relations as a binary matrix and resolves the equivalences by computing the transitive closure of the matrix. Our approach to parallelizing CCL on the GPU belongs to the class of divide-and-conquer algorithms [10]. The proposed implementation divides the image into small tile-like regions.
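The two-scan labeling with Union-Find described above can be sketched sequentially as follows. This is a minimal single-threaded illustration with 8-connectivity, not the tiled GPU implementation; the path-halving `find` and the "smaller root wins" `union` are one common way to keep the trees shallow and are assumptions, not necessarily the variant of [5].

```python
def find(parent, label):
    """Return the root of `label`, halving the path to keep the tree shallow."""
    while parent[label] != label:
        parent[label] = parent[parent[label]]
        label = parent[label]
    return label

def union(parent, a, b):
    """Merge the classes of a and b; the smaller root becomes the representative."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)

def label_components(image):
    """Two-pass 8-connected labeling; returns (label image, number of objects)."""
    height, width = len(image), len(image[0])
    labels = [[0] * width for _ in range(height)]
    parent = [0]                                  # index 0 is the background
    # First scan: assign temporary labels and record equivalences.
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            neighbours = []
            # Left, upper-left, upper and upper-right neighbours (already scanned).
            for dy, dx in ((0, -1), (-1, -1), (-1, 0), (-1, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny and 0 <= nx < width and labels[ny][nx]:
                    neighbours.append(labels[ny][nx])
            if not neighbours:
                parent.append(len(parent))        # new temporary label
                labels[y][x] = len(parent) - 1
            else:
                labels[y][x] = min(neighbours)
                for other in neighbours:
                    union(parent, labels[y][x], other)
    # Second scan: replace temporary labels by consecutive final labels.
    final = {}
    for y in range(height):
        for x in range(width):
            if labels[y][x]:
                root = find(parent, labels[y][x])
                labels[y][x] = final.setdefault(root, len(final) + 1)
    return labels, len(final)

# A U-shaped object: its two arms get different temporary labels that are
# later found to be equivalent.
u_shape = [[1, 0, 1],
           [1, 1, 1]]
labelled, count = label_components(u_shape)
```

The U-shaped example shows why the second step is needed: the two arms receive different temporary labels in the first scan and are recognised as one object only after the equivalence is resolved.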
Then labeling of the objects in each region is done independently by parallel threads running on the GPU. The result after the divide phase is shown in Fig. 14.14a. In the conquer phase, the local labels in the regions are merged into global labels, so that if an object spans multiple regions, all the pixels that are part of that object are assigned a single label that is unique with respect to the entire image. The result after the merge phase is shown in Fig. 14.14b.

Fig. 14.13 The basic labeling strategy: (a) label assignment scheme; (b) equivalence label cases. In case 2, the pixel L4 is shaded differently to signify that its value does not matter in this case

Fig. 14.14 A sample example of dividing the image into tile-like regions and then merging. (a) Shows the local labels given to each region by independent threads; (b) shows the global labels after merging the local labels from all the regions

In the first phase, we divide the image into N × N smaller regions such that each region has the same number of rows and columns. The value of N and the number of rows and columns in each region are chosen according to the size of the image and the number of CUDA cores on the GPU. Each pixel is labeled according to its connectivity with its neighbors, as described earlier. If more than one neighbor is labeled, one of the neighbors' labels is used and the rest are recorded in the same equivalence class. This is done in the same way for all blocks running in parallel.

Our procedure for the merge phase is inspired by [10]. In this phase, we connect each region with its neighboring regions to generate the actual labels within the entire image. We use N × N pointers Global_List[i] to point to arrays that maintain the global labels with respect to the entire image. Global_List[i] points to the array for Region[i], where each array element is a global label within the entire image and the index of each array element is the local label within Region[i]. Memory for each array pointed to by Global_List[i] can be allocated dynamically according to the maximum local label in Region[i]. Figure 14.15 depicts an example list of labels for Region[i]: local labels 1, 2 and 5 are equivalent with local label 1 and their global label within the entire image is 11; local labels 3, 4 and 6 are equivalent with local label 2 and their global label is 12, given that T, the total number of labels reached at the end of Region[i-1] while merging successively, is, say, 10.

Fig. 14.15 An example of list of labels in Region[i] transformed into global labels after merge phase

In the merge phase, we compare the labels in the first row and column of a region with the labels in the adjoining row or column of the adjacent regions. There are three cases to be handled. The first case is the starting pixel of each region, whose label has to be merged with the labels of five adjacent pixels (for 8-connectivity) in Region[i-1], Region[i-N] and Region[i-N-1], as shown in Fig. 14.16a. The second case covers the remaining pixels in the first column of the region, and the third case covers the remaining pixels in the first row of the region, as shown in Fig. 14.16b and c, respectively.

Fig. 14.16 Merge takes place in three ways at Region[i]. (a) Merge at the first pixel in Region[i]. (b) Merge at remaining pixels in the first column in Region[i] and the last column in Region[i-1]. (c) Merge at pixels in the first row in Region[i] and the last row in Region[i-N]
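The label list of the Fig. 14.15 example can be reproduced with a few lines of Python. The sketch assumes that each local label has already been resolved to its equivalence class (1, 2, ...) within the region and that global labels are obtained by adding T, the number of labels used so far; the array layout with an unused index 0 and the function name are hypothetical. The merging of labels across region borders is not shown.

```python
def build_global_list(resolved_local, labels_so_far):
    """resolved_local[k] is the resolved class of local label k (index 0 unused).
    Returns an array indexed by local label whose entries are global labels."""
    return [0] + [labels_so_far + cls for cls in resolved_local[1:]]

# Region[i] from the example: local labels 1, 2, 5 resolve to class 1 and
# local labels 3, 4, 6 resolve to class 2; T = 10 labels exist before Region[i].
resolved_local = [0, 1, 1, 2, 2, 1, 2]
T = 10
global_list = build_global_list(resolved_local, T)
new_total = T + max(resolved_local)    # labels in use after Region[i]
```

Indexing `global_list` with a local label yields 11 for labels 1, 2 and 5 and 12 for labels 3, 4 and 6, as in the example, and the running total becomes 12 for the next region.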

14.4.5 Experimental Results

The parallel implementations of the above algorithms for moving-object detection were executed on an NVIDIA Tesla C2070 GPU, which has 448 cores delivering a peak performance of over a teraflop in single precision and 515 GFlops in double precision, along with 6 GB of dedicated memory that allows larger datasets to be stored in local memory to reduce data transfers. The image sizes used range from the standard 320 × 240 to full HD 1920 × 1080 and beyond HD 2560 × 1440 (as in panoramic images). For the experiments, high-resolution video data were collected using a professional-quality HD camera and a panoramic camera whose four video channels were concatenated to give a panoramic video (each frame of resolution 10240 × 1920). The results are compared against sequential execution on Intel® Core2 Duo and Intel® Xeon 6-core processors.

Parallel GMM Performance

Figure 14.17 compares the performance of the sequential and parallel implementations of the GMM algorithm on different frame sizes. The results show a significant speedup for the parallel GPU implementation, going up to 15.8× over sequential execution on the Intel Xeon processor and up to 70.7× over sequential execution on the Intel Core2 Duo processor. Apart from the image data, GMM stores the background model information, namely the mean (μ), sigma (σ) and weight (w) values, in memory, and this information needs to be transferred to the GPU. By overlapping communication with computation time, we were able to obtain significant speedup values. It can be observed that the speedup increases with image size because of the increase in the total number of CUDA threads (each pixel is operated upon by one CUDA thread), which keeps the cores busy. For optimization, we used shared memory to store the various Gaussian mixture values, which reduced the memory access time but had the side effect of decreasing the occupancy. To balance the use of shared memory, we chose 192 threads per block and created the blocks in a 1D grid. It was noted that shared memory can be used only with floating point and with data types that are type compatible with float (like int, short etc.), and not with unsigned char data types.

Fig. 14.17 Comparison of execution time (in milliseconds) of sequential versus parallel GPU implementation of GMM algorithm on different video frame sizes (plot: execution time in ms on a logarithmic axis versus frame size 320×240, 720×480, 1024×768, 1360×768, 1920×1080, 2560×1440; series: Core2 Duo, Xeon 6 core, GPU; bars annotated with speedup factors)

Parallel Morphology Performance

Figure 14.18 compares the performance of the sequential and parallel implementations of the morphology algorithm (time measured is for one erosion and one dilation operation with a 5 × 5 structuring element) on different frame sizes. As shown in the figure, we obtain a very large speedup for the parallel GPU implementation, going up to 253.6× over sequential execution on the Intel Xeon processor and up to 696.7× over sequential execution on the Intel Core2 Duo processor. Several factors contribute to this speedup. First, the input to this algorithm is a binary value for each pixel, which needs very little memory to store the entire frame, so the time for the memory copy operation is minimal. Moreover, we optimized memory access time during computation by storing the binary image in texture memory, which is read-only and cached. As the morphology operations on neighboring pixels use data that have spatial locality, this optimization reduces access time considerably without having to use shared memory. Thus, we could also reach the maximum occupancy of one. We also made the implementation generic, so that it works for a structuring element of any size. The speedup increased with image size because more CUDA threads are generated, utilizing the 448 cores of the Tesla card fully.

Fig. 14.18 Comparison of execution time (in milliseconds) of sequential versus parallel GPU implementation of Morphology algorithm on different video frame sizes. Time measured is for one dilation and one erosion with 5 × 5 structuring element (plot: execution time in ms on a logarithmic axis versus frame size 320×240, 720×480, 1024×768, 1360×768, 1920×1080, 2560×1440; series: Core2 Duo, Xeon 6 core, GPU; bars annotated with speedup factors)
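As a small worked example of the launch configuration mentioned for GMM (192 threads per block in a 1D grid, one thread per pixel), the number of blocks for each frame size can be computed as below. The rounding-up rule and the function name are assumptions; the kernel launch itself is not shown.

```python
def blocks_for_frame(width, height, threads_per_block=192):
    """Blocks needed in a 1D grid when every pixel gets one thread (rounded up)."""
    pixels = width * height
    return (pixels + threads_per_block - 1) // threads_per_block

frame_sizes = [(320, 240), (720, 480), (1024, 768),
               (1360, 768), (1920, 1080), (2560, 1440)]
blocks = {size: blocks_for_frame(*size) for size in frame_sizes}
```

For the frame sizes used in the experiments the pixel count is an exact multiple of 192, so no thread in the last block is idle; for example, a 320 × 240 frame needs 400 blocks and a 2560 × 1440 frame needs 19200.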

Parallel CCL Performance

We implemented the first pass of CCL labeling and the resolution of equivalence labels using both the Union-Find (UF) and the Floyd–Warshall (FW) algorithms, resulting in two different sequential implementations. For the parallel implementation, we use these two algorithms for local labeling on individual GPU cores, together with the divide-and-merge strategy discussed in Sect. 14.4.4. Thus, we have two different parallel GPU implementations whose performance is measured against their sequential counterparts. Figures 14.19 and 14.20 compare the performance of the sequential and parallel implementations of the CCL algorithm using UF and FW, respectively, on different frame sizes. Although the CCL algorithm was difficult to parallelize because of its many dependencies, we obtained reasonable speedups for the parallel GPU implementations, especially on large image sizes (up to 7.6–8.0×), as can be seen from Figs. 14.19 and 14.20. For smaller image sizes, the overhead of dividing the data and merging the labels overshadows the gain from parallelizing the computation. The overhead ratio falls on larger image sizes, where the amount of useful computation grows faster than the overhead. Again, we made extensive use of both shared and texture memory to optimize memory accesses. We stored the input image pixel values in texture memory, which is read-only. We used shared memory to store the equivalence array, which has to be accessed frequently during the first labeling pass and during the local resolution of labels. We tried to reduce the number of branch conditions (which cause thread divergence) by replacing many if-else conditions with switch statements.

Fig. 14.19 Comparison of execution time (in milliseconds) of sequential versus parallel GPU implementation of CCL Union-Find algorithm on different video frame sizes (plot: execution time in ms on a logarithmic axis versus frame size 320×240, 720×480, 1024×768, 1360×768, 1920×1080, 2560×1440; series: Seq_Core2 Duo, Seq_Xeon 6 core, GPU; bars annotated with speedup factors)

Fig. 14.20 Comparison of execution time (in milliseconds) of sequential versus parallel GPU implementation of CCL Floyd–Warshall algorithm on different video frame sizes (plot: execution time in ms on a logarithmic axis versus frame size 320×240, 720×480, 1024×768, 1360×768, 1920×1080, 2560×1440; series: Seq_Core2 Duo, Seq_Xeon 6 core, GPU; bars annotated with speedup factors)

End-to-End Performance

To test the performance of the end-to-end system, we connected the individual processing modules in the sequence of the video object detection algorithm. Table 14.1 shows the time taken and the frame-processing rate achieved by the sequential (running on Intel Xeon) and parallel GPU implementations at different video resolutions. Since we used a typical video surveillance dataset, we used the FW algorithm for the CCL implementation, which performed better than UF, as shown by our results above. We exclude the time taken to fetch the image frames from stream/disk I/O from our performance measurements. As can be observed from the table, the number of frames processed per second by the sequential implementation is very low, whereas the parallel GPU implementation can process even HD-resolution and panoramic videos in real time, at 22.3 and 13.5 frames per second, respectively.

Table 14.1 Comparison of execution time (in milliseconds) and frame-processing rate of sequential versus parallel GPU implementations of end-to-end system of moving object detection

Image size     Seq. UF                      Seq. FW                     GPU UF
1024 × 768     T = 28.565, Fps = 35.0       T = 8.716, Fps = 114.7      3.277
1360 × 768     T = 131.659, Fps = 7.6       T = 15.543, Fps = 64.3      8.4706
1920 × 1080    T = 299.941, Fps = 3.3       T = 23.368, Fps = 42.8      12.8355
2560 × 1440    T = 398.765, Fps = 2.5       T = 31.853, Fps = 31.4      12.5189
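The T and Fps entries of Table 14.1 are related by a simple conversion from milliseconds per frame to frames per second, which can be checked with the short Python sketch below. The function name is hypothetical and the (T, Fps) pairs are copied from the table as printed.

```python
def frames_per_second(time_ms):
    """Frame rate corresponding to a per-frame execution time in milliseconds."""
    return 1000.0 / time_ms

# (T in ms, Fps) pairs as printed in Table 14.1, row by row.
table_entries = [(28.565, 35.0), (8.716, 114.7),
                 (131.659, 7.6), (15.543, 64.3),
                 (299.941, 3.3), (23.368, 42.8),
                 (398.765, 2.5), (31.853, 31.4)]
recomputed = [round(frames_per_second(t), 1) for t, _ in table_entries]

# Ratio of the two execution times in the first row of the table.
first_row_ratio = round(28.565 / 8.716, 3)
```

Every recomputed rate agrees with the printed Fps value to one decimal place. Dividing the first execution time in a row by the second reproduces the last value printed in that row, for example 28.565 / 8.716 ≈ 3.277.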

14.5 Conclusion

This chapter presented ParIPL, a library that provides implementations of various image-processing operations and CV algorithms. The library assists application developers in parallel and distributed image processing on Apache Hadoop using HIPI. In addition, we described a parallel GPU implementation of the core video object detection algorithms for video surveillance, which achieves real-time processing on high-resolution video data. The algorithms described in this chapter are GMM, morphological operations, and CCL. Major emphasis was placed on reducing memory latency by making extensive use of shared and texture memory wherever possible, and on optimizing the number of threads and the block configuration for maximum utilization of all the GPU cores. In the experimental evaluation, we compared the performance of the sequential and parallel GPU implementations of the GMM, morphology, and CCL algorithms. The sequential execution was benchmarked on Intel® Core2 Duo and Intel® Xeon 6-core processors, and the parallel GPU implementation was benchmarked on an NVIDIA Tesla C2070 GPU.

Future work includes extending ParIPL to cover all the basic as well as complex image-processing operations needed today, and using the same framework for parallel and distributed video processing, which provides another dimension of parallelism. Future work on GPUs will include testing the current implementation on very high resolution (20 megapixels) panoramic camera output to measure its scalability and efficacy.

References

1. Sankaranarayanan, A.C., Veeraraghavan, A., Chellappa, R.: Object detection, tracking and recognition for multiple smart cameras. Proc. IEEE. 96(10), 1606–1624 (2008)
2. Bibby, C., Reid, I.D.: Robust real-time visual tracking using pixelwise posteriors. In: European Conference on Computer Vision, pp. II:831–844 (2008)
3. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. In: Proceedings CVPR, pp. 246–252 (1999)
4. Sweeney, C., Liu, L., Arietta, S., Lawrence, J.: HIPI for image processing using MapReduce. http://homes.cs.washington.edu/~csweeney/papers/undergrad_thesis.pdf, Site: http://hipi.cs.virginia.edu/ (last accessed on 15th October, 2017)
5. Fiorio, C., Gustedt, J.: Two linear time union-find strategies for image processing. Theor. Comput. Sci. 154(2), 165–181 (1996)
6. Demir, A.S.: Hadoop optimization for massive image processing: case study face detection. univagora.ro/jour/index.php/ijccc/article/download/285/pdf_142 (last accessed on 15th October, 2017)
7. Chang, F., Chen, C.-J., Lu, C.-J.: A linear-time component-labeling algorithm using contour tracing technique. Comput. Vis. Image Underst. 93(2), 206–220 (2004)
8. Sugano, H., Miyamoto, R.: Parallel implementation of morphological processing on CELL BE with OpenCV interface. Communications, Control and Signal Processing, 2008. ISCCSP 2008, pp. 578–583 (2008)
9. Squyres, J.M., Lumsdaine, A., Mccandless, B.C., Stevenson, R.L.: Parallel and distributed algorithms for high speed image processing sliding window technique. https://www.researchgate.net/publication/2820345_Parallel_and_Distributed_Algorithms_for_High_Speed_Image_Processing
10. Park, J.M., Looney, C.G., Chen, H.C.: Fast connected component labeling algorithm using a divide and conquer technique. Computer Science Department University of Alabama and University of Nevada, Reno (2004)
11. Jefferson, K., Lee, C.: Computer vision workload analysis: case study of video surveillance systems. Intel Technol. J. 09(02), (2005)
12. Wu, K., Otoo, E., Shoshani, A.: Optimizing connected component labeling algorithms. In: Proceedings of SPIE Medical Imaging Conference 2005, San Diego, CA (2005). LBNL report LBNL-56864
13. Boyer, M., Tarjan, D., Acton, S.T., Skadron, K.: Accelerating leukocyte tracking using CUDA: a case study in leveraging manycore coprocessors (2009)
14. Manohar, M., Ramapriyan, H.K.: Connected component labeling of binary images on a mesh connected massively parallel processor. Comput. Vis. Graph. Image Process. 45(2), 133–149 (1989)
15. Sonawane, M.M., Pandure, S.D., Kawthekar, S.S.: A Review on Hadoop MapReduce using image processing and cloud computing. IOSR J Comput Eng (IOSR-JCE) e-ISSN: 2278-0661, p-ISSN: 2278-872. http://www.iosrjournals.org/iosr-jce/papers/Conf.17003/Volume-1/13.%2065-68.pdf?id=7557 (last accessed on 15th October, 2017)
16. Sozykin, A., Epanchintsev, T.: MIPr Framework. https://www.researchgate.net/publication/301656009_MIPr_-_a_Framework_for_Distributed_Image_Processing_Using_Hadoop
17. Kumar, P., Palaniappan, K., Mittal, A., Seetharaman, G.: Parallel blob extraction using multicore cell processor. Advanced concepts for intelligent vision systems (ACIVS) 2009. LNCS 5807, pp. 320–332 (2009)
18. Kumar, P., Mehta, S., Goyal, A., Mittal, A.: Real-time moving object detection algorithm on high resolution videos using GPUs. J. Real-Time Image Proc. 11(1), 93–109 (2016). https://doi.org/10.1007/s11554-012-0309-y
19. Momcilovic, S., Sousa, L.: A parallel algorithm for advanced video motion estimation on multicore architectures. In: International Conference Complex, Intelligent and Software Intensive Systems, pp. 831–836 (2008)
20. Banaei, S.M., Moghaddam, H.K.: Apache Hadoop for image processing using distributed systems. https://file.scirp.org/pdf/OJMS_2014101515502691.pdf
21. Toyama, K., Krumm, J., Brumitt, B., Meyers, B.: Wallflower: principles and practice of background maintenance. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 1, pp. 255–261, 20–25 September, 1999, Kerkyra, Corfu, Greece
22. Zivkovic, Z.: Improved adaptive Gaussian mixture model for background subtraction. In: Proc. ICPR, vol. 2, pp. 28–31 (2004)

Chapter 15
Multimodal Approaches in Analysing and Interpreting Big Social Media Data

Eugene Ch’ng, Mengdi Li, Ziyang Chen, Jingbo Lang, and Simon See

Abstract The general consensus on the definition of Big data is that it is data that is too big to manage using conventional methods. Yet the present Big data approaches will eventually become conventional, so that non-specialists can conduct their tasks without the need for consultancy services, much as they use any standard computing platform today. In this chapter, we approach the topic from a multimodal perspective but focus strategically on making meaning out of single-source data using multiple modes, with technologies and data accessible to anyone. We give attention to social media, Twitter in particular, in order to demonstrate the entire process of our multimodal analysis, from acquiring data to the Mixed-Reality approaches to visualising data in near real-time for the future of interpretation. Our argument is that Big data research, which in the past was considered accessible only to corporations with large investment models and academic institutions with large funding streams, should no longer face such a barrier. Instead, the bigger issue should be the development of multimodal approaches to contextualising data so as to facilitate meaningful interpretations.

E. Ch’ng (*) · Z. Chen · J. Lang
NVIDIA Joint-Lab on Mixed Reality, University of Nottingham Ningbo China, Ningbo, China

M. Li
NVIDIA Joint-Lab on Mixed Reality, University of Nottingham Ningbo China, Ningbo, China
International Doctoral Innovation Centre, University of Nottingham Ningbo China, Ningbo, China

S. See
NVIDIA AI Technology Centre, Galaxis (West Lobby), Singapore

© Springer Nature Switzerland AG 2019
K. P. Seng et al. (eds.), Multimodal Analytics for Next-Generation Big Data Technologies and Applications, https://doi.org/10.1007/978-3-319-97598-6_15

15.1 Introduction

There is now a general consensus in the literature, from insight articles in magazines to scholarly publications, that the ‘big’ in Big data refers more to the relationality of data than to its largeness [1–4]. Data being big is only an issue if it presents a problem for storage and processing. ‘Big’ is indeed a relative term, as we have come to realise: data considered big in the humanities and social sciences is not necessarily big in the sciences, and is perhaps much smaller than what the technology industry would call big. But as we will soon learn, storing and processing Big data is not so much a problem today as the need to contextualise data. Data relationality, as opposed to the largeness of data, is in fact the greater issue at present. This is common sense, as the analysis of data for discovering patterns, and the use of a host of present machine-learning techniques for classification and/or prediction, require that we relate a set of attributes to a set of other attributes from the same or an entirely different dataset. Large corporations had collected huge stores of data for decades before the term Big data was coined. Data that are truly big do not necessarily generate value or profit if they are not refined [4], i.e. contextualised. No matter how big the distributed storage capacity an organisation may boast, data that are not contextualised (thus, avoiding the use of ‘raw’ data [5]) will only expand your storage requirements and, consequently, increase your digital curatorial needs. Besides, the exponential growth of computing power and storage capacity means that what is big today may be small tomorrow: ‘... as storage capacity continues to expand, today’s “big” is certainly tomorrow’s “medium” and next week’s “small.”’ [6].
Big claims about the value of Big data are rampant in marketing campaigns. Yet at the other end of the spectrum, Small Data [7], that is, data at a very personal, local level, data too unstructured to be usefully digitised, can be critical in revealing key information, particularly in the egocentric line of research that requires a more observational and interpersonal approach to investigation [8]. Regardless, the key to creating value with data, large or small, still involves, to a great extent, human labour and intelligence. Human labour is needed because every research inquiry or business target is unique, which brings the need to write code to process unstructured data and to develop machine-learning algorithms for mapping diverse data sources into a cohesive dataset for analytics. That is why Artificial Intelligence (AI) expertise is a precious human resource for which companies and research labs all over the world compete [9]. The human intelligence and intuition needed to conceptually link highly disparate datasets across various domains, so that data may be prepared for analysis, is still mandatory in an age where AI has seen billions of dollars of investment. What about interpretation, without which meaning is unknown and knowledge is not discovered? In fact, the act of selecting and processing data is an interpretation in itself. Refined Big data, even after the process of analytics, are not information-transparent; they are not self-evident in terms of information. Human interpretation is needed, and there are definite differences between one who is experienced and instinctive in the subject area and a junior analyst. Perhaps interpretation requires much more than a logical mind, for ‘Practitioners of science are different from artists in that they give primacy to logic and evidence, but the most fundamental progress in science is achieved through hunch, analogy, insight, and creativity’ (David Baltimore, The New Yorker, 27 January 1997 in [10]). Machines aren’t taking over the job market just yet, and data scientists are still divining.

What are the core technical needs in Big data? For most institutions, setting human resources aside, the answer is storage and computing power. A quote by Roger Magoulas states that ‘Data is big when data size becomes part of the problem’ (in [11]). Most standard computers are not able to store extremely large data without the risk of data corruption and the ‘vertical’ scaling problem of adding more storage devices as data expands. Furthermore, the added complexity of actually querying and getting big datasets out of a conventional database can be a real challenge [12]. On the other hand, most multicore CPUs are not able to process data using a highly parallel approach, which includes the training of algorithms used in machine learning.

Is bigger better? Not necessarily, although the argument is that the value we can gain from Big data comes from data manageability and our ability (instincts and intuition) to mine for patterns: ‘Data manageability plays an important role in making Big data useful. How much of data can become useful depends on how well we can mine data for patterns. A collection of data may be in the tera- or petabytes, within which perhaps only 1% (for example) may be useful. But without the 99%, that 1% may never be found. It is therefore important to collect and keep all potentially useful data, and, from those collections, conduct data mining in order to find the 1%’ [13]. It is a nice idea to own the data that are in the possession of large corporations.
But this dream will never come true for the majority of research labs and companies. The fact that we are not Google, Twitter or Facebook means that we will never get hold of their entire set of data. As pointed out in [1], even researchers working within a corporation have no access to all the data a company collects. A portion of it may be acquired at a price, but that is all there is to it. Whilst well-established machine learning algorithms are open for use in a multitude of software packages and libraries, the precious data that belong to these large corporations will perhaps never be truly accessible to us. And since such Big data are inaccessible, our storage and computing capacity need not be that big after all. What is really needed is a scalable system, so that as our collection of data grows, our ‘small Big data’ system can grow along with it. The means to do Big data research should not be accessible only to corporations with large investment models and academic institutions with large funding streams. Rather, managing Big data should be a basic capability of emerging labs and small businesses, which will allow them to concentrate on the skill most needed in Big data research, the one in which the entire process of data analytics culminates: human interpretation.

But how can we enhance human interpretation when it depends on insight, foresight and experience? Traditional interpretation depended on one, or at most two, methods of looking at a dataset. With the onset of technology, the permutations and combinations of ways of looking at data have increased, which could potentially replace the need for highly experienced data analysts. Multimodality in analysing single sources of data could certainly reveal much more.

At this point, we wish to dwell a little on the keyword of the present book. The word ‘multimodal’ as an adjective is defined similarly by various dictionaries as a process characterised by several modes of activity, or a process ‘having several modes’. Multimodality can be applied to different domains of research, and its historical use was not related to Big data until recently. In the theory of communications, multimodality uses several modes or media to create a single artefact. In Human-Computer Interaction (HCI), it is used for interactions having several distinct tools for input and output of data. In statistics, a multimodal distribution is one whose graph has multiple peaks. The term is ambiguous and flexible. The common meaning of the word across these areas is something having several distinct modes within a single process of work or, in the context of this chapter, using several distinct approaches in viewing a source of data. We acknowledge that, whilst there is a trend in the literature for multimodal data to refer to algorithms that deal simultaneously with data from multiple sources, this chapter deals with a single source of data viewed in several different modes for enhancing interpretation.

In this chapter, we present our exploration of developments in the multimodal analysis of social media data, focusing on Twitter, one of the largest social media data sources available to the public. Our multimodal analysis of Twitter data involves analysing emojis embedded in tweets (emoji analytics), identifying the sentiment polarity of tweets (sentiment analysis), visualising topical data on geographical maps (geo-mapping), social network analysis using graphs, and navigating and examining social data using immersive environments.
Our main principle of work differs from other social media analytics [14, 15] that focus only on analysing texts. Our multimodal approaches cover various aspects of Twitter data, from emojis to texts, geo-tagged maps, graphs and 3D environments. Having several modes of looking at data can help reveal much more information to a greater audience. There are gaps that need to be bridged in analysing Twitter data. First, the notion that emojis are popularly used on Twitter rests only on general assumptions; it has not been formally tested by a large-scale, cross-regional analysis of Twitter data. Second, very little work has been done on the systematic analysis of the use of emojis on Twitter, mainly because of the lack of behavioural data at scale. Third, little previous work in the Twitter Sentiment Analysis community has investigated emoji-related features. Most of the state-of-the-art works [16–18] focused on improving the performance of Twitter Sentiment Analysis and ignored the significance of visualising location-based sentiments on geographical maps. Furthermore, human annotators are employed to manually label tweets (see [19, 20]), which can be resource-intensive and tedious, to say the least. In this domain, most of the state-of-the-art works [19–21] are oriented towards binary classification (i.e., classification into ‘positive’ and ‘negative’) or ternary classification (i.e., classification into ‘positive’, ‘neutral’ and ‘negative’) of texts, with very little work conducted on multi-class sentiment analysis of Twitter posts. Twitter data is often analysed based on textual inputs, sequentially following the order of timestamps, or seen as an average in sentiment and emoji analytics. The visual mapping of Twitter data is an important area which has seen more development in the last few years. Twitter data is, at a fundamental level, data about social behaviour; hence the need for methods that look at social interaction, as well as for the use of theories from the social sciences (e.g., [22–24]). Mainstream research using Twitter data constructs graphs depicting follower–followee networks which, within the Twitter sphere, do not represent a community, because the majority of followers are not active. The visualisation of graphs in particular ways is an important approach that could significantly contribute to discovering meanings within the data.

Our aim in the present chapter is to explore the development of novel approaches so as to bridge research gaps in our domain. It is also to demonstrate that any young lab or small business with minimal computing resources and skills in computational sciences can arrive at good research findings, provided that data and resources are managed properly so that they lead to a sound interpretation of what topical data mean.

15.2 Where Do We Start?

Forget data that are too big to acquire and data that are too good to be in our possession. Begin by asking questions in our focus area, in the areas that interest us. When others talk about Big data, they focus their attention on the data in the possession of large corporations and contemplate what they could do if only they had access to those data sources. Those sources are indeed never fully accessible; yet data-driven research does not begin from coveting. It has always been about asking the right questions and seeking the data that are relevant to our topic of interest. As Marc Bloch rightly said as early as 1953, ‘Even those texts or archaeological documents which seem the clearest and most accommodating will speak only when they are properly questioned’ [25]. Moreover, data are increasingly made open; for example, the graph in the Guardian article [26] shows a clear relationship between Big data and open data, which one may access freely. We have prepared a categorical list below for such a purpose. A glance at the list will reveal that these data may be used for mapping relationships, provided that you begin by asking the right questions. Web links are deliberately not given, as they tend to change quite frequently.

Large Open Data Sources
1. Large public government datasets:
• National digital forecast database (NDFD)
• UNHCR global trends: Forced displacement in 2016
2. Scientific research, social media, or other non-government sources:
• Education statistics from The World Bank Data Centre
• Social media APIs

Small Open Data Sources (for All Countries by Government Agencies)
1. National Statistics: key national statistics such as demographic and economic indicators (GDP, unemployment, population, etc.) from the National Bureau of Statistics of both the USA and China.
2. Government budgeting (planned national government expenditure for the upcoming year) for both the USA and China.

Large Open China Data
1. Report on the state of the environment in China
2. Weibo social media API

Small Open China Data
1. National data on average price of food
2. Main financial indicators of industrial enterprises above designated size

Data became so important and accessible at the turn of the century that the humanities and social sciences are looking at how Big data can be used for transforming their research landscapes [2]. In other fields such as architecture, the demand for data-integrated buildings and the need for building information systems throughout the architectural process [27, 28] are creating new opportunities. Now that everything seems to generate some form of digital data, we will never be short of data; it is what we do with them, and how we view and interpret them, that matters. Data allow us to see our world clearly, and often in a new light, if and only if there is a way for us to examine their different aspects from several perspectives. A dataset can be viewed rather differently by people from dissimilar disciplines. A scientist and an artist are both open-minded, yet a scientist looks at and explains a subject logically, while the artist seeks the expression of emotions. Both are needed in data-driven research [29]. Henry Ford once said, ‘If there is any one secret of success, it lies in the ability to get the other person’s point of view and see things from that person’s angle as well as from your own’.
Albert Einstein likewise remarked that ‘To raise new questions, new possibilities, to regard old problems from a new angle, requires creative imagination and marks real advance in science’. Interpretation, to use an older reference for its clarity, can be broadly defined as ‘the process of ascertaining the meaning(s) and implication(s) of a set of material’ [25]. Sifting through large sets of contextualised data in order to conduct an interpretation is often a challenging task. It was discussed early on that interpretation in both quantitative and qualitative research is fraught with challenges and complexities [10, 30]. The entire goal of interpreting data is to understand the meaning present within a context: ‘A successful interpretation is one which makes clear the meaning originally present in a confused, fragmentary, cloudy form. But how does one know that this interpretation is correct? Presumably because it makes sense of the original text: what is strange, mystifying, puzzling, contradictory is no longer so, is accounted for’ [31].


Our chapter sees the issue of Big data storage and processing as a minimal problem; what we think important is a means to contextualise data through multimodal analytics so as to facilitate the process of interpretation. We believe that mining datasets for information with a multimodal approach is essential for knowledge discovery. A multimodal approach is needed because the opportunity to view data from multiple perspectives gives a more complete picture; there is a good reason why there are multiple types of 2D graph, each of which allows us to view a dataset differently. Now that scalable storage and processing technology is able to resolve the common denominator of our Big data problems, perhaps we should be conducting our research using different modes of analysis, so as to view our data from different perspectives. Here, we provide two scarcely cited Big data definitions to conclude the section:

Big data is a term that describes large volumes of high velocity, complex and variable data that require advanced techniques and technologies to enable the capture, storage, distribution, management, and analysis of the information [32].

Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation [33].

15.3 Multimodal Social Media Data and Visual Analytics

This section describes our multimodal approach to making sense of data. In summary, the multimodal aspects of our research are as follows:
1. We acquire, process and structure large datasets so as to store and curate data which may become important later, to examine them by asking the right questions, and to interpret them in order to extend knowledge.
2. We conduct sentiment analysis and emoji analytics so as to observe sentiment trends and directions in disparate cultural backgrounds towards particular topics.
3. We visualise location-based user-generated contents on geographical maps so as to understand the distribution of users and the context of the contents.
4. We map activity interactions, termed ‘social information landscapes’, within social media so as to understand relationships between agents, groups and communities; the diffusion of information across social networks; and to identify major influencers in topical subjects.
5. We process and visualise large amounts of data via parallel and distributed processing algorithms that we develop, so as to observe trends and to assist decision making in time-critical topics, such as political elections, market trends, attitudes of city dwellers in events, etc.
6. We restructure and partition spatial-temporal (4D) data within virtual environments and develop Virtual Reality and Mixed Reality interfaces for users to view, analyse, and interpret data that are too large for a 2D display. The notion is to allow interpreters to be part of the data, living within them, by taking a ‘phenomenological approach’ to data analysis.
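Item 4 above contrasts activity interactions with static follower lists. The following is a minimal sketch of that idea in Python, not the authors' implementation: the record fields (`user`, `mentions`, `retweet_of`) are hypothetical simplifications of tweet metadata, and summed incoming interactions are used only as a crude stand-in for influence.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified tweet records: only the fields this sketch needs.
tweets = [
    {"user": "alice", "mentions": ["bob"], "retweet_of": None},
    {"user": "carol", "mentions": [], "retweet_of": "bob"},
    {"user": "dave", "mentions": ["bob", "alice"], "retweet_of": None},
    {"user": "bob", "mentions": ["alice"], "retweet_of": None},
]

def build_interaction_graph(records):
    """Return a weighted, directed edge map: (source, target) -> interaction count."""
    edges = Counter()
    for rec in records:
        targets = list(rec["mentions"])
        if rec["retweet_of"]:
            targets.append(rec["retweet_of"])
        for target in targets:
            if target != rec["user"]:  # ignore self-interactions
                edges[(rec["user"], target)] += 1
    return edges

def in_strength(edges):
    """Sum incoming interaction weights per user (a crude proxy for influence)."""
    strength = defaultdict(int)
    for (_, target), weight in edges.items():
        strength[target] += weight
    return dict(strength)

edges = build_interaction_graph(tweets)
influence = in_strength(edges)
top_user = max(influence, key=influence.get)
print(top_user, influence[top_user])
```

Because edges are created only when one account actually mentions or retweets another, inactive followers never enter the graph, which is the distinction the list item draws.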

15.3.1 Data Sources and Distributed Storage

Our scalable distributed database uses MongoDB, a cross-platform, open source, free-form document database that stores data as JavaScript Object Notation (JSON)-like documents with dynamic schemas. Our edge and data processing nodes, and the server applications used for acquiring, processing and storing data, use Node.js and associated modules. We developed multiple Node.js applications within our servers for acquiring, processing, structuring and storing data. Node.js is also used as a web front-end for data access and geographical mapping applications. Redis, an open source, networked, in-memory key-value data structure server, is used for fast access to structured data when real-time applications require it. Our virtual environment for visualisation uses native OpenGL, but also the Unity3D integrated environment, which connects to multiple devices. The Oculus Rift and the Leap Motion controller are our main devices.

Our scalable hardware is deployed as a physical cluster (built from inexpensive commodity machines and Raspberry Pis), as virtual machines within dedicated server hardware, and as virtual machines in Cloud services (Amazon Web Services, DigitalOcean, and our UK university Cloud servers). We do not list our commodity and server specifications here, as they are of very different capacities. What is important is our ability to deploy our applications across different hardware based on project needs, which our streamlined design makes possible: we configured all of our hardware-software architecture to be scalable. We built our own servers and workstations for housing both professional and gaming NVIDIA GPGPUs (Tesla K80 and K40s, Quadros such as the GP100 and M6000s, and GTX 1080 Ti gaming cards). These are used for parallel, distributed processing and real-time visualisation work.

We have a host of human interface devices, from handheld controllers and sensor-based wearables developed in-house to head-mounted devices for virtual reality and augmented reality applications. Our data capture equipment includes 3D scanners, drones, and a suite of professional camera system setups. A high-level graphical view of our integrated hardware-software architecture is shown in Fig. 15.1.
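To make the division of labour between the document store and the in-memory store more concrete, here is a minimal Python sketch under loud assumptions: a plain list stands in for a document collection and a plain dict stands in for a key-value cache. No real MongoDB or Redis client calls are used, and the field names and the `count:` key prefix are invented for illustration.

```python
import json

# Stand-ins only: a list plays the role of a document collection and a dict
# plays the role of an in-memory key-value store.
collection = []
cache = {}

def store_tweet(raw_json):
    """Parse one JSON record and store it as-is; records need not share fields."""
    doc = json.loads(raw_json)
    collection.append(doc)
    return doc

def count_by_topic(topic):
    """Return a cached count if present, otherwise scan the collection once."""
    key = "count:" + topic
    if key not in cache:
        cache[key] = sum(1 for d in collection if topic in d.get("topics", []))
    return cache[key]

# Three records with different shapes, as a dynamic schema permits.
store_tweet('{"id": 1, "text": "hello", "topics": ["election"]}')
store_tweet('{"id": 2, "text": "hi", "topics": ["election"], "geo": {"lat": 29.8, "lon": 121.5}}')
store_tweet('{"id": 3, "text": "no topic field"}')

print(count_by_topic("election"))
```

The sketch deliberately omits cache invalidation: a count cached before new records arrive would go stale, which a real deployment would have to handle, for example by expiring keys.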

Fig. 15.1 Scalable hardware-software architecture, using small lab-friendly, affordable technologies

15.3.2 Emoji Analytics

In contrast to face-to-face communication, online textual communication was found to lack the non-verbal cues which are important for providing readers with contextual information such as the speaker’s intention or emotional state [34]. To compensate for the lack of facial cues, various non-standard orthographies, such as expressive lengthening and emoticons, have been used on social media platforms to communicate emotions [35, 36]. The advent of emojis has brought about a dramatic shift in the effectiveness of online communication, replacing user-defined linguistic affordances with predefined graphical symbols. A recent study [34] revealed that Twitter users who adopt emojis tend to reduce their usage of emoticons. Emojis can be said to have replaced emoticons as the catalyst for written expressions of emotion, as they have now been widely adopted for simplifying the expression of emotions and for enriching communications.

Emojis are pictographs that first appeared on Japanese mobile phones in the late 1990s. They became popular globally in text-based communication with the introduction of smartphones that support the input and rendering of emoji characters [37]. The most commonly used emojis are represented by Unicode characters [37]. With the introduction of new character sets in new Unicode versions, the number of emojis has increased. As of July 2017, there were 2623 emojis in the Unicode standard. For each emoji, a code and a description (e.g., U+1F60F for ‘smirking face’) are assigned by the Unicode Consortium. Emojis depict not only faces but also concepts and ideas such as animals, weather, vehicles, or activities such as swimming. They are widely used in social media sharing (e.g. emojis occur more frequently in posted tweets), smartphone texting, advertising, and more.

In addition to expressing emotions, emojis can be used for various purposes in mediated textual communication, lending it a more playful sort of interaction. They are used for maintaining a conversational connection and for creating a shared and secret uniqueness within a particular relationship [38]. Emojis have a significant advantage over plain text, as they are universally understood. They are easy to input and can convey rich emotions within a single character. Emojis have been used by the White House [39] to communicate with millennials. The ‘face with tears of joy’ emoji, also known as the LOL emoji or laughing emoji, was even chosen as the 2015 ‘word of the year’ by Oxford Dictionaries, because it best represented the mood, ethos, and preoccupations of the world [40]. Emojis are an exciting evolution of the way we communicate; some treat them as an emerging ‘language’ and claim that they could soon compete with languages such as English and Spanish in global usage. Emojis are by now popularly used on Twitter in different countries, with embedded demographic characteristics and diverse cultural meanings. Owing to the ubiquitous adoption of emojis, research on analysing them has great opportunities for development:
1. Emojis make conversations between users of different languages possible, demonstrating their critical role in overcoming geographical and language barriers. This allows researchers to map tweets across regions with more depth.
2. The popularity of emojis has generated huge volumes of behavioural data, which can be used for answering research questions that previously relied only on small-scale user surveys.
3. Emojis are compact and convey clear semantics when attached to a sentence, which makes them a good complement to natural language processing (NLP) and its known issues.
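The pairing of a code and a description assigned by the Unicode Consortium, mentioned earlier in this section, can be inspected directly. The short Python sketch below uses only the standard `unicodedata` module; the helper name `describe` is ours and purely illustrative.

```python
import unicodedata

def describe(code_point_label):
    """Turn a label such as 'U+1F60F' into (character, Unicode name, UTF-8 byte count)."""
    if not code_point_label.startswith("U+"):
        raise ValueError("expected a label of the form U+XXXX")
    value = int(code_point_label[2:], 16)  # hexadecimal code point
    char = chr(value)
    return char, unicodedata.name(char), len(char.encode("utf-8"))

char, name, n_bytes = describe("U+1F60F")
print(name, n_bytes)
```

Each of these emojis is a single Unicode code point, although it occupies four bytes when encoded as UTF-8, which matters when sizing storage for large tweet collections.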
The popular use of emojis has prompted a demand for emoji analytics, which can automate the discovery of information such as the intentions and meanings behind their use within extremely large datasets. Very few studies (see [41–43]) have so far systematically analysed and compared the usage of emojis, probably owing to the lack of behavioural data at scale. As emojis supplement, disambiguate and even enhance the meaning of messages, they are an important area of study complementing purely textual sentiment analysis. Whilst NLP algorithms are relatively mature in terms of textual analysis, emoji analytics is rather new. Our view is that, because emojis are naturally compact and convey clearer semantics, they make possible research approaches that are more robust than those relying on NLP alone, with its present issues and insufficiencies.

Here, we report on our development of emoji analytics on Twitter, which we consider to be a highly important aspect of our multimodal analysis of Twitter data. We analyse Twitter data from the viewpoint of emoji analytics, differing from other social media analytics that focus only on analysing texts [14, 15, 44]. This helps reveal much more information in the interpretation of Twitter data. We conducted an empirical analysis using a large-scale, cross-regional emoji usage dataset from Twitter to explore how users from different countries/regions use emojis on Twitter. We also examined how cross-cultural differences influence the use of emojis.

Various studies have investigated differences in online and offline behaviours. However, it is of greater significance to explain the reasons behind such differences. The sociology literature indicates that such behavioural variations are often explained by differences in culture rather than nationality [45]. Lim et al. [46] investigated national differences in mobile app user behaviours and compared the behavioural differences using Hofstede’s culture index. Reinecke and Bernstein [47] designed culturally adaptive systems which automatically generate personalised interfaces that suit cultural preferences, which immensely enhances user experience. Park et al. [45] examined how cross-cultural differences influence the use of emoticons on Twitter. Jack et al. [48] discovered that the East and the West behave differently in the decoding of facial expression signals, suggesting that cultural variations may influence the ways in which people distinguish one facial expression from another. Lu et al. [49] discovered that smartphone users from different countries differ greatly in their preferences for emojis. We have not found any work that systematically analyses and compares the use of emojis on Twitter and examines how such variance is related to cultural differences.

We used a Big data architecture [13] to retrieve a large-scale Twitter dataset consisting of 673 million tweets posted by more than 2,081,542 unique users within two months. In our initial analysis, we observed that 7.57% (50.9 million) of tweets include at least one emoji. This sample provides quantified evidence of the popularity of emojis on Twitter. In Novak et al.’s work [43], conducted in 2015, 4% of the tweets in their corpus contained emojis.
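The share of tweets containing at least one emoji can be measured in a few lines. The Python sketch below is illustrative only: the source does not say how emojis were detected, so the two Unicode ranges used here are our simplifying assumption and will miss some emojis and include a few non-emoji symbols. The sample tweets are invented.

```python
# Assumption: a character counts as an emoji if it falls in one of these
# Unicode ranges. This is a deliberate simplification for illustration.
EMOJI_RANGES = [
    (0x1F300, 0x1FAFF),  # pictographs, emoticons, transport and related symbols
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
]

def is_emoji(ch):
    """Return True if the character's code point lies in an assumed emoji range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)

def emoji_share(tweets):
    """Fraction of tweets that contain at least one emoji character."""
    if not tweets:
        return 0.0
    with_emoji = sum(1 for text in tweets if any(is_emoji(ch) for ch in text))
    return with_emoji / len(tweets)

# Invented sample: two of the four tweets contain an emoji.
sample = ["good morning \U0001F602", "no emoji here", "rain \u2614 again", "plain text"]
print(f"{emoji_share(sample):.2%}")
```

On a corpus of hundreds of millions of tweets, the same per-tweet test could be applied in parallel across partitions and the counts summed afterwards.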
Our percentage of 7.57% suggests that Twitter users have increased their use of emojis since 2015. We also identified the source of each tweet, by country or region, based on the exact GPS coordinates or on information gleaned from the Twitter location data field. To facilitate the comparison of emoji meaning and usage between countries, we generated joint vector space models of tweet texts and emojis by employing the skip-gram neural embedding model introduced by Mikolov et al. [50]. In our joint vector space models, each emoji character is replaced by a unique emojiCODE (e.g., emoji1f602 for U+1F602, ‘face with tears of joy’). From Hofstede Insights [51], we obtained the culture indexes for each country, and then investigated the correlations between the culture indexes and the usage of emotional emojis.

Figure 15.2 illustrates the top 20 most-used emojis on Twitter, and Fig. 15.3 displays the most frequently used emojis in different countries. Our findings indicate that emoji usage varies between countries. Users in countries such as the USA, France and Russia are more likely to insert emojis in their tweets. For instance, 12.6% of tweets from the USA contained at least one emoji, which is greater than in other countries. Intriguingly, the emoji that is overwhelmingly popular elsewhere, topping the list in other countries, is not the one most used in France and Spain. Japanese and Indonesian users have a preference for emojis with faces, while French users are more likely to use heart symbols. We can also see that users from Brazil, Spain, and the UK prefer hand emojis as compared to other nations. In addition, people from Japan and the Philippines tend to use sadder emojis in their tweets compared to those from the USA and France.
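The emojiCODE substitution and the skip-gram setting described above can be sketched as a preprocessing step. This Python example is a hedged illustration, not the authors' pipeline: the emoji ranges, the tokeniser and the window size are our assumptions, and the code only produces the (centre, context) training pairs that a skip-gram model consumes; it does not train an embedding.

```python
# Assumed emoji ranges (a simplification; real emoji detection is more involved).
EMOJI_RANGES = [(0x1F300, 0x1FAFF), (0x2600, 0x27BF)]

def is_emoji(ch):
    return any(lo <= ord(ch) <= hi for lo, hi in EMOJI_RANGES)

def to_emoji_code(ch):
    """Map an emoji character to a token such as 'emoji1f602'."""
    return "emoji" + format(ord(ch), "x")

def tokenise(text):
    """Lowercase the text and replace each emoji by its own emojiCODE token."""
    pieces = []
    for ch in text:
        pieces.append(f" {to_emoji_code(ch)} " if is_emoji(ch) else ch.lower())
    return "".join(pieces).split()

def skipgram_pairs(tokens, window=2):
    """List (centre, context) pairs for every token within the given window."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((centre, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

tokens = tokenise("So funny\U0001F602 today")
pairs = skipgram_pairs(tokens, window=1)
print(tokens)
print(pairs)
```

Treating each emoji as an ordinary token is what places words and emojis in the same vector space, so that an emoji's nearest neighbours can later be compared between per-country models.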


E. Ch’ng et al.

Fig. 15.2 Top 20 most-used emojis on Twitter

Fig. 15.3 Comparative frequencies of the use of emojis in different countries

We also investigated whether the meaning of the emojis changes across different countries, and if pairs of emojis are used in the same way in each country. We selected the USA, Japan, UK and Mexico amongst the top 10 countries for this study, as they broadly represent North America, Asia, Europe, and Latin America, respectively. We built four skip-gram models for each country. Our results suggest that some emojis are preserved differently across different countries, and there are some significant differences in the use of certain emojis across the countries. Table 15.1 shows Pearson’s correlation of all sentiment categories and across all culture indexes. The results show that people from high PDI countries are more likely to tweet emojis with negative scores. For instance, people from countries such as Malaysia (PDI 100), Saudi Arabia, and the Philippines (PDI 94) are more likely to use negative (NEG) emojis compared to Spain (PDI 57), Italy (PDI 50), and the UK (PDI 35). People from strong IDV and LTO countries are more likely to use positive (POS) emojis but less likely to tweet emojis with negative scores. People from strong


Multimodal Approaches in Analysing and Interpreting Big Social Media Data


Table 15.1 Pearson’s correlation of emoji sentiment and culture index

Index                                POS     NEU     NEG
Power Distance (PDI)                 0.058   0.038   0.301
Individualism (IDV)                  0.468   0.167   0.532
Masculinity (MAS)                    0.022   0.159   0.172
Uncertainty avoidance (UAI)          0.132   0.146   0.062
Long-term orientation (LTO)          0.389   0.179   0.462
Indulgence versus restraint (IVR)    0.412   0.038   0.316

IVR countries are less likely to use tweets with positive scores and more likely to use emojis in the NEG category. From the table, we can see that IDV, LTO and IVR can explain both the positive and negative factors. IDV seems to be an indicator of most of the three emoji sets, as there is a striking correlation between the individualism index and the emojis used. In societies such as the USA and UK, the expression of emotions is encouraged. Societies with a high degree of LTO view adaptation, circumstantiality, and pragmatic problem-solving as a necessity and thus have good economic development. We assume that people in such societies have fewer economic burdens and are more likely to express positive emotions. People from countries with high IVR believe themselves to be in control of their own life and emotions and tend to have fewer negative expressions. It is noted that PDI only explains one side of the emotion. In high PDI societies, hierarchy is clearly established and executed. From this interpretation, we may assume that people in such societies cannot question authority and tend to express more negative emotions. Finally, MAS and UAI tend to have little correlation with the use of emojis. This work examined how emojis are used and how cross-cultural differences influence the use of emojis in Twitter communication. The reported results suggest that the usage of emojis demonstrates varied patterns across different countries, which to a certain extent complies with Hofstede’s Cultural Dimensions Model. Our work holds a number of significant implications and novelties. First of all, it demonstrated how emoji analytics can allow us to pursue research questions and interpret results previously deemed challenging. Our study is, to the best of our knowledge, the very first cross-cultural comparison of the use of emoji based on large-scale Twitter data (673 million tweets). 
Previous works by other research groups were mainly based on small-scale user surveys. Our approach has revealed how Twitter users use emojis across different cultures. Though Twitter users may not yet represent the given country’s population, the data we have generated can serve as a great resource for explaining cross-cultural differences in prior studies. Our empirical analysis also demonstrates that emoji usage can be a useful signal to distinguish and identify users with different cultural backgrounds. The approach has great potential for application areas such as digital outreach and citizen engagement. Government bodies and marketing communications departments could, for example, analyse emojis to better understand semantics and sentiments. Scholars, on the other hand, can infer attitudes and behaviours of communications


within social and cultural studies. App designers and developers can better proﬁle and categorise users into cultural and demographic groups and provide targeted personalised services and user experience enhancements by accurately inferring their moods, status, and preferences through machine learning approaches. Similar to text, emojis have become an important component of online communication, escalated by the need to communicate with speed and effectiveness in social media. This work dealt with Twitter data from the emoji perspective, facilitating the discovery of information behind emojis and enhancing interpretation of actual Twitter sentiments. The actual meaning of a Twitter post and its sentiment can be interpreted with greater clarity using emoji analytics.
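The correlation analysis behind Table 15.1 relies on Pearson’s correlation coefficient, which can be computed as in the sketch below. The function name `pearson` is an assumption of this example, and the emoji shares in the usage lines are invented purely for illustration; only the PDI scores (100, 94, 57, 50, 35) are taken from the text above.

```python
import math


def pearson(xs, ys):
    """Pearson's r: covariance of xs and ys divided by the product of their spreads."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# PDI scores quoted in the text (Malaysia, Philippines, Spain, Italy, UK).
pdi = [100, 94, 57, 50, 35]
# HYPOTHETICAL share of negative emojis per country, for illustration only.
negative_share = [0.30, 0.28, 0.22, 0.21, 0.18]
print(round(pearson(pdi, negative_share), 3))
```

A value near +1 or -1 indicates a strong linear relationship between a culture index and an emoji category, while a value near 0 (as reported for MAS and UAI) indicates little linear relationship.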

15.3.3 Sentiment Analysis

The combination of mobility via smartphones, a platform to interact via social media, and the human need to connect and share has contributed to the Big data of user-generated contents. A barrage of useless opinions, argumentative comments, trolling and rumours, coupled with emotional reactions and posts on disparate subjects from individuals to groups, and from businesses to political rhetoric, has contributed to the build-up towards an exabyte era. Useless or otherwise, human behaviours, thoughts and interactions are recorded digitally and are made accessible to the curiosity of academics. As a result, researchers are eyeing the opportunity provided by such large volumes of data for gaining insights and perhaps advancing knowledge in their fields. The massive amounts of social media data have become an important source of study enabled by automated sentiment analysis. Sentiment analysis is an area of research which applies NLP and text analysis methods for studying attitudes of the masses, as well as emotions and opinions towards an entity, which could be a product, a target individual, an event or some important topic [52]. In essence, sentiment analysis aims to automatically obtain a quantified assessment of user-generated contents, expressed either as discrete polarities or as degrees of polarity. Traditionally, the main sources of data targeted by sentiment analysis are product reviews, which are crucial to businesses and marketers as they provide insightful evaluation of user opinions on their products [52]. Big open social media textual data such as blogs, microblogs, comments and postings have also become a larger area for the development and perfection of sentiment analysis algorithms. Sentiment analysis can also be applied in the stock market [53] and for news articles [54]. 
Opinions posted within social media platforms have helped reshape businesses and influence public sentiments in newsworthy events. Used as a channel for voices and for mitigation, they have contributed to impacting both the political [55] and social systems [55]. The few short paragraphs above have demonstrated the importance of sentiment analysis as a core area requiring automated Big data approaches, which can contribute diverse benefits where manual processing and analysis of data has become


impossible. The Big data of user-generated contents are a mandatory source of information critical to decision makers in the twenty-first century. Twitter sentiment analysis attempts to automate the classification of sentiments of messages posted on Twitter by confronting and resolving a number of challenges, including character limitations on tweets, colloquial and informal linguistic style, high frequency of misspellings and use of slang, and data sparsity, so that deeper insights can be interpreted. While various Twitter sentiment analysis techniques have been developed in recent years, most of the state-of-the-art work (see [16–18]) paid little attention to emojis. As the meaning of a post and its sentiment can be identified with greater clarity using emojis, their combination with textual sentiment analysis can provide a clearer means of accurately interpreting information presented within a message. There is presently a heightened awareness of the need for novel approaches that handle emojis and tackle current open issues in Twitter sentiment analysis, one of which is multi-class sentiment analysis, which is not well-developed for classifying tweets. Twitter sentiment analysis is another important aspect of our multimodal analysis of Twitter data. It assists data interpretation by revealing the attitudes and opinions behind Twitter posts (texts, in most cases). Here, we report on our works in constructing our own sentiment classifier, which has been used for understanding social media datasets in our research. In this section, we first report on the application of our model to real-world tweets, 7.5 million of them over two months, generated within the New York City area. Our model demonstrates how agencies can track the fluctuation of moods of city dwellers using geo-mapping techniques. 
In our work, we developed methods to automatically prepare a corpus of machine learning training data using emoticons, which removes the need for human annotators to label tweets, a task which is highly tedious. In order to deal with the informal and unstructured language used in the 140 character Twitter posts, we also created novel data pre-processing methods such as a sentiment-aware tokeniser, which assists in the detection of emoticons and the handling of lengthening. We also constructed our own spelling checker to handle errors or loosely spelled words. Our work differs from traditional approaches by considering a specific microblogging feature, the emoji, and by generating several emoji-related features to build our sentiment classification system for Twitter data. The novel feature has proved to be useful for sentiment analysis in our larger context of social media analytics [20]. In a further development [21], we investigated the feasibility of an emoji training heuristic for Twitter sentiment analysis of the 2016 U.S. Presidential Election, and improved upon our methodological framework by improving our pre-processing techniques, enhancing feature engineering (feature extraction and feature selection) and optimising the classification models. In another work, we also considered the popularity of emojis on Twitter and investigated the feasibility of an emoji training heuristic for multi-class sentiment classification of tweets. Tweets from the ‘2016 Orlando nightclub shooting’ were used as a source of study. We proposed a novel approach that classifies tweets into multiple sentiment classes. We limited our scope to five different sentiment classes, extensible to more classes as our approach is scalable. The sentiment model constructed with the automatically annotated training


sets using an emoji approach and selected features performs well in classifying tweets into ﬁve different sentiment classes, with a macro-averaged F-measure of 0.635, a macro-averaged accuracy of 0.689 and the MAEM of 0.530. Our results are signiﬁcant when compared to experimental results in related works [56], indicating the effectiveness of our model and the proposed emoji training heuristic. To the best of our knowledge, our work appears to be a pioneer in multi-class sentiment classiﬁcation on Twitter with automatic annotation of training sets using emojis. There has been little attention paid to applying Twitter sentiment analysis to monitor public attitudes towards terror attacks and gun policies. At present, the maturity of our research in this particular area has brought us into some important open issues for Twitter sentiment analysis. This includes the use of deep learning techniques, sarcasm detection, etc.
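The idea of annotating training tweets automatically from emoticons and emojis, together with the handling of lengthened words, can be sketched as follows. This is a simplified two-class illustration under stated assumptions, not the five-class system reported above: the marker lists, the rule of collapsing runs of three or more repeated characters to two, and the names `squeeze_lengthening` and `auto_label` are all hypothetical.

```python
import re

# HYPOTHETICAL marker lists; the published work uses its own emoji/emoticon sets.
POSITIVE_MARKERS = (":-)", ":)", ":D", "\U0001F60A", "\U0001F60D")
NEGATIVE_MARKERS = (":-(", ":(", "\U0001F622", "\U0001F621")


def squeeze_lengthening(token):
    """Collapse runs of 3+ identical characters to 2, e.g. 'soooo' -> 'soo'."""
    return re.sub(r"(.)\1{2,}", r"\1\1", token)


def auto_label(tweet):
    """Return (label, tokens) if the tweet carries an unambiguous marker, else None."""
    has_pos = any(marker in tweet for marker in POSITIVE_MARKERS)
    has_neg = any(marker in tweet for marker in NEGATIVE_MARKERS)
    if has_pos == has_neg:
        # No marker at all, or conflicting markers: do not use as training data.
        return None
    text = tweet
    for marker in POSITIVE_MARKERS + NEGATIVE_MARKERS:
        # Strip the marker so the classifier cannot simply memorise it.
        text = text.replace(marker, " ")
    tokens = [squeeze_lengthening(tok.lower()) for tok in text.split()]
    return ("positive" if has_pos else "negative"), tokens


print(auto_label("I looove this :)"))
print(auto_label("no marker here"))
```

Tweets for which `auto_label` returns `None` are simply left out of the training set, which trades coverage for cleaner labels.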

Fig. 15.4 Snapshot of the distribution of emotional tweets in Manhattan Island on the morning of 26th December 2015


Figure 15.4 illustrates some results of our sentiment analysis in [20]. The figure shows a snapshot of the distribution of emotional tweets in Manhattan Island during the morning of 26th December 2015. Positive tweets are represented by green dots, and negative tweets by red dots. The size of each dot represents the number of tweets posted at its specific coordinates. The combination of sentiments and map can help with the identification of citizen sentiments at specific times and locales. There was a cluster of negative tweets near 7th Avenue, for example, indicating a particular issue related to the area at the time. We are also able to dynamically track the moods of citizens by comparing the distribution of emotional tweets on maps at different times. Whilst the map may represent only a sample of citizen sentiments, it can be a promising auxiliary tool for smart city governance, as a sentiment thermometer, for communications outreach, for targeted marketing campaigns and for social sciences research. The maps we created provide a much easier to understand visual representation of our data and make it more efficient to interpret results when used for monitoring citizen sentiments.

15.4 Social Media Visual Analytics

The last decade has seen a rise in the use of visualisation within research processes, partly due to the availability of high-end GPUs, high-resolution displays and 3D graphics. Ch’ng, Gaffney and Chapman noted that visualisation has become a process rather than a product [57]. Visualisation has become part of the process of research rather than an illustration at the end of a research report. It has become increasingly important to have tools of visualisation, visualisation expertise and visual communicators within the process of research [58].

15.4.1 Social Information Landscapes

Some forms of large dataset can present a challenge if they are not visualised in a particular way. For example, our discovery of the #FreeJahar community [59] in Twitter, representing a group of teenage female supporters of the younger Boston bomber, depended mainly on interpreting a network map in various visual configurations before the group was found. Textual and emoji contents may reveal moods, emotions and trends within large social media data; however, they may be inadequate for discovering communities. This presents a good reason why multimodality is important in Big data research as far as the need for interpretation is concerned. The application of network centrality measures may provide adequate information with regard to the core agents within a network in terms of their degree, closeness, and betweenness centralities, but for the present, the identification of communities does require visual interpretation. Visualisation also contributes to the research


Fig. 15.5 A log-log plot of the top 10 degree centrality ranked Twitter agents (left), and the visualisation of a Social Information Landscape (right) with more information

process of investigating the birth, growth, evolution, establishment and decline of online communities. Figure 15.5 shows one of the larger datasets of the #FreeJahar Twitter social network constructed as an activity network and termed Social Information Landscapes [60]. Social Information Landscape is defined as and constructed by ‘the automated mapping of large topological networks from instantaneous contents, sentiments and users reconstructed from social media channels, events and user generated contents within blogs and websites, presented virtually as a graph that encompasses, within a timescale, contextual information, all connections between followers, active users, comments and conversations within a social rather than a physical space’. Here, the keyword ‘automated mapping’ is essential, as the future mapping of extremely large spatial–temporal networks will make it possible to study changes in centrality as complex networks evolve. The Social Information Landscapes approach maps activity networks or time-stamped interactions rather than follower–followee networks, which are usually static in that they remain relatively unchanged. The method configures the multimodal structure of an activity network with nodes representing agents and their contents, and directional edges representing interactions and communications between nodes. The benefits of visualising a multimodal graph as one of the modes of our multimodal approach can be seen here. In Fig. 15.5 (left), the power-law degree centrality of the #FreeJahar small-world network [61] is represented as a log-log plot. This provides some good information. But when visualised in a particular way using the social information landscape approach (right), the patterns become clearer as information is presented as relationships. The figure shows a large cluster of unconnected nodes located centrally; these are insignificant individual tweets and Twitter users. 
The large round clusters of connected nodes scattered around the space are popular Twitter accounts such as


CNN and BreakingNews, listed among the top 10 Degree Rankings on the graph to the left of Fig. 15.5. Twitter followers of these accounts are grouped with the accounts, with their retweets represented as black dots. The effective identification of the two groups due to the way the graph is configured leaves us with a small cluster flanked by two large interacting clusters at the top of the landscape. A closer inspection led to the discovery of the #FreeJahar community mentioned above [59]. This brief section has reiterated the importance of visualisation as a process of research. The need to relate multimodal data together into a network and visualise their development over time can have significant implications for the way we approach a subject matter. Human interpretation is a necessary aspect of any research, and the value of continuing to develop social information landscapes as one of the multimodal aspects of Big data-oriented research will become apparent in the future.
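The degree ranking shown on the left of Fig. 15.5 can be derived from a list of interaction edges with very little code. The sketch below is an illustration only: the edge list is invented, the helper names are assumptions, and degree is counted as the number of interactions touching an account regardless of direction.

```python
from collections import Counter


def degree_counts(edges):
    """Count how many interaction edges touch each node (in- plus out-degree)."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return degree


def degree_ranking(edges, top=10):
    """Return the `top` nodes with the highest degree as (node, degree) pairs."""
    return degree_counts(edges).most_common(top)


def normalised_degree(edges):
    """Degree divided by (n - 1); may exceed 1 if the same pair interacts repeatedly."""
    degree = degree_counts(edges)
    n = len(degree)
    if n < 2:
        return {}
    return {node: count / (n - 1) for node, count in degree.items()}


# HYPOTHETICAL activity edges (who interacted with whom), for illustration only.
edges = [
    ("user1", "CNN"),
    ("user2", "CNN"),
    ("user3", "CNN"),
    ("user3", "BreakingNews"),
    ("user4", "BreakingNews"),
]
print(degree_ranking(edges, top=3))
```

A ranking of this kind identifies the hubs, but, as argued above, it does not by itself reveal a small community sitting between them; that still required visual inspection of the landscape.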

15.4.2 Geo-Mapping Topical Data

Geo-tagged contents are one of the important features of social media. Data with location-based information can provide useful information for identifying the source of contents. Knowing the context of user-generated contents can help institutions such as government agencies and product marketers to pinpoint their audience and user base in order to take action and expand resources. Knowing the ‘where’ is as important as the ‘what’ and the ‘when’. Social media data read without context can be difficult to interpret. Visualising contextualised data on maps will, on the other hand, allow us to digest information. Our aim is to visualise location-based user-generated contents on geographical maps so as to understand the distribution of users and the context of the contents which they generated. A social media map can be understood as a graphical representation of where social media conversations and comments occurred. It can also act as an organised social media reference guide with categorical views showing sites and applications on the landscape. Social media maps have significant benefits, in that they offer a much easier visual representation of data. Social media maps can also provide contexts which lead to better ways of prioritising, planning and executing goals and objectives. Furthermore, when the content of conversations is combined with the precise locations, a more complete picture can emerge. Maps make it efficient to monitor social media trends and distribution, which ultimately leads to better interpretation of topical data. People make associations with different geographies, and we can tell a story by putting data on the map. Most of the state-of-the-art works in the field of Twitter sentiment analysis merely focus on improving the performance of classification models [16–18] but ignore the significance of visualisation of location-based sentiments on geographical maps. 
Our works merged geo-mapping into our Twitter sentiment analysis framework. In our previous work [21], topical maps were used for illustrating attitudes towards candidates of the 2016 U.S. Presidential Election across the American nation. Figure 15.6

Fig. 15.6 Snapshot of the distribution of Trump-related emotional tweets on 24th May 2016



shows a snapshot of the distribution of Trump-related emotional tweets on 24th May 2016. Positive, neutral and negative sentiments are represented by red, orange and blue dots, respectively. The size of the dots indicates the number of tweets posted at a specific location. The map indicates that the number of negative sentiments surpasses that of positive ones. From this, we may infer that one of the candidates has probably done or said something extreme. Such data would be difficult to interpret if they were displayed in a tabulated format. We have further developed geo-location processing algorithms in our work. The majority of social media posts do not have explicitly geo-tagged contents as not every user enables their location-based service, perhaps for concerns about privacy. An extremely small percentage (~1%) of Twitter accounts published geo-tagged contents. As such, we wrote algorithms that automatically obtain the location string (city/state) from the user profiles and translate it into geographical coordinates using a gazetteer. The mapping of social media contents can be useful, in that marketing as a key aspect of businesses could benefit from visualising customer opinions on social media maps. By identifying regional issues with products, resources could be better managed; marketing strategies in the regions could be fine-tuned for customer satisfaction. Researchers in sociology are increasingly using social media maps to study societal problems. Relationships between social media data and geographic information could be mapped. For instance, Mitchell et al. [62] plotted the average word happiness for geo-tagged tweets of all states in the USA on the national map so as to examine how happiness varied in different states. Visualisation of sentiments using maps is an important mode within the greater framework of multimodality. 
For example, our work here is applied to visualise the relationships between sentiments and beverages with different levels of alcohol strength (beer, wine, gin) at different time periods in order to examine and interpret the different cultures of drinking across the world [63].
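The gazetteer step described above, which turns a free-text profile location into coordinates, can be sketched as a normalise-then-look-up routine. The tiny `GAZETTEER` table, the function name `geocode_profile_location` and the matching rules are assumptions made for this illustration; a production gazetteer would be far larger and would need to handle aliases and ambiguous names.

```python
# HYPOTHETICAL three-entry gazetteer keyed by (city, state); values are (lat, lon).
GAZETTEER = {
    ("new york", "ny"): (40.7128, -74.0060),
    ("orlando", "fl"): (28.5383, -81.3792),
    ("boston", "ma"): (42.3601, -71.0589),
}


def geocode_profile_location(location):
    """Map a profile string such as 'New York, NY' to (lat, lon), or None if unknown."""
    parts = [part.strip().lower() for part in location.split(",") if part.strip()]
    if not parts:
        return None
    if len(parts) >= 2:
        # City and state given: exact lookup.
        return GAZETTEER.get((parts[0], parts[1]))
    # City only: accept it when exactly one gazetteer entry has that city name.
    matches = [coords for (city, _state), coords in GAZETTEER.items() if city == parts[0]]
    return matches[0] if len(matches) == 1 else None


print(geocode_profile_location("New York, NY"))
print(geocode_profile_location("somewhere over the rainbow"))
```

Returning `None` for unmatched strings keeps jokes and vague entries in the location field off the map instead of placing them at a guessed position.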

15.4.3 GPU Accelerated Real-Time Data Processing

Data comes in many unstructured formats, requiring a tedious process of cleaning, patching, filling gaps and structuring as a very first phase of work, before it is even ready for an initial inspection. When decisions need to be made in real-time, and based on real-world user-generated data, data processing time becomes a challenge. The accessibility of GPGPUs for distributing resources and parallelising tasks has become mandatory for any Big data research. In this section, we present a foundational system for testing real-time visualisation of information as another mode within our multimodal framework for assisting data interpretation. We use native code and APIs, drawing on our collection of textual data related to the 2016 U.S. Presidential Election. We integrated NVIDIA Compute Unified Device Architecture (CUDA) with OpenGL for achieving our aims. We used CUDA for GPU accelerated textual processing, and OpenGL for


visualising information in real-time. We distributed our large-scale dataset into CUDA device memory and parallelised a simple sentiment analysis algorithm across device blocks and threads. Processed data were returned to the host and pushed through the OpenGL rendering pipeline. As far as we know, our work may be the first to implement sentiment processing and real-time visualisation of social media sentiments using GPU acceleration. The majority of past and present GPU accelerated works are numerical in nature (e.g., visual computing and AI). Text-related research is rare. GPU works related to text processing began about a decade ago and were developed for early GPUs. For example, the Parallel Cmatch algorithm was used for string matching on genomes and chromosome sequences [64]. Another GPU string matching approach, a simplified version of the Knuth–Morris–Pratt online string matching algorithm, was tested with the Cg programming language and used for offloading an intrusion detection system onto the GPU, with marginal performance gains [65]. Intrusion detection on a set of synthetic payload traces and real network traffic with the Aho–Corasick, Knuth–Morris–Pratt and Boyer–Moore algorithms in early versions of CUDA [66] yielded a significant increase in performance. Around 2009, various GPU-accelerated textual processing works using CUDA were evaluated. Zhang et al. [67] were the first to develop optimised text search algorithms for GPUs using the term frequency/inverse document frequency (TFIDF) rank search, exploiting their potential for massive data processing. Onsjo et al. [68] used CUDA for string matching on the DNA of the fruit fly, achieving a speed-up of 50 times on the NVIDIA T1 GPU relative to a 2.4 GHz AMD Opteron, using a nondeterministic finite automaton (NFA) with binary states encoded into a small number of computer words. Kouzinopoulos et al. 
[69] were the first to evaluate the performance of the Naive, Knuth–Morris–Pratt, Boyer–Moore–Horspool and Quick Search online exact string matching algorithms in CUDA on early GPGPUs (NVIDIA GTX 280 with 30 multiprocessors and 240 cores), used for locating all the appearances of a pattern in a set of reference DNA sequences. GPU acceleration for text or string-related processing via CUDA has not been explored thoroughly, and the existing studies were conducted a decade ago. Furthermore, although these studies centre on text strings, they deal with either genomes or Internet documents. We felt that the need to explore social media data with GPU acceleration has become important, with many application areas, especially when GPU acceleration can be integrated with real-time visualisation. The old English idiom ‘A picture is worth a thousand words’ can be meaningfully applied here when quick decisions are to be made, in real-time, and based on consumable information facilitated by visualisation. An overview of the flow of data in our GPU-accelerated sentiment processing and visualisation system is shown in Fig. 15.7. We originally intended to use the CPU to pre-process our dataset by deleting any non-English words or characters and emojis, but opted not to as it would slow down our system and thus defeat our aim of achieving real-time visualisation with incoming streams in future work. For this work, we prepared a set of data containing:


Fig. 15.7 An overview of the ﬂow of data in our GPU accelerated sentiment processing and visualisation system

• 3,353,366 lines of tweets
• Each tweet containing 21 attributes
• A continent list with numbers representing large cities
  – Africa: 126
  – Asia: 167
  – Europe: 142
  – North America: 205
  – South America: 45
  – Oceania: 89
• A positive word list containing 290 words
• A negative word list containing 271 words
• A list containing the length of the char arrays above

Each of the data items above was loaded into a 2D char array in the host (CPU) and copied to the global memory of the device (GPU), which was allocated with cudaMalloc. The device does not support C/C++ string libraries or text-processing functions (e.g., length, indexing, etc.); as such, char arrays were used, and our text processing algorithms were written natively. Within the Device, we distributed our data with N lines into a number of blocks and threads within our GPU (Tesla K80) with 24GB memory. We used our natively written code for each non-atomic operation within the Device in order to extract 21 tweet attributes from our dataset, out of which we selected three main attributes: time, tweet and time zone. From the tweet attribute we extracted all keywords. We moved on to parallelisation by distributing our algorithms across Device blocks and threads based on the number of lines of text. Each parallel algorithm compares the tweet keywords with our negative and positive keywords. Each sentiment is calculated and summed. Another Device algorithm extracted the time-zone attribute of each tweet and assigned it to the locale of the country for our OpenGL visualisation at the Host. Figure 15.8 illustrates our Host-Device-Host-OpenGL kernels and processes. Structured data are loaded into the Host memory from either our database or streamed from social media API


Fig. 15.8 An illustration of our Host-Device-Host-OpenGL kernels and processes

(1). The data is parsed and allocated to Device memory. Each Kernel n is Device code distributed across blocks and threads, and each Process n is a Host process. A kernel can be a non-atomic task, e.g., kernel 0 parses and compares keywords, and also assigns sentiments to a locale. The calculated data are then passed back to the Host Process 0 for OpenGL rendering (2), if there is no further kernel-process interaction. There will be additional Kernels and Processes involved in future, more complex scenarios. Results of our parallel and distributed development are very promising [70]. We achieved a speed that is capable of receiving real-time inputs from Twitter, the world’s most active and largest social media platform. Our GPU accelerated processing reached a speed of 43,057 lines per second, whereas only about 6000 tweets are generated every second on average [71], and this includes visualisation via our OpenGL graphing interface (Fig. 15.9). The volume and velocity of social media data are highly unpredictable and depend on both global and local events. Viral contents can potentially reach >40,000 tweets per second, which is unlikely to slow down our real-time processing and rendering. Our GPU parallel algorithm achieved significant results when the amount of data pushed through to the GPU is measured against the time it took to process the data. Figure 15.10 illustrates the performance comparison. The ability to process and automatically analyse large amounts of data and present them in a human consumable format has implications for quick decision making in various agencies.
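The per-line keyword comparison and the aggregation by locale can be sketched sequentially in Python. This is not the CUDA implementation described above; it is a minimal illustration in which the word lists, the time-zone table and the scoring rule (positive matches minus negative matches) are assumptions. Each call to `score_line` depends only on its own line, which is the property that allows one line to be assigned to one GPU thread.

```python
# HYPOTHETICAL word lists; the system described above uses 290 positive and 271 negative words.
POSITIVE_WORDS = {"good", "great", "win"}
NEGATIVE_WORDS = {"bad", "sad", "lose"}

# HYPOTHETICAL mapping from a tweet's time-zone attribute to a continent.
TIMEZONE_TO_CONTINENT = {
    "London": "Europe",
    "Tokyo": "Asia",
    "Eastern Time (US & Canada)": "North America",
}


def score_line(tweet):
    """Positive keyword matches minus negative keyword matches for one tweet."""
    words = tweet.lower().split()  # naive whitespace tokenisation, punctuation not stripped
    positive = sum(1 for word in words if word in POSITIVE_WORDS)
    negative = sum(1 for word in words if word in NEGATIVE_WORDS)
    return positive - negative


def aggregate_by_continent(records):
    """Sum per-tweet scores per continent; records are (time, tweet, time_zone) tuples."""
    totals = {}
    for _time, tweet, time_zone in records:
        continent = TIMEZONE_TO_CONTINENT.get(time_zone, "Unknown")
        totals[continent] = totals.get(continent, 0) + score_line(tweet)
    return totals


records = [
    ("09:00", "great day", "London"),
    ("09:01", "bad bad news", "London"),
    ("09:02", "win", "Tokyo"),
]
print(aggregate_by_continent(records))
```

On the reported figures, 43,057 processed lines per second against an average of about 6000 tweets per second leaves roughly a sevenfold margin for bursts.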


Fig. 15.9 An OpenGL interface rendering activity data, sentiment comparisons, and sentiment by continents (from [70])



Fig. 15.10 GPU textual processing time (y axis) and the density of textual contents (lines of texts as volume in the x axis) (from [70])

15.4.4 Navigating and Examining Social Information in VR

Information can be communicated very efficiently in the 2D mode using graphs and illustrations. Even 3D data can be looked at from a 2D perspective, i.e., as pixels on displays, or by allowing for an added dimension of interaction using interactive 3D. However, these modes of viewing and interpreting data place the viewer, in a sense, as a spectator on the ‘outside’. As we have come to realise, as data becomes larger than our screens can fit, there may be a need to get ‘inside’ the data through immersive displays. In this section, we report on a pilot development of our immersive data environment. We restructured and partitioned spatial-temporal Twitter data within a virtual environment and developed Virtual Reality and Mixed Reality interfaces for users to view and navigate through data that are too large for 2D displays, the notion of which is to allow interpreters to be part of the data, living within them using a ‘phenomenological approach’ to data analysis. In this mode of work to assist with interpretation of spatial data, we believe that having an extra dimension in the analysis of data within an unlimited and extensible virtual 3D space can be beneficial. We hypothesise that better interpretation can be made possible by allowing a viewer to navigate and interact with data within the space and time of a given subject. With this in mind, we set out to restructure Twitter data in a way which will allow them to be viewed and interacted with in a virtual environment, and provided an immersive display using the Oculus Rift, with the LeapMotion controller as an input and navigation device, essentially allowing the hand and gestures to be used for manipulating data. More features of a dataset can definitely be represented within a 3D space provided that large datasets are managed as chunks. With this in mind, we constructed an Octree data structure and assigned 3D coordinates to each of the

15

Multimodal Approaches in Analysing and Interpreting Big Social Media Data


Fig. 15.11 The immersive virtual environment for interpreting data using the Oculus Rift and LeapMotion Controller

tweets. We tested two methods for streaming tweet data into a virtual environment using the Unity3D IDE. The first method accesses our database directly, whilst the second pre-loads data through a local Web Server. We adopted the local Web Server because its speed reached 20 MB/s, whereas direct database access peaked at 8 MB/s. As real-time visualisation may be needed, the second approach is more viable. We represented each tweet as a sphere, loaded initially into the virtual environment with equal distribution across the space (Fig. 15.11a), together with arbitrary Twitter profile attributes. Depending on our interpretation, we are able to manipulate the tweets by selecting, rearranging and scaling them as nodes into clusters based on, for example, gender, geo-location or followers. The idea is to allow users to navigate, manipulate and visualise data within an immersive virtual environment without being confined to a 2D space. An additional mini map provides users with a reference point with regard to their position within the environment (Fig. 15.11b). Predefined gestures were programmed for navigation and interaction using the LeapMotion controller, such as selecting a data node by pinching with the index finger and the thumb. A node that is pinched produces a sound to indicate that it has been selected. Nodes which have been examined change colour, similar to a visited web link.
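The chunking idea behind the Octree mentioned above can be illustrated with a small sketch. This is not the authors' Unity3D implementation; it is a minimal Python illustration under assumptions of our own: the class name `OctreeNode`, the `capacity` and `max_depth` parameters, and the `query_cube` method are all hypothetical. Each tweet is given a 3D coordinate, a cell splits into eight children once it holds more than `capacity` tweets, and a cube query returns only the tweets near a viewer so that the rest of the dataset need not be loaded.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, float, float]


@dataclass
class OctreeNode:
    """One cubic cell of an octree holding (coordinate, tweet_id) pairs."""
    center: Point
    half_size: float
    capacity: int = 4          # hypothetical chunk size before a cell splits
    depth: int = 0
    max_depth: int = 8         # stops endless splitting on duplicate points
    items: List[Tuple[Point, str]] = field(default_factory=list)
    children: Optional[List["OctreeNode"]] = None

    def contains(self, p: Point) -> bool:
        return all(abs(p[i] - self.center[i]) <= self.half_size for i in range(3))

    def _child_index(self, p: Point) -> int:
        # One bit per axis: set when the point lies on the positive side.
        cx, cy, cz = self.center
        return int(p[0] >= cx) + 2 * int(p[1] >= cy) + 4 * int(p[2] >= cz)

    def _subdivide(self) -> None:
        h = self.half_size / 2
        cx, cy, cz = self.center
        self.children = []
        for i in range(8):
            offset = (h if i & 1 else -h, h if i & 2 else -h, h if i & 4 else -h)
            child_center = (cx + offset[0], cy + offset[1], cz + offset[2])
            self.children.append(
                OctreeNode(child_center, h, self.capacity, self.depth + 1, self.max_depth)
            )
        # Push the tweets stored in this cell down into the new children.
        for p, tweet_id in self.items:
            self.children[self._child_index(p)]._insert(p, tweet_id)
        self.items = []

    def _insert(self, p: Point, tweet_id: str) -> None:
        if self.children is None:
            if len(self.items) < self.capacity or self.depth >= self.max_depth:
                self.items.append((p, tweet_id))
                return
            self._subdivide()
        self.children[self._child_index(p)]._insert(p, tweet_id)

    def insert(self, p: Point, tweet_id: str) -> bool:
        """Insert a tweet; returns False if the point is outside this cell."""
        if not self.contains(p):
            return False
        self._insert(p, tweet_id)
        return True

    def query_cube(self, center: Point, radius: float) -> List[str]:
        """Return ids of tweets inside the axis-aligned cube around `center`."""
        # Skip whole cells that cannot overlap the query cube.
        if any(abs(center[i] - self.center[i]) > self.half_size + radius for i in range(3)):
            return []
        if self.children is None:
            return [tid for p, tid in self.items
                    if all(abs(p[i] - center[i]) <= radius for i in range(3))]
        found: List[str] = []
        for child in self.children:
            found.extend(child.query_cube(center, radius))
        return found
```

In use, a root cell would be created around the whole scene, for instance `OctreeNode(center=(0.0, 0.0, 0.0), half_size=100.0)`, tweets added with `insert`, and `query_cube` called with the viewer's position to fetch only the nearby chunk. The cell size, capacity and query shape are design choices that would need tuning against the actual data density.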


E. Ch’ng et al.

Fig. 15.12 VR headset and gesture-based navigation and interaction with social media data during one of the development sessions

After a node is picked, a database query follows and the tweet information is shown within the virtual environment (Fig. 15.11c). If the nodes have relationships such as connected users, tweets or retweets, they will be clustered together (Fig. 15.11d). We also provided a simple gesture-based menu system which appears on the virtual ﬁnger tips for easy access to tools which we may add in our future development (Fig. 15.11e, f). Figure 15.12 shows the use of the Oculus VR headset and hand gestures during the development of the system.
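The grouping of related nodes described above can be sketched as a connected-components computation. The chapter does not state which algorithm the system uses, so the union-find approach below is an assumption, and the function name `cluster_by_relationship` and the sample identifiers are hypothetical. Nodes are tweet or user ids, and each edge stands for a relationship such as a retweet or a connection between users.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def cluster_by_relationship(
    node_ids: Iterable[str], edges: Iterable[Tuple[str, str]]
) -> List[List[str]]:
    """Group nodes into clusters of directly or indirectly related nodes.

    Every id used in `edges` is assumed to appear in `node_ids`.
    """
    nodes = list(node_ids)
    parent: Dict[str, str] = {n: n for n in nodes}

    def find(n: str) -> str:
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Merge the clusters of every related pair.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[str, List[str]] = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical sample: t2 retweets t1, t3 replies to t2, t5 retweets t4.
clusters = cluster_by_relationship(
    ["t1", "t2", "t3", "t4", "t5", "t6"],
    [("t1", "t2"), ("t2", "t3"), ("t4", "t5")],
)
```

Each resulting cluster could then be positioned as one group of spheres in the scene, with unrelated nodes such as `t6` left on their own.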

15.5 Conclusion

In this chapter we provided an argument on the relativity of the term ‘Big’ in the context of Big data research and discussed the perception and attitude of the community in relation to our experience. We also discussed the ambiguity of the use of the term ‘multimodality’ and elaborated on why we approach multimodal analytics for Big data in such a way. Our chapter focuses not on the issue of storage and processing, but on the need to provide a multimodal approach to examining the same set of data from different modes so as to facilitate human interpretation. We demonstrated, through six related modes of work, that a single source of social media data could be examined with very different modalities, each mode being a new computational approach requiring a different degree of scalability. The methods which we have developed, the data we have acquired and used, and the software and hardware we assembled are inexpensive and highly accessible, and do not present a barrier to small labs and businesses.

Acknowledgments

The author acknowledges the financial support from the International Doctoral Innovation Centre, Ningbo Education Bureau, Ningbo Science and Technology Bureau, and the University of Nottingham. This work was also supported by the UK Engineering and Physical Sciences Research Council [grant number EP/L015463/1]. The equipment provided by NVIDIA is


greatly appreciated; without it, the freedom of exploratory research and innovation may be constrained to funded projects only.

References

1. Manovich, L.: Trending: the promises and the challenges of big social data. Debate. Digit. Humanit. 460–475 (2011)
2. Boyd, D., Crawford, K.: Six provocations for big data. A decade in internet time: symposium on the dynamics of the internet and society, Soc. Sci. Res. Netw., New York (2011)
3. Marr, B.: The complete beginner’s guide to big data in 2017. Forbes Technology (2017)
4. Carter, J.: Why big data is crude oil – while rich data is refined, and the ultimate in BI. Techradar.com (2015)
5. Gitelman, L.: Raw data is an oxymoron. MIT Press, Cambridge (2013)
6. Loukides, M.: What is data science? O’Reilly Media, Inc., Newton (2011)
7. Lindstrom, M.: Small data: the tiny clues that uncover huge trends. St. Martin’s Press, New York (2016)
8. Veinot, T.C.: ‘The eyes of the power company’: workplace information practices of a vault inspector. Libr. Q. 77(2), 157–179 (2007)
9. Avoyan, A.H.: Machine learning is creating a demand for new skills. Forbes Technology (2017)
10. Peshkin, A.: The nature of interpretation in qualitative research. Educ. Res. 29(9), 5–9 (2000)
11. Roy, A.K.: Applied big data analytics. Paperback. CreateSpace Independent Publishing Platform, ISBN-10: 1512347183 (2015)
12. Jacobs, A.: The pathologies of big data. Commun. ACM. 52(8), 36–44 (2009)
13. Ch’ng, E.: The value of using big data technology in computational social science. In: The 3rd ASE big data science 2014, Tsinghua University 4–7 August, pp. 1–4 (2014)
14. Mergel, A.: A framework for interpreting social media interactions in the public sector. Gov. Inf. Q. 30(4), 327–334 (2013)
15. Stieglitz, S., Dang-Xuan, L.: Social media and political communication: a social media analytics framework. Soc. Netw. Anal. Min. 3(4), 1277–1291 (2013)
16. Koto, F., Adriani, M.: A comparative study on twitter sentiment analysis: which features are good?, vol. 6177 (2010)
17. Barbosa, L., Junlan, F.: Robust sentiment detection on twitter from biased and noisy data. no. August, pp. 36–44 (2010)
18. Kouloumpis, E., Wilson, T., Moore, J.: Twitter sentiment analysis: the good the bad and the omg!. Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media (ICWSM 11), pp. 538–541 (2011)
19. Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion mining. LREC. 1320–1326 (2010)
20. Li, M., Ch’ng, E., Chong, A., See, S.: The new eye of smart city: novel citizen sentiment analysis in Twitter. In: The 5th International Conference on Audio, Language and Image Processing (2016)
21. Li, M., Ch’ng, E., Chong, A., See, S.: Twitter sentiment analysis of the 2016 U.S. Presidential Election using an emoji training heuristic. In: Applied Informatics and Technology Innovation Conference (AITIC) (2016)
22. McMillan, D.W., Chavis, D.M.: Sense of community: a definition and theory. J. Community Psychol. 14(1), 6–23 (1986)
23. McMillan, D.: Sense of community: an attempt at definition. George Peabody College for Teachers, Nashville (1976)



24. Maloney-Krichmar, D., Preece, J.: A multilevel analysis of sociability, usability, and community dynamics in an online health community. ACM Trans. Comput. Hum. Interact. 12(2), 201–232 (2005)
25. Bloch, M.: The historian’s craft. Manchester University Press (1992)
26. Gurin, J.: Big data and open data: what’s what and why does it matter? [Online]. https://www.theguardian.com/public-leaders-network/2014/apr/15/big-data-open-data-transform-government (2014). Accessed 28 Aug 2017
27. Davis, D.: How big data is transforming architecture. architecturemagazine.com (2015)
28. Khan, A., Hornbæk, K.: Big data from the built environment. In: Proceedings of the 2nd international workshop on research in the large, pp. 29–32 (2011)
29. Maeda, J.: Artists and scientists: more alike than different. Sci. Am. 2016 (2013)
30. Kritzer, H.M.: The data puzzle: the nature of interpretation in quantitative research. Am. J. Polit. Sci. 40, 1–32 (1996)
31. Taylor, A.: Philosophical Papers: Volume 2, Philosophy and the Human Sciences, vol. 2. Cambridge University Press (1985)
32. T. F. F. B. D. Commission: Demystifying big data: a practical guide to transforming the business of government [Online]. https://bigdatawg.nist.gov/_uploadfiles/M0068_v1_3903747095.pdf (2012)
33. Gartner: What is big data [Online]. https://research.gartner.com/definition-whatis-big-data?resId=3002918&srcId=1-8163325102 (2017)
34. Pavalanathan, U., Eisenstein, J.: Emoticons vs. emojis on twitter: a causal inference approach. AAAI Spring Symposium on Observational Studies Through Social Media and Other Human-Generated Content (2016)
35. Kalman, Y.M., Gergle, D.: Letter repetitions in computer-mediated communication: a unique link between spoken and online language. Comput. Hum. Behav. 34, 187–193 (2014)
36. Dresner, E., Herring, S.C.: Functions of the nonverbal in CMC: emoticons and illocutionary force. Commun. Theory. 20(3), 249–268 (2010)
37. Miller, H., Thebault-Spieker, J., Chang, S., Johnson, I., Terveen, L., Hecht, B.: ‘Blissfully happy’ or ‘ready to fight’: varying interpretations of emoji. GroupLens Research, University of Minnesota (2015)
38. Kelly, R., Watts, L.: Characterising the inventive appropriation of emoji as relationally meaningful in mediated close personal relationships. Experiences of technology appropriation: unanticipated users, usage, circumstances, and design (2015)
39. Mosendz, P.: Why the white house is using emoji [Online]. http://www.theatlantic.com/technology/archive/2014/10/why-the-white-house-is-using-emojis/381307/ (2014). Accessed 05 Aug 2017
40. Steinmetz, K.: Oxford’s 2015 word of the year is this emoji. TIME [Online]. http://time.com/4114886/oxford-word-of-the-year-2015-emoji/ (2016). Accessed 12 Jun 2016
41. Chen, Z., Lu, X., Shen, S., Ai, W., Liu, X., Mei, Q.: Through a gender lens: an empirical study of emoji usage over large-scale android users, pp. 1–20 (2017)
42. Vidal, L., Ares, G., Jaeger, S.R.: Use of emoticon and emoji in tweets for food-related emotional expression. Food Qual. Prefer. 49, 119–128 (2016)
43. Novak, P.K., Smailovic, J., Sluban, B., Mozetic, I.: Sentiment of emojis. PLoS One. 10(12), 1–19 (2015)
44. He, W., Wu, H., Yan, G., Akula, V., Shen, J.: A novel social media competitive analytics framework with sentiment benchmarks. Inf. Manag. 52(7), 801–812 (2015)
45. Park, J., Baek, Y.M., Cha, M.: Cross-cultural comparison of nonverbal cues in emoticons on twitter: evidence from big data analysis. J. Commun. 64(2), 333–354 (2014)
46. Lim, S.L., Bentley, P.J., Kanakam, N., Ishikawa, F., Honiden, S.: Investigating country differences in mobile app user behavior and challenges for software engineering. IEEE Trans. Softw. Eng. 41(1), 40–64 (2015)
47. Reinecke, K., Bernstein, A.: Improving performance, perceived usability, and aesthetics with culturally adaptive user interfaces. ACM Trans. Comput. Hum. Interact. 18(2), 8 (2011)


48. Jack, R.E., Blais, C., Scheepers, C., Schyns, P.G., Caldara, R.: Cultural confusions show that facial expressions are not universal. Curr. Biol. 19(18), 1543–1548 (2009)
49. Lu, X., Ai, W., Liu, X., Li, Q., Wang, N., Huang, G., Mei, Q.: Learning from the ubiquitous language: an empirical analysis of emoji usage of smartphone users. In: Proceedings of Ubicomp, pp. 770–780. ACM (2016)
50. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013)
51. Hofstede, G.: Hofstede insights [Online]. https://www.hofstede-insights.com (2017). Accessed 05 Aug 2017
52. Medhat, W., Hassan, A., Korashy, H.: Sentiment analysis algorithms and applications: a survey. Ain Shams Eng. J. 5(4), 1093–1113 (2014)
53. Yu, L.C., Wu, J.L., Chang, P.C., Chu, H.S.: Using a contextual entropy model to expand emotion words and their intensity for the sentiment classification of stock market news. Knowl.-Based Syst. 41, 89–97 (2013)
54. Xu, T., Peng, Q., Cheng, Y.: Identifying the semantic orientation of terms using S-HAL for sentiment analysis. Knowl.-Based Syst. 35, 279–289 (2012)
55. Liu, A.: Sentiment analysis and opinion mining. Synth. Lect. Hum. Lang. Technol. 5(1), 1–167 (2012)
56. Nakov, P., Ritter, A., Rosenthal, S., Stoyanov, V., Sebastiani, F.: SemEval-2016 task 4: sentiment analysis in Twitter. Proceedings of the 10th International Workshop on Semantic Evaluation, pp. 1–18 (2016)
57. Ch’ng, V., Gaffney, L., Chapman, H. P.: From product to process: new directions in digital heritage. In: Din, H., Wu, S. (eds.) Digital Heritage and Culture: Strategy and Implementation, 1st ed. World Scientific, pp. 219–243 (2014)
58. Frankel, F., Reid, R.: Big data: distilling meaning from data. Nature. 455(7209), 30 (2008)
59. Ch’ng, E.: The bottom-up formation and maintenance of a twitter community: analysis of the #FreeJahar twitter community. Ind. Manag. Data Syst. 115(4), 612–624 (2015)
60. Ch’ng, E.: Social information landscapes: automated mapping of large multimodal, longitudinal social networks. Ind. Manag. Data Syst. 115(9), 1724–1751 (2015)
61. Ch’ng, E.: Local interactions and the emergence and maintenance of a twitter small-world network. Soc. Networking. 4(2), 33–40 (2015)
62. Mitchell, L., Frank, M.R., Harris, K.D., Dodds, P.S., Danforth, C.M.: The geography of happiness: connecting twitter sentiment and expression, demographics, and objective characteristics of place. PLoS One. 8(5), e64417 (2013)
63. Li, M., Ch’ng, E., Li, B., Zhai, S.: Social-cultural monitoring of smart cities using big data methods: alcohol. In: The Third International Conference on Smart Sustainable City and Big Data (ICSSC) (2015)
64. Schatz, M., Trapnell, C.: Fast exact string matching on the GPU. Center for Bioinformatics and Computational Biology (2007)
65. Jacob, N., Brodley, C.: Offloading IDS computation to the GPU. In: Proceedings of the 22nd Annual Computer Security Applications Conference IEEE ACSAC’06, pp. 371–380 (2006)
66. Vasiliadis, A., Antonatos, S., Polychronakis, M., Markatos, E.P., Ioannidis, S.: Gnort: high performance network intrusion detection using graphics processors. In: Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection, pp. 116–134 (2008)
67. Zhang, Y., Mueller, F., Cui, X., Potok, T.: GPU-accelerated text mining. In: Workshop on exploiting parallelism using GPUs and other hardware-assisted methods, pp. 1–6 (2009)
68. Onsjo, M., Aono, Y., Watanabe, O.: Online approximate string matching with CUDA, [2009-12-18]. http://pds13.egloos.com/pds/200907/26/57/pattmatch-report.pdf (2009)
69. Kouzinopoulos, A.S., Margaritis, K.G.: String matching on a multicore GPU using CUDA. In: Informatics. PCI’09. 13th Panhellenic Conference on, 2009, pp. 14–18 (2009)
70. Ch’ng, E., Chen, Z., See, S.: Real-time GPU-accelerated social media sentiment processing and visualization. In: The 21st IEEE International Symposium on Distributed Simulation and Real Time Applications (DS-RT). October 18–20, 2017
71. ILS: Twitter usage statistics. InternetLiveStats.com (2017)