English Pages 909 Year 2021
Lecture Notes in Networks and Systems 294
Kohei Arai Editor
Intelligent Systems and Applications Proceedings of the 2021 Intelligent Systems Conference (IntelliSys) Volume 1
Lecture Notes in Networks and Systems Volume 294
Series Editor Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Advisory Editors Fernando Gomide, Department of Computer Engineering and Automation—DCA, School of Electrical and Computer Engineering—FEEC, University of Campinas— UNICAMP, São Paulo, Brazil Okyay Kaynak, Department of Electrical and Electronic Engineering, Bogazici University, Istanbul, Turkey Derong Liu, Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, USA; Institute of Automation, Chinese Academy of Sciences, Beijing, China Witold Pedrycz, Department of Electrical and Computer Engineering, University of Alberta, Alberta, Canada; Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland Marios M. Polycarpou, Department of Electrical and Computer Engineering, KIOS Research Center for Intelligent Systems and Networks, University of Cyprus, Nicosia, Cyprus Imre J. Rudas, Óbuda University, Budapest, Hungary Jun Wang, Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
The series “Lecture Notes in Networks and Systems” publishes the latest developments in Networks and Systems—quickly, informally and with high quality. Original research reported in proceedings and post-proceedings represents the core of LNNS. Volumes published in LNNS embrace all aspects and subfields of, as well as new challenges in, Networks and Systems. The series contains proceedings and edited volumes in systems and networks, spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and other. Of particular value to both the contributors and the readership are the short publication timeframe and the world-wide distribution and exposure which enable both a wide and rapid dissemination of research output. The series covers the theory, applications, and perspectives on the state of the art and future developments relevant to systems and networks, decision making, control, complex processes and related areas, as embedded in the fields of interdisciplinary and applied sciences, engineering, computer science, physics, economics, social, and life sciences, as well as the paradigms and methodologies behind them. Indexed by SCOPUS, INSPEC, WTI Frankfurt eG, zbMATH, SCImago. All books published in the series are submitted for consideration in Web of Science.
More information about this series at http://www.springer.com/series/15179
Editor Kohei Arai Faculty of Science and Engineering Saga University Saga, Japan
ISSN 2367-3370 ISSN 2367-3389 (electronic) Lecture Notes in Networks and Systems ISBN 978-3-030-82192-0 ISBN 978-3-030-82193-7 (eBook) https://doi.org/10.1007/978-3-030-82193-7 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Editor’s Preface
We are very pleased to introduce the Proceedings of Intelligent Systems Conference (IntelliSys) 2021 which was held on September 2 and 3, 2021. The entire world was affected by COVID-19 and our conference was not an exception. To provide a safe conference environment, IntelliSys 2021, which was planned to be held in Amsterdam, Netherlands, was changed to be held fully online. The Intelligent Systems Conference is a prestigious annual conference on areas of intelligent systems and artificial intelligence and their applications to the real world. This conference not only presented the state-of-the-art methods and valuable experience, but also provided the audience with a vision of further development in the fields. One of the meaningful and valuable dimensions of this conference is the way it brings together researchers, scientists, academics, and engineers in the field from different countries. The aim was to further increase the body of knowledge in this specific area by providing a forum to exchange ideas and discuss results, and to build international links. The Program Committee of IntelliSys 2021 represented 25 countries, and authors from 50+ countries submitted a total of 496 papers. This certainly attests to the widespread, international importance of the theme of the conference. Each paper was reviewed on the basis of originality, novelty, and rigorousness. After the reviews, 195 were accepted for presentation, out of which 180 (including 7 posters) papers are finally being published in the proceedings. These papers provide good examples of current research on relevant topics, covering deep learning, data mining, data processing, human–computer interactions, natural language processing, expert systems, robotics, ambient intelligence to name a few. 
The conference would truly not function without the contributions and support received from authors, participants, keynote speakers, program committee members, session chairs, organizing committee members, steering committee members, and others in their various roles. Their valuable support, suggestions, dedicated commitment, and hard work have made IntelliSys 2021 successful. We warmly thank and greatly appreciate the contributions, and we kindly invite all to continue to contribute to future IntelliSys.
We believe this event will certainly help further disseminate new ideas and inspire more international collaborations. Kind Regards, Kohei Arai
Contents
Late Fusion of Convolutional Neural Network with Wavelet-Based Ensemble Classifier for Acoustic Scene Classification
Cheng Siong Chin and Jianhua Zhang

Deep Learning and Social Media for Managing Disaster: Survey
Zair Bouzidi, Abdelmalek Boudries, and Mourad Amad

A Framework for Adaptive Mobile Ecological Momentary Assessments Using Reinforcement Learning
Lihua Cai, Laura E. Barnes, and Mehdi Boukhechba

Reputation Analysis Based on Weakly-Supervised Bi-LSTM-Attention Network
Kun Xiang and Akihiro Fujii

Multi-GPU-based Convolutional Neural Networks Training for Text Classification
Imen Ferjani, Minyar Sassi Hidri, and Ali Frihida

Performance Analysis of Data-Driven Techniques for Solving Inverse Kinematics Problems
Vijay Bhaskar Semwal and Yash Gupta

Machine Learning Based H2 Norm Minimization for Maglev Vibration Isolation Platform
Ahmet Fevzi Bozkurt, Barış Can Yalçın, and Kadir Erkan

A Vision Based Deep Reinforcement Learning Algorithm for UAV Obstacle Avoidance
Jeremy Roghair, Amir Niaraki, Kyungtae Ko, and Ali Jannesari

Detecting and Fixing Nonidiomatic Snippets in Python Source Code with Deep Learning
Balázs Szalontai, András Vadász, Zsolt Richárd Borsi, Teréz A. Várkonyi, Balázs Pintér, and Tibor Gregorics
BreakingBED: Breaking Binary and Efficient Deep Neural Networks by Adversarial Attacks
Manoj-Rohit Vemparala, Alexander Frickenstein, Nael Fasfous, Lukas Frickenstein, Qi Zhao, Sabine Kuhn, Daniel Ehrhardt, Yuankai Wu, Christian Unger, Naveen-Shankar Nagaraja, and Walter Stechele

Parallel Dilated CNN for Detecting and Classifying Defects in Surface Steel Strips in Real-Time
Khaled R. Ahmed

Selective Information Control and Network Compression in Multi-layered Neural Networks
Ryotaro Kamimura

DAC–Deep Autoencoder-Based Clustering: A General Deep Learning Framework of Representation Learning
Si Lu and Ruisi Li

Enhancing LSTM Models with Self-attention and Stateful Training
Alexander Katrompas and Vangelis Metsis

Domain Generalization Using Ensemble Learning
Yusuf Mesbah, Youssef Youssry Ibrahim, and Adil Mehood Khan

Research on Text Classification Modeling Strategy Based on Pre-trained Language Model
Yiou Lin, Hang Lei, Xiaoyu Li, and Yu Deng

Discovering Nonlinear Dynamics Through Scientific Machine Learning
Lei Huang, Daniel Vrinceanu, Yunjiao Wang, Nalinda Kulathunga, and Nishath Ranasinghe

Tensor Data Scattering and the Impossibility of Slicing Theorem
Wuming Pan

Scope and Sense of Explainability for AI-Systems
A.-M. Leventi-Peetz, T. Östreich, W. Lennartz, and K. Weber

Use Case Prediction Using Deep Learning
Tinashe Wamambo, Cristina Luca, Arooj Fatima, and Mahdi Maktab-Dar-Oghaz

VAMDLE: Visitor and Asset Management Using Deep Learning and ElasticSearch
Viswanathsingh Seenundun, Balkrishansingh Purmah, and Zahra Mungloo-Dilmohamud

Wind Speed Time Series Prediction with Deep Learning and Data Augmentation
Anibal Flores, Hugo Tito-Chura, and Victor Yana-Mamani
Evaluation for Angular Distortion of Welding Plate
Shigeru Kato, Shunsaku Kume, Takanori Hino, Fujioka Shota, Tomomichi Kagawa, Hironori Kumeno, and Hajime Nobuhara

A Framework for Testing and Evaluation of Operational Performance of Multi-UAV Systems
Mrinmoy Sarkar, Xuyang Yan, Shamila Nateghi, Bruce J. Holmes, Kyriakos G. Vamvoudakis, and Abdollah Homaifar

Addressing Consumer Demands: A Manufacturing Collaboration Process Using Blockchain for Knowledge Representation
Ricardo Barbosa, Ricardo Santos, and Paulo Novais

Cellular Formation Maintenance and Collision Avoidance Using Centroid-Based Point Set Registration in a Swarm of Drones
Jawad N. Yasin, Huma Mahboob, Mohammad-Hashem Haghbayan, Muhammad Mehboob Yasin, and Juha Plosila

The Simulation with New Opinion Dynamics Using Five Adopter Categories
Makoto Fujii and Akira Ishii

Intrinsic Rewards for Reinforcement Learning Within Complex 2D Environments
Nathaniel Grabaskas and Zhizhen Wang

Analysis of Divided Society at the Standpoint of In-Group and Out-Group Using Opinion Dynamics
Nozomi Okano and Akira Ishii

Simulation of Intragroup Alignment Using a New Model of Opinion Dynamics
Nozomi Okano, Hitoshi Yamamoto, Masaru Nishikawa, and Akira Ishii

Random Forest Classification with MapReduce in Holonic Multiagent Systems
Michéle Cullinan and Duncan Coulter

Monitoring Goal Driven Autonomy Agent’s Expectations Generated from Durative Effects
Noah Reifsnyder and Hector Munoz-Avila

Sublinear Regret with Barzilai-Borwein Step Sizes
Iyanuoluwa Emiola

Fluid Dynamics of a Pandemic in a Spatial Social Network: A Reflective Measure of the Spreading
Saad Alqithami
Affective Story-Morphing: Manipulating Shelley’s Frankenstein under Program Control using Emotionally Intelligent Agents
Clark Elliott

Dynamic Strategies and Opponent Hands Estimation for Reinforcement Learning in Gin Rummy Game
Yuexing Hao and Mark Vaysiberg

Wireless Sensor Network Smart Environment for Precision Agriculture: An Agent-Based Architecture
AbdulMutalib Wahaishi and Raafat Aburukba

Autonomy Reconsidered: Towards Developing Multi-agent Systems
Michael A. Goodrich, Julie A. Adams, and Matthias Scheutz

A Real-Time Intelligent Intra-vehicular Temperature Control Framework
Daniel Jacuinde-Alvarez, James Dols, and Shahab Tayeb

Intelligent Control of a Semi-autonomous Assistive Vehicle
David Sanders, Giles Tewkesbury, Malik Haddad, Ya Huang, and Boriana Vatchova

One Shot Learning Approach to Identify Drivers
Malik Haddad, David Sanders, Martin Langner, and Giles Tewkesbury

Facial Recognition Software for Identification of Powered Wheelchair Users
Giles Tewkesbury, Samuel Lifton, Malik Haddad, David Sanders, and Alex Gegov

Intelligent User Interface to Control a Powered Wheelchair Using Infrared Sensors
Malik Haddad, David Sanders, Giles Tewkesbury, Martin Langner, and Sarinova Simandjuntak

A Classification Based Ensemble Pruning Framework with Multi-metric Consideration
Ya-Lin Zhang, Qitao Shi, Meng Li, Xinxing Yang, Longfei Li, and Jun Zhou

Customs Risk Assessment Based on Unsupervised Anomaly Detection Using Autoencoders
Dion T. Oosterman, Wouter H. Langenkamp, and Ellen L. van Bergen

Best Next Preference Prediction Based on LSTM and Multi-level Interactions
Ivett Fuentes, Gonzalo Nápoles, Leticia Arco, and Koen Vanhoof
Achieving Trust in Future Human Interactions with Omnipresent AI: Some Postulates
Peer Sathikh, Zong Rui Dexter Fang, and Guan Yi Tan

A Decentralized Explanatory System for Intelligent Cyber-Physical Systems
Étienne Houzé, Jean-Louis Dessalles, Ada Diaconescu, David Menga, and Mathieu Schumann

Construction Control Organization with Use of Computer and Information Technologies in Context of Sustainable Development Providing
Zalina Ruslanovna Tuskaeva and Zaurbek Valerievich Albegov

Computational Rational Engineering and Development: Synergies and Opportunities
Ramses Sala

QPSetter: An Artificial Intelligence-Based Web Enabled, Personalized Service Application for Educators
Mohammad Ali Kadampur and Sulaiman Al Riyaee

Is It Possible to Recognize a Philosophical Zombie and How to Do It
R. V. Dushkin

Dynamic Analysis of Bitcoin Fluctuations by Means of a Fractal Predictor
Jesús Jaime Moreno Escobar, Oswaldo Morales Matamoros, Ana Lilia Coria Páez, and Ricardo Tejeida Padilla

Are Human Drivers a Liability or an Asset?
David Sanders, Malik Haddad, Giles Tewkesbury, Alex Gegov, and Mo Adda

Negative Emotions Induced by Non-verbal Video Clips
Flavia De Simone, Simona Collina, and Manuela Nuzzo

Automatic Recognition of Key Modulations in Symbolic Musical Pieces Using Information Theory
Michele Della Ventura

Increasing Robustness for Machine Learning Services in Challenging Environments: Limited Resources and No Label Feedback
Lucas Baier, Niklas Kühl, and Jörg Schmitt

Development Support for Intelligent Systems: Test, Evaluation, and Analysis of Microservices
Charline von Perbandt, Matthias Tyca, Arne Koschel, and Irina Astrova
An Analysis with Dynamics Between Human Motivation and Messaging on Social Networking Services
Hidehiro Matsumoto and Akira Ishii

Author Index
Late Fusion of Convolutional Neural Network with Wavelet-Based Ensemble Classifier for Acoustic Scene Classification

Cheng Siong Chin (1, corresponding author) and Jianhua Zhang (2)

1 Faculty of Science, Agriculture, and Engineering, Newcastle University Singapore, Singapore 599493, Singapore
2 School of Information and Control Engineering, Qingdao University of Technology, Qingdao 266525, Shandong, China
Abstract. A log-Mel spectrogram feeding a convolutional neural network (CNN) and wavelet time scattering feeding an ensemble of subspace discriminant classifiers are used to classify acoustic scenes containing human speech. The Tampere University of Technology (TUT) Acoustic Scenes dataset is used to demonstrate the feasibility of the proposed model. Comparisons are made with the baseline model on the TUT 2017 dataset used for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 Challenge, Task 1. The fused model achieves an acoustic classification accuracy of 79.43%. The proposed late fusion of the CNN and the ensemble classifiers is 18.4 percentage points more accurate than the baseline model with just CNN.

Keywords: Acoustic scene classification · Time scattering · Acoustic classification accuracy · Convolutional Neural Network · Wavelet multi-model late fusion system
1 Introduction

Acoustic scene classification (ASC) [1–4] classifies audio signals into a pre-selected list of scene types such as car parks, parks, and meeting rooms. The problem resembles speech recognition; the main difference is that the target classes are more diverse. There are various applications of ASC. For example, a mobile device can use acoustic event recognition to detect that an individual is in a meeting and switch itself to silent mode automatically. ASC has been used in robots [5, 6], mobile devices [7–9], traffic [10, 11], and medical systems [12, 13]. One of the standard scientific challenges in ASC is the Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge. The primary goal of ASC is to obtain the best acoustic classification accuracy when assigning audio recordings to the environment in which they were recorded.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 1–11, 2022. https://doi.org/10.1007/978-3-030-82193-7_1
Many ASC systems have used Convolutional Neural Networks (CNN) [14], Recurrent Neural Networks (RNN) [15], Support Vector Machines (SVM) [16], Gaussian Mixture Models [17], and Multilayer Perceptrons [18, 19]. Recurrent network architectures such as convolutional recurrent neural networks (CRNN), bidirectional Long Short-Term Memory (Bi-LSTM), and LSTM [20] were also used. However, LSTM inherently suffers from vanishing and exploding gradient problems. As observed in the DCASE Challenge, the best-performing systems used CNN. An ensemble of neural networks [21] and ensemble classifiers [22] were also used. The former approach, using CNN, has outperformed other approaches to the ASC task [23–25]. The latter has also demonstrated good acoustic classification accuracy with a shorter computation time than CNN. To improve generalization, Mel-frequency cepstral coefficients (MFCCs) [26] and other signal representations such as the Constant Q Transform (CQT) [27] and wavelet time scattering [28] have been used to extract the acoustic features of the raw data. Multiple spectrograms [29, 30] such as MFCC, short-time Fourier transform (STFT), and CQT were also utilized to increase the number of features for training. This has shown positive results, as more time-frequency characteristics can be extracted. However, the computation time increases as more features have to be processed.

In this paper, a multi-model late fusion is used for ASC. CNN is a reasonable choice: it is given a time-frequency representation of the audio to capture the spectro-temporal modulation patterns that identify various acoustic scenes. The time and frequency axes of this representation correspond to the width and height dimensions of the convolutional filters, respectively. To reduce overfitting in the CNN, the Mixup algorithm [31] is used. The original and mixed datasets are combined to train a CNN [14] using the log-Mel spectrograms. The CNN is complemented by an ensemble of random subspace discriminant classifiers using wavelet scattering [28]. The Tampere University of Technology (TUT) 2017 dataset [32] used for the DCASE 2017 Challenge, Task 1 is used for both training and evaluation.
2 Proposed Methodology

There are 4680 and 1620 labeled audio files for training and evaluation, respectively. The original TUT-2017 recordings were obtained from different environmental scenes at different recording locations, some with human speech. The audio recording at each site is no longer than 5 min. The original recordings are split into 10 s segments, and each segment is stored as a separate sound file. The details of the outdoor (open or enclosed areas) and indoor acoustic scenes can be found in the TUT-2017 dataset [32]. The 15 acoustic scenes are as follows.

• bus
• forest path
• home
• city center
• cafe
• lakeside beach (outdoor)
• library (indoor)
• car
• grocery store
• urban park (outdoor)
• office: multiple persons (indoor)
• metro station (indoor)
• train (traveling, vehicle)
• residential area (outdoor)
• tram (traveling, vehicle)
The recordings were made in different streets, homes, and parks [32]. They were captured with different devices at 24-bit resolution and a 44100 Hz sampling rate. The microphones were worn during recording.

2.1 Pre-processing and Feature Extraction

The pre-processing steps for the log-scale Mel-spectrogram are briefly described below.

• The acoustic signal is sampled at 44100 Hz. The audio clips are then normalized.
• The audio is converted to mid-side encoded [14] data to obtain good spatial information for the CNN to detect moving sources.
• The signal is then divided into 1 s segments with an overlap of 0.5 s. This makes the network easier to train and reduces overfitting to certain acoustic events. The overlap also increases the amount of data available for the subsequent data augmentation.
• A short-time Fourier transform with a window size of 2048 samples and a hop size of 1024 samples is used, so consecutive windows overlap by 1024 samples. The spectrogram has 128 Mel bands. The Mel-spectrogram is then converted to a logarithmic scale.
• The log-Mel spectrogram data are reshaped before being used as images for training the CNN. The first two dimensions are the height and width of the image, followed by the channels and the segments. For example, the size is 128 × 42 × 2 × 19.
• The training labels are replicated to correspond with the 19 segments.
• The dataset is augmented via Mixup [31], which mixes the features of two different classes in equal proportion, as shown.

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j \tag{1}$$

$$\tilde{y} = \lambda y_i + (1 - \lambda) y_j \tag{2}$$
where $x_i$ and $x_j$ are samples from dissimilar classes, and $y_i$ and $y_j$ denote the corresponding class labels. A mixing value of $\lambda = 0.5$ is used.

2.2 Convolutional Neural Network

The CNN's architecture can be seen in Table 1. Batch Normalization (BN) and the rectified linear unit (ReLU) [33] are used. The ReLU increases the non-linearity in the images. Batch normalization is used as a regularizer to prevent overfitting. The activation function and BN are located before the convolution layer to improve the acoustic classification accuracy. The max-pooling layers come after the convolution process. The feature map that includes a prominent feature is obtained from the output of the max-pooling layer. The average pooling reduces the activation by combining the non-maximal activations. The last few layers include a dropout layer that removes 50% of the visible and hidden units to reduce overfitting. The fully connected layer combines the features to form the input to the second-to-last layer, which uses the softmax activation function to obtain the probabilities of the input belonging to each of the 15 classes. Finally, the classification layer produces the final classification.

Table 1. CNN architecture.

Description of each layer
imageInputLayer - 128 × 42 × 2
batchNormalizationLayer
convolution2dLayer - 32 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
convolution2dLayer - 32 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
maxPooling2dLayer - pool size 3 × 3, stride 2 × 2 and zero padding
convolution2dLayer - 32 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
convolution2dLayer - 32 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
maxPooling2dLayer - pool size 3 × 3, stride 2 × 2 and zero padding
convolution2dLayer - 128 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
convolution2dLayer - 128 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
maxPooling2dLayer - pool size 3 × 3, stride 2 × 2 and zero padding
convolution2dLayer - 256 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
convolution2dLayer - 256 filters (3 × 3) and zero padding
batchNormalizationLayer
reluLayer
averagePooling2dLayer - pool size 16 × 6
dropoutLayer(0.5)
fullyConnectedLayer(15)
softmaxLayer
classificationLayer
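The Mixup rule of Eqs. (1) and (2) and the tensor sizes quoted in Sect. 2.1 and Table 1 can be cross-checked with a few lines of Python. This is only a sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, frames are assumed to lie fully inside the signal (no edge padding), and "zero padding" in the pooling layers is assumed to mean same-style padding, so that the output size is ceil(input / stride).

```python
import math

def mixup(x_i, y_i, x_j, y_j, lam=0.5):
    """Mix two samples and their one-hot labels, as in Eqs. (1) and (2)."""
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix

def count_frames(n_samples, window, hop):
    """Number of full windows of length `window` taken every `hop` samples
    (assumption: no padding at the signal edges)."""
    return (n_samples - window) // hop + 1

def same_pool(size, stride=2):
    """Output length of a pooling layer, assuming same-style padding."""
    return math.ceil(size / stride)

fs = 44100                                     # sampling rate in Hz
segments = count_frames(10 * fs, fs, fs // 2)  # 1 s segments, 0.5 s hop, 10 s clip
frames = count_frames(fs, 2048, 1024)          # STFT frames per 1 s segment

height, width = 128, frames                    # Mel bands x frames
for _ in range(3):                             # three max-pooling layers in Table 1
    height, width = same_pool(height), same_pool(width)

# Toy two-feature samples from two classes (hypothetical values).
x_mix, y_mix = mixup([1.0, 3.0], [1, 0], [3.0, 5.0], [0, 1])

print(segments, frames, (height, width))
print(x_mix, y_mix)
```

Under these assumptions the sketch gives 19 segments and 42 frames, in line with the 128 × 42 × 2 × 19 input, and the three pooling stages leave a 16 × 6 map, which matches the 16 × 6 average-pooling size in Table 1.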
2.3 Wavelet Scattering

The next step involves feature extraction using wavelet scattering for the subsequent ensemble classifiers. Wavelet scattering provides a good representation [28] of the time-frequency content of a signal. The first- and second-order coefficients are used, as they capture most of the signal energy. The transform uses two filter banks of 1D Morlet wavelets: the first filter bank has a resolution of Q1 = 4, and the second filter bank has a resolution of Q2 = 1. The averaging filter (or invariance scale) has a duration of 0.75 s, which sets the duration of the modulation structure. The sampling frequency is 44100 Hz.

2.4 Ensemble Classifiers

The proposed ensemble classifiers include different discriminant analysis learners, such as linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and regularized linear discriminant analysis (RDA), with other treatments of the predictor covariance. The random subspace learning method is used to increase the acoustic classification accuracy. In the random subspace method, the feature subspaces are chosen randomly from the original feature space. The final prediction of these individual classifiers is then obtained using majority voting.

2.5 Fusion of CNN and Classifiers

The prediction results of the CNN and the ensemble classifiers indicate the relative confidence of their predictions. Multiplying the responses and determining the maximum response creates a late fusion system that inherits the merits of each method:

$$\mathrm{class\_pred}_i = \arg\max\left(\mathrm{prob}_i^{\mathrm{CNN}},\ \mathrm{prob}_i^{\mathrm{ensem\_class}}\right) \tag{3}$$

where $\mathrm{prob}_i^{\mathrm{CNN}}$ and $\mathrm{prob}_i^{\mathrm{ensem\_class}}$ are the probabilities of sound recording $i$ from the CNN and the ensemble classifiers, respectively.
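The fusion rule can be sketched as follows. The sketch assumes, following the text of Sect. 2.5, that the two probability vectors are multiplied element-wise before the maximum is taken. The function name and the three-class probabilities are hypothetical illustration values, not results from the paper.

```python
def late_fusion(prob_cnn, prob_ens):
    """Return (predicted class index, fused scores) for one recording.

    Assumption: the fused score of each class is the product of the two
    class probabilities, and the class with the largest product wins.
    """
    if len(prob_cnn) != len(prob_ens):
        raise ValueError("both models must score the same classes")
    fused = [p * q for p, q in zip(prob_cnn, prob_ens)]
    best = max(range(len(fused)), key=lambda k: fused[k])
    return best, fused

# Hypothetical probabilities over three classes for a single recording.
prob_cnn = [0.6, 0.3, 0.1]
prob_ens = [0.2, 0.7, 0.1]

pred, fused = late_fusion(prob_cnn, prob_ens)
print(pred)
```

In this toy case the CNN alone would pick class 0 and the ensemble alone class 1. The product selects class 1, because the ensemble's confidence there outweighs the CNN's preference for class 0.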
3 Results and Discussion

The configurations of the proposed model are as follows.

• Stochastic gradient descent with momentum optimizer with a learning rate: 0.05
• Size of the mini-batch for each training iteration: 128
• Momentum: 0.9
• Maximum number of epochs: 8
• Factor for L2 regularization: 0.005
• Number of epochs for dropping the learning rate: 2
• Multiplicative factor applied to the learning rate at each drop: 0.2
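The last two settings describe a step-wise learning-rate decay. A minimal sketch, assuming a piecewise schedule in which the rate is multiplied by 0.2 after every 2 epochs (the function name and the 1-based epoch indexing are illustrative assumptions):

```python
def learning_rate(epoch, initial=0.05, drop_period=2, drop_factor=0.2):
    """Learning rate for a 1-based epoch under a piecewise (step) schedule."""
    n_drops = (epoch - 1) // drop_period   # completed drop periods so far
    return initial * drop_factor ** n_drops

# Rates for the 8 training epochs, rounded to hide floating-point noise.
schedule = [round(learning_rate(e), 6) for e in range(1, 9)]
print(schedule)
```

Under this assumption the rate falls from 0.05 in the first two epochs to 0.0004 in the last two.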
The training data are shuffled before each training epoch. The entire experiment, including the pre-processing, took no more than three hours. The short audio segments (see Fig. 1) provide less information, thus making ASC difficult. A sample segment of the Mel-spectrogram extracted from an audio clip of the "lakeside beach" scene is shown in Fig. 1. Frequency is displayed along the y-axis, time along the x-axis, and the signal’s energy at a particular time and frequency is shown by the color map. An Intel® Core i7 CPU and a GeForce RTX 2060 GPU were used.
Fig. 1. Example of segments of Mel-spectrogram from the lakeside beach scene.
The acoustic classification accuracy can be seen in Table 2. On average, the ensemble classifiers have a higher acoustic classification accuracy than the CNN. Compared to the baseline model (which consists of 2 layers × 50 hidden units with 20% dropout), the fused model exhibits an average accuracy that is 18.4 percentage points higher. The details of the baseline model can be found in reference [32].

Table 2. Acoustic classification accuracy (%) of different models.

| Scene | Baseline model [32] | CNN model | Ensemble classifiers model | Fused model |
|---|---|---|---|---|
| Beach | 40.7 | 73.1 | 37.9 | 50.9 |
| Bus | 38.9 | 58.3 | 92.5 | 87.9 |
| Cafe/restaurant | 43.5 | 74.0 | 82.4 | 82.4 |
| Car | 64.8 | 100 | 76.8 | 88.8 |
| City-center | 79.6 | 88.8 | 91.6 | 93.5 |
| Forest path | 85.2 | 97.2 | 96.2 | 98.1 |
| Grocery store | 49.1 | 70.3 | 79.6 | 79.6 |
| Home | 76.9 | 89.8 | 76.8 | 91.6 |
| Library | 30.6 | 49.0 | 36.1 | 40.7 |
| Metro station | 93.5 | 100 | 95.3 | 100 |
| Office | 73.1 | 80.5 | 83.3 | 84.2 |
| Park | 32.4 | 20.3 | 68.5 | 60.1 |
| Residential area | 77.8 | 63.8 | 77.7 | 81.4 |
| Train | 72.2 | 76.8 | 85.1 | 82.4 |
| Tram | 57.4 | 57.4 | 64.8 | 69.4 |
| Average | 61.0 | 73.3 | 76.3 | 79.4 |
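The averages in Table 2 and the reported gain over the baseline can be checked with a few lines of arithmetic. The sketch below copies the per-scene values from the table; the variable names are illustrative only.

```python
# Per-scene accuracies (%) copied from Table 2, in the table's row order.
TABLE_2 = {
    "baseline": [40.7, 38.9, 43.5, 64.8, 79.6, 85.2, 49.1, 76.9,
                 30.6, 93.5, 73.1, 32.4, 77.8, 72.2, 57.4],
    "cnn":      [73.1, 58.3, 74.0, 100, 88.8, 97.2, 70.3, 89.8,
                 49.0, 100, 80.5, 20.3, 63.8, 76.8, 57.4],
    "ensemble": [37.9, 92.5, 82.4, 76.8, 91.6, 96.2, 79.6, 76.8,
                 36.1, 95.3, 83.3, 68.5, 77.7, 85.1, 64.8],
    "fused":    [50.9, 87.9, 82.4, 88.8, 93.5, 98.1, 79.6, 91.6,
                 40.7, 100, 84.2, 60.1, 81.4, 82.4, 69.4],
}


def column_mean(values):
    """Unweighted mean over the 15 scenes."""
    return sum(values) / len(values)


# Averages rounded to one decimal, as printed in the table.
averages = {name: round(column_mean(v), 1) for name, v in TABLE_2.items()}
# Gain of the fused model over the baseline, in percentage points.
gain = round(column_mean(TABLE_2["fused"]) - column_mean(TABLE_2["baseline"]), 1)
```

The recomputed column means agree with the "Average" row, and the difference between the fused and baseline means corresponds to the 18.4 figure quoted in the text.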
Although the result for the beach scene using the ensemble classifiers (37.96%) is quite poor compared with the CNN (73.14%), the fused model raises the acoustic classification accuracy to 50.92%. Conversely, the accuracy for the park scene using the CNN model is low compared with the ensemble classifiers, and the fused model increases it to 60.18%. The confusion matrices of the CNN, the ensemble classifiers, and the fused model are shown in Fig. 2. The confusion chart of the multi-model late fusion system shows better acoustic classification accuracy for city-center, forest path, and metro station than for other scenes. The average acoustic classification accuracy of the fused model is 79.43%. The false-negative rate for the residential area is around 51.6%, with a false discovery rate of 18.5%. The false-negative rate is quite negligible for the classes.
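The false-negative rate and false discovery rate quoted above are per-class quantities read from a confusion chart. The following sketch shows how they are commonly computed from a confusion matrix; the helper `per_class_rates` and the 3-class toy matrix are hypothetical and do not reproduce the paper's data.

```python
def per_class_rates(confusion, cls):
    """Return (false-negative rate, false discovery rate) for one class.

    confusion[i][j] is the number of recordings of true class i
    that were predicted as class j.
    """
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                  # row total minus diagonal
    fp = sum(row[cls] for row in confusion) - tp   # column total minus diagonal
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    return fnr, fdr


# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
toy = [
    [8, 2, 0],
    [1, 9, 0],
    [1, 0, 9],
]
fnr0, fdr0 = per_class_rates(toy, 0)
fnr1, fdr1 = per_class_rates(toy, 1)
```

The false-negative rate is one minus the per-class accuracy (recall) reported in Table 2, while the false discovery rate depends on how often other scenes are mistaken for the class in question.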
Fig. 2. Confusion charts of CNN (top), Ensemble classifiers (Middle), and Multi-model late fusion model (Bottom).
4 Conclusion
A multi-model late fusion system was proposed, consisting of a convolutional neural network fed with log-Mel spectrograms and an ensemble of subspace discriminant classifiers fed with wavelet time scattering features. Acoustic scene classification aims to classify acoustic scenes in different environments such as the park, car park, beach, city center, etc. Based on the TUT Acoustic Scenes dataset, it was demonstrated that the fused model gives a good acoustic classification accuracy of 79.43%. The proposed multi-model late fusion system exhibits an acoustic classification accuracy 18.4 percentage points higher than the baseline model, despite relatively low performance in a few scenes such as the beach and library. Nevertheless, the multi-model late fusion system shows good acoustic classification accuracy for most of the scenes. In future work, adaptive hyperparameter tuning and advanced feature extraction methods will be used to improve the performance further.
References
1. Mesaros, A., et al.: Detection and classification of acoustic scenes and events: outcome of the DCASE 2016 challenge. IEEE/ACM Trans. Audio Speech Lang. Process. 26(2), 379–393 (2018)
2. Mesaros, A., Diment, A., Elizalde, B., Heittola, T., Vincent, E., Raj, B., Virtanen, T.: Sound event detection in the DCASE 2017 challenge. IEEE/ACM Trans. Audio Speech Lang. Process. 27(6), 992–1006 (2019)
3. Rakotomamonjy, A.: Supervised representation learning for audio scene classification. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1253–1265 (2017)
4. Trowitzsch, I., Mohr, J., Kashef, Y., Obermayer, K.: Robust detection of environmental sounds in binaural auditory scenes. IEEE/ACM Trans. Audio Speech Lang. Process. 25(6), 1344–1356 (2017)
5. Ribeiro, P.O.C.S., et al.: Underwater place recognition in unknown environments with triplet based acoustic image retrieval. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), Orlando, FL, pp. 524–529 (2018)
6. Aziz, S., Awais, M., Akram, T., Khan, U., Alhussein, M., Aurangzeb, K.: Automatic scene recognition through acoustic classification for behavioral robotics. Electronics 8(5), 483–500 (2019)
7. Kojima, R., Sugiyama, O., Hoshiba, K., Suzuki, R., Nakadai, K.: HARK-Bird-Box: a portable real-time bird song scene analysis system. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, pp. 2497–2502 (2018)
8. Xu, X., Yu, J., Chen, Y., Zhu, Y., Qian, S., Li, M.: Leveraging audio signals for early recognition of inattentive driving with smartphones. IEEE Trans. Mob. Comput. 17(7), 1553–1567 (2018)
9. Song, X., Wang, M., Qiu, H., Li, K., Ang, C.: Auditory scene analysis-based feature extraction for indoor subarea localization using smartphones. IEEE Sens. J. 19(15), 6309–6316 (2019)
10. Jiang, D., et al.: An audio data representation for traffic acoustic scene recognition. IEEE Access 8, 177863–177873 (2020)
11. Wang, L., Roggen, D.: Sound-based transportation mode recognition with smartphones. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, United Kingdom, pp. 930–934 (2019)
12. Li, Y., Chen, F., Sun, Z., Ji, J., Jia, W., Wang, Z.: A smart binaural hearing aid architecture leveraging a smartphone APP with deep-learning speech enhancement. IEEE Access 8, 56798–56810 (2020)
13. Vivek, V.S., Vidhya, S., Madhanmohan, P.: Acoustic scene classification in hearing aid using deep learning. In: 2020 International Conference on Communication and Signal Processing (ICCSP), Chennai, India, pp. 0695–0699 (2020)
14. Han, Y., Park, J., Lee, K.: Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany (2017)
15. Pham, L., Doan, T., Ngo, D.T., Nguyen, H., Kha, H.H.: CDNN-CRNN joined model for acoustic scene classification. In: Detection and Classification of Acoustic Scenes and Events 2019 (DCASE2019) Challenge, Technical Report (2019)
16. Jimenez, A., Elizalde, B., Raj, B.: DCASE 2017 Task 1: acoustic scene classification using shift-invariant kernels and random features. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany (2017)
17. Fraile, R., Reina, J.C., Arriola, J.G., Blanco, E.: Classification of acoustic scenes based on modulation spectra and position-pitch maps. In: Detection and Classification of Acoustic Scenes and Events 2019 (DCASE2019) Challenge, Technical Report (2019)
18. Bilot, V., Duong, N.Q.K., Ozerov, A.: Acoustic scene classification with multiple instance learning and fusion. In: Detection and Classification of Acoustic Scenes and Events 2019 (DCASE2019) Challenge, Technical Report (2019)
19. Foleiss, J., Tavares, T.: MLP-based feature learning for automatic acoustic scene classification. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany (2017)
20. Hao, W., Zhao, L., Zhang, Q., Zhao, H., Wang, J.: DCASE 2018 Task 1A: acoustic scene classification by Bi-LSTM-CNN-net multichannel fusion. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Surrey, UK (2018)
21. Sakashita, Y., Aono, M.: Acoustic scene classification by ensemble of spectrograms based on adaptive temporal divisions. In: Proceedings DCASE2018, Woking, Surrey, UK (2018)
22. Maka, T.: Auditory scene classification using ensemble learning with small audio feature space. In: Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018), Technical Report (2018)
23. Vafeiadis, A., et al.: Acoustic scene classification: from a hybrid classifier to deep learning. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), Munich, Germany (2017)
24. Zhang, T., Liang, J., Ding, B.: Acoustic scene classification using deep CNN with fine-resolution feature. Expert Syst. Appl. 143, 34 (2020)
25. Valenti, M., Squartini, S., Diment, A., Parascandolo, G., Virtanen, T.A.: Convolutional neural network approach for acoustic scene classification. In: Proceedings IJCNN, Anchorage, Alaska, pp. 1547–1554 (2017)
26. Ghodasara, V., Waldekar, S., Paul, D., Saha, G.: Acoustic scene classification using block based MFCC features. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), Budapest, Hungary (2016)
27. Hong, L.: Acoustic scene classification using Mel-spectrum and CQT based neural network ensemble. In: Detection and Classification of Acoustic Scenes and Events 2020 (DCASE2020) Challenge, Technical Report (2020)
28. Chin, C.S., Kek, X.Y., Chan, T.K.: Wavelet scattering based gated recurrent units for binaural acoustic scenes classification. In: 2020 International Conference on Internet of Things and Intelligent Applications (ITIA), pp. 1–5 (2020)
29. Zheng, W., Mo, Z., Xing, X., Zhao, G.: CNNs-based acoustic scene classification using multi-spectrogram fusion and label expansions. CoRR abs/1809.01543 (2018)
30. Zheng, W., Yi, J., Xing, X., Liu, X., Peng, S.: Acoustic scene classification using deep convolutional neural network and multiple spectrograms fusion. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop, Munich, Germany (2017)
31. Huszár, F.: Mixup: data-dependent data augmentation. inFERENCe (2017). https://www.inference.vc/mixup-data-dependent-data-augmentation/. Accessed 15 Jan 2019
32. Mesaros, A., et al.: DCASE 2017 challenge setup: tasks, datasets and baseline system. In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), pp. 85–92 (2017)
33. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by reducing internal covariate shift. In: International Conference on Machine Learning, pp. 448–456 (2015)
Deep Learning and Social Media for Managing Disaster: Survey
Zair Bouzidi1(B), Abdelmalek Boudries2, and Mourad Amad1
1 LIMPAF Laboratory, Computer Science Department, Science and Applied Science Faculty, Bouira University, Bouira, Algeria
2 Laboratory LMA, Commercial Science Department, Faculty of Economics, Business and Management, Bejaia University, Béjaïa, Algeria
Abstract. The broad reach of social networks enables individuals to exchange information in real time. This active involvement of society plays a major role in reducing disaster risk and in relieving at-risk populations. Any crisis management operation needs accurate information to enable a rapid response and decrease the potential loss of life, yet the timely retrieval of information from the various regions of a disaster-affected area is a demanding task. The effectiveness of a relief and response method depends largely on a prompt and accurate assessment of the disaster situation. This knowledge is primarily collected on site by first responders and can be updated later on. Several techniques have been built to automate this task through the extraction and analysis of appropriate content from social media. These approaches are not, however, well incorporated into the relief mechanism. To advance further, it is important to review them.
Keywords: Alert · Assessment · Awareness · Collaboration · Crowdsourcing · Deep learning · Disaster management models · Neural learning · Social networks · Relevant information
1 Introduction
Crisis management is receiving increasing attention from various research disciplines [1]. Scientists have played a key role in developing ways to handle and analyze the data created in catastrophe management situations, especially first-hand information from social networks. In this paper, we survey and organize existing data management and analysis expertise for emergency situations, and present open issues and future research directions. The paper results from a detailed bibliographic survey and from our hands-on experience in developing an Environment of Automated Learning [2–4] for managing emergencies in the LIMPAF laboratory and for improving corporate marketing, business strategies, fraud detection and financial time series prediction [5]. This is a survey of emergency management applications using social networks. It provides a taxonomy of the characteristics of catastrophe management models, of social networking contributions and of the algorithms for extracting relevant content from social networks, from statistical algorithms to automated learning, and gives the reader knowledge of existing and emerging trends in crisis management application research and of areas of focus for researchers. In addition, it raises problems in implementing catastrophe management, and compares contributions, results and criticisms of the methods.
Most social media content exchanged during emergencies communicates timely, actionable data. However, analyzing social networking content to acquire such data involves solving many problems, including parsing short and informal messages, handling information overload, and prioritizing the various types of information found in messages. Classical information processing tasks such as filtering, classification, rating, aggregating, extracting, and summarizing can be mapped to these challenges. We discuss the state of the art in emergency management models and their different stages, in social networking, and in the various methods of retrieving relevant content to process information from social networks, and illustrate both their contributions and shortcomings. In addition, we analyze their details and methodically examine a set of key subproblems, ranging from the identification of events to the development of useful and actionable summaries.
The rest of the paper is set out as follows. Section 2 presents the background and recent surveys. Section 3 introduces different crisis management models. In Sect. 4, we show how content can be gathered from websites and all social media. Section 5 presents the different techniques used for retrieving relevant information from social media, followed by a discussion explaining and classifying the different deep learning architectures. Finally, we conclude and give some future work.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 12–30, 2022. https://doi.org/10.1007/978-3-030-82193-7_2
2 Background and Related Works
By analyzing and classifying the recent reviews, we studied the concepts of catastrophes and all emergency management models.
2.1 Recent Surveys
Table 1 displays a classification of the latest disaster management surveys. There are many information system surveys, particularly in the area of integrated communication [6–8, 10]. Others address artificial intelligence, especially automated learning, machine learning and deep learning [1, 8, 9, 11–14], and collaboration, through volunteered geographic information quality assessment methods [8, 15–18], but even more studies address social media [1, 11–13, 19–22]. Risk assessment and mitigation have been discussed [23–25] for street floods, perceptions of environmental risk and areas of disaster preparedness. Surveys on Big Data [14, 26], crowdsourcing [27, 28] and crowdtasking [28], on the other hand, were restricted to natural disasters. Some studies also touched on disaster education [29, 30], on forecasting [31, 32], only in the field of forest fire danger prediction, and on post-disaster coordination and response [12, 21], in the case of super-cyclone Amphan. Recent research on situational awareness and damage assessment [29] is also available, only in the field of thermal agent disasters and fire disasters. All of these studies dealt only
with work performed on Twitter alone. Only our paper reviews applications from all data sources (Twitter, Facebook, Instagram, and so on).

Table 1. Classification of recent disaster management surveys

| DM tasks | Fields | Surveys |
|---|---|---|
| Information System | Integrated Communication | [6–8, 10] |
| Artificial Intelligence | Automated Learning, Machine Learning, Deep Learning | [1, 8, 9, 11–14] |
| Social Media | / | [1, 11–13, 19–22] |
| Big Data | Natural Disaster | [14, 26] |
| Collaboration | Volunteered geographic information quality assessment methods | [8, 15–18] |
| Crowdsourcing | Natural Disaster | [27, 28] |
| Crowdtasking | Natural Disaster | [28] |
| Education | / | [29, 30] |
| Forecasting | Forest Fire Danger Prediction | [34, 35] |
| Risk assessment/reduction | Street floods, Environmental Risk Perceptions and Disaster Preparedness | [23–25] |
| Situational Awareness | Thermal agent disaster and fire disaster | [29] |
| Damage Assessment | Thermal agent disaster and fire disaster | [29] |
| Post-Disaster Coordination and Response | Super-cyclone Amphan | [12, 21] |
2.2 Disaster
Disasters such as earthquakes, floods, fires, terrorist attacks and tsunamis result in severe human suffering, property destruction and other adverse effects. In addition to these disasters, several anthropogenic disasters have arisen over the past two decades, primarily due to globalization, interconnected networks and substantial technological growth (see Table 2). Anthropogenic disasters include product forgery, biological risks, terrorism and ecological terrorism [1, 34]. In recent years, the planet has undergone several of the most significant natural and/or anthropogenic catastrophes of all time. They arise from biological, geological, seismic, hydrological or other natural processes, such as cyclones, earthquakes, tsunamis, floods, forest fires, landslides, sandstorms, volcanic eruptions, hydro-meteorological paroxysms (exceptional precipitation) and pandemics (such as the coronavirus pandemic, Covid-19, and its variants) [32], or from human processes. From these cases, we find 160,000 deaths and 60 million injured in 27 years (from 1980 to 2017), and, in the five years from 2012 to 2017 alone, serious damage with 32,454 dead, 3,355 injured, 6,639 missing, more than 83,000 hectares burned, 350 homes destroyed and other significant damage.

Table 2. Latest catastrophic events

| No | Catastrophic event | Period | Damage |
|---|---|---|---|
| 1 | Forest Fire, Haiti | Oct 2007 | 230,000, 220,000 |
| 2 | Earthquake, California | Jan 2010 | 203 deaths, 6,152.9 km² of ravaged land |
| 3 | Floods, Thailand | Jul 2011 | 815 deaths |
| 4 | Tsunami earthquake, Japan | Apr 2011 | 15,896 deaths, 6,157 injuries, 2,537 missing |
| 5 | Hurricane Sandy, USA | Oct 2012 | 220 deaths |
| 6 | Typhoon Haiyan, Philippines | Nov 2013 | 26,626 injuries |
| 7 | Flood of the Elbe, Germany | Jun 2013 | 25 deaths |
| 8 | Subway bombing, Russia | Apr 2017 | 15 deaths, 50 injuries |
| 9 | Suicide bombing, England | May 2017 | 22 deaths, 116 injuries |
| 10 | Three explosions, Indonesia | May 2018 | 9 deaths, 40 injuries |
| 11 | Japan Floods, Japan | Jun 2018 | 235 deaths, 13 missing |
| 12 | Indonesian earthquake | Sep 2018 | 2,000 deaths, 1.5 million injuries |
| 13 | Earthquake Fire, Haiti | Oct 2018 | 18 deaths, 548 injuries |
| 14 | Terrorist Attack, Strasbourg | Dec 2018 | 5 deaths, 10 injuries |
| 15 | Kivu Ebola epidemic, Congo | Aug 2018–Jun 2020 | 14,739,450 affected, 1,162 healed, 2,299 deaths |
| 16 | Coronavirus Pandemic COVID-19 | Jan 21st–Jul 23rd, 2020 | 14,739,450 affected, 8,332,461 healed, 610,776 deaths |
Recent threats from these disasters have reaffirmed the urgency and significance of loss estimation and the need for decision support resources. To fulfill these needs and requirements, various models for disaster management [2–5, 29, 34–37] have been studied, designed and developed. The major disaster management activities include hazard evaluation, risk management, mitigation, preparedness, response and recovery.
When disaster strikes, people seek information and ways [1–4, 35] to provide data and assist others. Disasters inspire altruism: individuals support those who are in distress or suffering from the disaster. Information on the protection of people and goods, and on sources of aid, is among the most common forms of online assistance in the event of a disaster. A catastrophe is defined [34] as a complex problem that must be addressed using a multidimensional and multiplatform framework to collect information. It is characterized as a severe disruption to the functioning of society [22] involving extensive human, material or environmental losses. There are two main types of disasters: simple, where the structure of the community remains intact, and composed, where the structure and function of the community are disrupted. Catastrophes are generally fast-paced events. Slow and chronic social disruptions [35], however, are also important to theorize as catastrophes because they can have a greater effect than rapidly caused disasters.
2.3 Disaster Management
The disaster recovery cycle includes the following stages: warning, planning, action, prevention, mitigation and restoration (see Fig. 1). In catastrophe management, there are at least six key elements [34]: prevention, mitigation, planning, response (relief), restoration and reconstruction. The emergency management process, however, is defined in four phases: mitigation (before the disaster) [45], preparedness (before the disaster) [22], response (during the disaster) [45] and recovery (after the disaster) [22, 34].
Fig. 1. Disaster management cycles.
3 Disaster Management Models
In this conceptual and theoretical chaos, discrepancies and variations between different models of disaster management have resulted in complications, while the scope of disaster management calls for templates to be used [22, 34]. A well-formed typology [34, 36, 37] can be very useful in maintaining discipline and eliminating complications in a chaotic environment.
There are various disaster management models, namely the Classical Model, the Computer Model and the Social Networking Model, which are different but complementary.
The Classical Disaster Management Model is based on preventative measures that can reduce seismic risk: first, building citizens' knowledge by teaching them the attitude to adopt before, during and after an earthquake; then reducing the seismic vulnerability of buildings, which can limit the damage; without forgetting cooperation between all the volunteers (solidarity action). Part of the folk wisdom of disaster management is to use personal familiarity, and not just institutional contact, to facilitate communication and collaboration. Collaboration is an essential foundation, which is developing into a more complex and versatile network model [46] that promotes multi-organizational cooperation (see Fig. 2). Because of the use of volunteerism and community participation [46], collaboration has always been a required skill. Volunteers provide community services with leading-edge capability and connections. Organizational and individual volunteer mobilization often serves a social and psychological purpose: to bring people together and give them a sense of effectiveness. More adaptive leadership enables organizational learning and makes adaptation and improvisation easier. Coordination is strengthened by regular engagement, including involvement in planning and training exercises. Channels of communication developed during the mitigation process serve as a basis [42, 47] for meaningful coordination and contribute to improving resilience and cooperation, playing a big role in risk reduction. However, a large body of social and behavioral research identifies coordination as a significant obstacle for the people, associations and organizations responding to disasters [47].
Relevant information is transmitted and/or exchanged in the form of daily updates, such as accurate and timely warnings (for instance weather updates, traffic alerts and news), rather than as a warning about a disaster. These types of data help [47] to keep people aware of their climate. Communication between community members remains important in all phases of disaster management. During the mitigation process, they interact with each other either to keep in contact or to help each other with planning [35], knowing that there is evidence that the affected local communities and local authorities are best placed to respond immediately. Individuals actively seek emotional support from the media [35], which is provided to isolated members of the group.
In the Computer Model, damage evaluation is one of the main criteria for understanding the situation, in order to assess the nature of the devastation and to prepare the relief accordingly. It is important to integrate humanitarian principles [48] into the design requirements of an information system. First of all, the system must promote the development of disaster management skills through, for instance, disaster education or simulations. Education plays a critical role in motivating community members to improve their disaster management skills [4]. Evacuation exercises are also performed in schools, industries and neighborhoods. There are also Games Based-Evacuation Drill (GBED) [33] evacuation drills that use a Motion Hazard Map (MHM) on a tablet with a GPS receiver and on smart devices (for example, tablets and smart glasses), and games with virtual children [49], in which adults (i.e., HMD wearers) provide them with sufficient evacuation instructions according to the virtual disaster situation. There are also game-based evacuation exercises (GBED) offered as Disaster Education Based Services [4,
50], such as Paradigmatic Tourism integrating Games Based-Evacuation Drill (GBED). Black Tourism is a place-based form of disaster education, like Penumbral Tourism [51], which uses disaster simulation in the real world. A disaster fantasy game based on Tangible Bits [52] and a tower defense game [53] improve players' ability to prepare themselves for floods. Other platforms let users evacuate a three-dimensional (3D) virtual world [54], or rely on immersive virtual reality environments [55], Head-Mounted Displays (HMD) and other devices [56]; Geo-fencing MRG [57], for example, teaches users how to organize disaster response and lets them view digital documents on a portable computer, electronic tablet or smart glasses while traveling to the evacuation location. Advanced models and broad data analyses have led to innovative disaster management methods that visualize disaster incidents that have not occurred, such as the Motion Hazard Map (MHM) on smart devices and the tsunami evacuation drill (TED) framework [33], which is configured for simulation using Google Maps. The causes of street flooding have also been identified through observations, road profiles and flood simulations [23], and suitable solutions have been suggested. Using flood simulations, relatively inexpensive solutions to the traffic problems created by the floods, and other variables, have been analyzed.
In the Social Networking Model, the media play a very significant role in disaster management. The didactic role of the media differs only in content. During the planning stage, audiences seek information about risk, not preparedness. During the impact phase (the frightening moment), they get emotional support from the media and connections to the outside world that break their isolation. After the disaster, the media focus on the most affected areas, providing estimates of harm and loss and assisting communities in their recovery efforts. During recovery, after the impact, audiences seek to know the conditions of other communities.
Crowdsourcing, crowdtasking and Collaborative Disaster Management ease the difficult task of understanding voluminous, high-velocity data.
Fig. 2. Collaboration.
In Collaborative Disaster Management, large paper charts retain a distinct advantage in some situations, such as disaster response, because they combine high resolution and portability. Geo-collaboration [8] is community work, enabled by geo-spatial information technology, on problems of geographical scale. A geo-collaborative, Web-enabled framework is designed for the unique characteristics of mobile and ubiquitous computing environments; it promotes visualization and asynchronous and online interaction between actors, and supports distributed spatial and temporal cognition.
As for crowdsourcing and crowdtasking, the advantages of work-sharing networks and social networking models (information collection, quasi-journalistic editing and crowdsourcing) in disaster management have been evaluated. Using motivational analysis to assess the app features most likely to optimize continued user interaction, a modern method of developing a community-based computing environment acts as a real-time dashboard for the government agencies responsible for monitoring populations during disasters. The continued engagement of users is measured by the performance of community-based computer systems such as eBayanihan [28]. Crowdsourcing [28] can be a feasible production instrument, in which intrinsic motives far outweigh extrinsic motivations (such as monetary reward), as shown by the merits of unaffiliated volunteers. Tools such as Ushahidi [28] allow people to quickly access relevant information, such as reports on crisis situations and needs in their community, based on their geographical position; they show the significance of volunteer motivation in a serious gaming scenario simulating involvement in crisis events. Computerized application guidance systems [6], known as public safety systems, are used to rapidly organize public emergency services and save lives and property.
In the P2P Management Model, the adaptability of P2P networks [61] can be exploited to meet the characteristics of disaster situations. Peer-to-peer (P2P) is a decentralized computer network model in which transactions take place between equally accountable nodes [61]. The peer-to-peer network is used to interconnect field staff and to maintain the disaster management system [62] using a single active link between a peer and the control room, as in geo-collaborative implementations [62] based on P2P principles.
There are also Android and iOS applications, including an Android chat application [63] that uses Wi-Fi peer-to-peer communication to allow communication with others in disaster situations [64].
3.1 Discussion About Disaster Management Models
Academics and organizations have suggested different models for disaster recovery. Despite their success in some areas, catastrophes still pose a major challenge to sustainable growth. Work on strategic management [65] showed that a comprehensive model should include all three listed models, given the complementarity between disaster management models. A quantitative method may seem more accurate than a qualitative one, but qualitative risk analysis is well suited to assessing probability and prioritizing risk in a way that is simple to understand, by rating severity in broader terms. It also makes it easier to recognize areas needing special attention, and can be used to manage risk in real time at any point of the project. A combined approach is undeniably stronger. Several crisis management applications exist (see Table 3). In the alert/mitigation phase, there are applications for disaster education [4, 33, 49, 50, 52–59], simulation [51] and crowdsourcing [66] on Twitter only, for forecasting on Twitter and Facebook [3] and on all the Web [4, 5], and for collaboration [46]. In the preparedness phase, there are applications for situational awareness [66–70] and damage assessment [68–71] on Twitter only, and in the response phase for post-disaster coordination [60, 70]. There is, however, no application in the recovery phase.
Table 3. Examples of social media-based disaster management in different phases.

| Phase | Disaster management actions | Social media applications |
|---|---|---|
| 1. Warning/mitigation | Disaster education | [4, 33, 49, 50, 52–59] |
| 1. Warning/mitigation | Simulation | Social Networking Model via a: only Twitter: [51]; b: Twitter & Facebook: /; c: All the Web: / |
| 1. Warning/mitigation | Forecasting | Social Networking Model via a: only Twitter: /; b: Twitter & Facebook: [3]; c: All the Web: [4, 5] |
| 1. Warning/mitigation | Collaboration | Social Networking Model: [46] |
| 1. Warning/mitigation | Crowdsourcing | Social Networking Model via a: only Twitter: [66]; b: Twitter & Facebook: /; c: All the Web: / |
| 2. Preparedness | Risk assessment and reduction | Social Networking Model via a: only Twitter: /; b: Twitter & Facebook: /; c: All the Web: / |
| 2. Preparedness | Situational Awareness | Social Networking Model via a: only Twitter: [66–70, 72]; b: Twitter & Facebook: /; c: All the Web: / |
| 2. Preparedness | Damage Assessment | Social Networking Model via a: only Twitter: [68–72]; b: Twitter & Facebook: /; c: All the Web: / |
| 3. Response | Post-Disaster Coordination and Response | Social Networking Model via a: only Twitter: [60, 70]; b: Twitter & Facebook: /; c: All the Web: / |
| 4. Recovery | Normal Activities Resumption | Social Networking Model via a: only Twitter: /; b: Twitter & Facebook: /; c: All the Web: / |
Deep Learning and Social Media for Managing Disaster
4 Social Media

Table 4. Examples of disaster management applications using various social media.

| Ref | Identification methods | Used OSN |
|---|---|---|
| [58] | Flood Disaster Game-based Learning | Twitter |
| [59] | Educational Purposes in Higher Education with Special Reference | Twitter |
| [68] | Social-temporal context summarization | Twitter |
| [69] | Capitalizing on a TREC Track to Build a Tweet Summarization Dataset | Twitter |
| [70] | Semi-automated artificial intelligence-based classifier for Disaster Response | Twitter |
| [72] | Summarizing situational tweets: An extractive-abstractive methodology | Twitter |
| [2] | Based on Artificial NN (ANN) | Twitter & Facebook |
| [3] | Based on Artificial NN (ANN) | All the Web |
| [4] | Based on FeedForward NN (FFNN) | All the Web |
| [5] | Based on LSTM | All the Web |
Social media are sites where people share their feelings, whether on Twitter, Facebook, Viber, Messenger or any forum on the Internet (see Table 4). The information available on social networks differs from other web sources (press articles, for example) in several respects. Such messages use less formal language, may include words from more than one language, may contain grammar and spelling errors, and are, for the most part, unstructured, fuzzy and short-lived. Their length and content vary considerably [11]. Online listening tools, namely Radian6 or one of its rivals [11], automatically monitor online platforms and can gather content from websites and from all social media. In fact, many networking platforms allow access to their data via an Application Programming Interface (API) [11]. The model generated by online listening tools, which fairly represents the essentials, consists of: harvesting content, cleaning the data of non-informative material, determining relevance using a learning corpus built from tagged messages, and analyzing the results.
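The cleaning step of this pipeline can be illustrated with a minimal sketch. It is not the procedure used by Radian6 or by the cited works; the regular expressions, the function names `clean_message` and `is_informative`, and the three-token threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical patterns for illustration; real pipelines tune these per platform.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")    # three or more repeats of one character
PUNCT_RE = re.compile(r"[^\w\s]")       # Unicode-aware, so non-English letters survive


def clean_message(text: str) -> str:
    """Normalize one informal social media message."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # links carry no lexical content here
    text = MENTION_RE.sub(" ", text)     # drop @user mentions
    text = text.replace("#", " ")        # keep the hashtag word, drop the marker
    text = REPEAT_RE.sub(r"\1\1", text)  # "floooood" -> "flood"
    text = PUNCT_RE.sub(" ", text)       # strip remaining punctuation
    return " ".join(text.split())        # collapse whitespace


def is_informative(text: str, min_tokens: int = 3) -> bool:
    """Crude filter (assumed threshold): keep messages with enough tokens."""
    return len(clean_message(text).split()) >= min_tokens


cleaned = clean_message("HELP!!! Floooood in #Bejaia, see https://t.co/abc @rescue")
```

A relevance classifier trained on tagged messages would then operate on the cleaned text; very short messages are dropped here only as a stand-in for a real informativeness filter.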
5 Retrieving Relevant Information from Social Media

After an overview of techniques for retrieving relevant knowledge on social networks, we will study artificial learning methods, from machine learning to deep learning.

5.1 Classification Algorithms

We will study the techniques for retrieving relevant knowledge on social networks, from Support Vector Classification to Neural Learning, including Random Forest Classification.
5.1.1 Support Vector Classification

The approach used for support vector classification can also be extended to solve regression problems. Training points that lie beyond the margin are not taken into account in the cost function used to build the support vector classification model. Building the model therefore depends only on a subset of the training data [73].

5.1.2 Random Forest Classification

To control over-fitting and increase predictive accuracy, a random forest builds many decision trees on random selections of data and variables and averages their predictions. Every tree in the forest is grown from a bootstrap sample of the training set. In addition, when a node is split during tree construction, the selected split is not the best split among all features, but the best split among a random subset of features. Because of this randomness, the bias of the forest usually increases slightly, but averaging also decreases its variance, which usually more than compensates for the increase in bias [74].

5.1.3 Neural Learning

Neural Learning (NL) is an artificial intelligence technology that enables computers to learn without having been explicitly programmed to do so. To learn and improve, however, computers need data to analyze and train on [75]. Abiodun et al. (2018) [75] recommend that future research focus on combining various Artificial Neural Network (ANN) models, namely Machine Learning and Deep Learning, into one network-wide application.

5.2 Machine Learning (ML)

Although machine learning is not a new concept, many people are still uncertain what it entails. It is a science that uses statistics, data mining, pattern recognition, and predictive analysis to identify patterns in data and make predictions. The first algorithms were developed at the end of the 1950s. The best known of these is the Perceptron (see Table 5).

Table 5. Neural learning architecture.
| Type | Architecture | Model - Training - Algorithm - Application | Ref |
|---|---|---|---|
| Neural Network | Machine Learning | Discriminative - Supervised - Gradient Descent based Backpropagation - Classification | [2, 3] |
The perceptron is an algorithm for the supervised learning of binary classifiers. A binary classifier is a function that decides whether or not an input, represented by a vector of numbers, belongs to some specific class. It makes its predictions with a linear predictor function that combines a set of weights with the feature vector.
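The perceptron rule described above can be sketched in a few lines. This is a minimal illustration rather than the model used in [2, 3]; the function names, the learning rate and the toy AND dataset are assumptions.

```python
def predict(weights, bias, x):
    """Linear predictor followed by a threshold: 1 if w.x + b > 0, else 0."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0


def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Classic perceptron rule: adjust the weights only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            delta = y - predict(weights, bias, x)  # -1, 0 or +1
            if delta != 0:
                weights = [w + lr * delta * xi for w, xi in zip(weights, x)]
                bias += lr * delta
                errors += 1
        if errors == 0:  # every sample classified correctly: stop early
            break
    return weights, bias


# Toy, linearly separable problem (illustrative only): logical AND.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
```

Because AND is linearly separable, the rule settles on weights that classify all four inputs correctly. A problem such as XOR is not linearly separable, so a single perceptron cannot solve it.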
5.3 Deep Learning (DL)

Neural learning is carried out by Feedforward or Feedback neural networks. In Feedforward, we have supervised learning, such as the Feedforward neural network itself for classification [4], the convolutional neural network [37–39] for image recognition/classification, and the Residual neural network (ResNets) [40] for image recognition. Tables 6, 7 and 8 show, respectively, the classification of the FeedForward, the FeedBackward, and the Radial Basis Function and Kohonen Self Organizing Neural Network architectures of Deep Learning.

Table 6. Classification of deep learning architectures: FeedForward neural network.

| Architecture | Advantages | Limitations |
|---|---|---|
| FFNN | Supervised - Binary classification - Gradient Descent based Backpropagation | No Extrapolation [44] |
| ConvNets (CNN) | Discriminative - Supervised - Gradient Descent based Backpropagation - Image recognition/classification | Temporal modeling - No increasing accuracy with stacking layers - No coding objects position/orientation [76, 77] |
| ResNets | Discriminative - Supervised - Gradient Descent based Backpropagation - Image recognition | Increased complexity-BatchAdding skip level connections |
| Autoencoder | Generative - Unsupervised - Backpropagation - Dimensionality Reduction - Encoding | Does not discover slow modes [80] |
| Generative Adversarial Network (GAN) | Generative - Discriminative - Unsupervised - Backpropagation - Fake realistic - Image | Distribution learning poorly made^a |
| Restricted Boltzmann Machine (RBM) | Supervised/unsupervised - Generative with Discriminative finetuning - Unsupervised - Gradient Descent based Contrastive divergence - Dimensionality Reduction - Feature learning - Classification - Collaborative filtering | Difficult training; tricky partition function making computing log likelihood infeasible |

^a https://simons.berkeley.edu/news/research-vignette-promise-and-limitations-generative-adversarial-nets-gans.
For unsupervised learning, we have the Autoencoder [41] for dimensionality reduction and encoding, and the Generative Adversarial Network [42] for generating realistic fake data, reconstruction of 3D models or image improvement. With supervised or unsupervised learning, we have the Restricted Boltzmann Machine [41] for dimensionality reduction, feature learning, topic modeling, classification, collaborative filtering or many-body quantum mechanics. Neural learning can also be trained in supervised/unsupervised ways by the Radial Basis Function Network [81] for K-means clustering, least square function, function approximation and time series prediction, or in unsupervised ways by the Kohonen Self Organizing Network [81] for dimensionality reduction, optimization problems or clustering analysis. In Feedback, we have only supervised learning, such as the Recurrent Neural Network [5, 41], the Bidirectional Recurrent Neural Network [42], Long Short-Term Memory [5, 41], Fully Connected-LSTM [42] and Bi-Directional-LSTM [38], trained with backpropagation through time and applied to natural language processing and language translation.

Table 7. Classification of deep learning architectures: FeedBackward neural network.

| Architecture | Advantages | Limitations |
|---|---|---|
| RNN | Discriminative - Supervised - Gradient Descent & Backpropagation through Time - Natural Language Processing - Language Translation | Difficult time series inference - unsupervised in negative time [79] |
| Bidirectional RNN (BRNN) | Discriminative - Supervised - Gradient Descent & Backpropagation through Time - Natural Language Processing - Language Translation | Trained with input information limitation up to preset future frame [79] |
| LSTM | Discriminative - Supervised - Gradient Descent & Backpropagation - Natural Language Processing - Translation | No obtaining well-defined temporal information [77, 78] |
| Fully Connected-LSTM | Discriminative - Supervised - Gradient Descent & Backpropagation through Time - Natural Language Processing - Language Translation | No obtaining well-defined temporal information [43, 77, 78] |
| Bi-Directional LSTM | Discriminative - Supervised - Gradient Descent - Backpropagation - Natural Language Processing - Translation | Bad Presentation with multi-level features [76] |

Table 8. Classification of deep learning architectures: Radial Basis Function Neural Network and Kohonen Self Organizing Neural Network.

| Architecture | Advantages | Limitations |
|---|---|---|
| Radial Basis Function Neural Network (Radial Basis Fct NN) | Discriminative - Supervised/Unsupervised - K-means Clustering - Least Square Fct - Fct approximation - Time series prediction | Slow classification due to RBF fct computation |
| Kohonen Self Organizing Neural Network (Kohonen SO NN) | Generative - Unsupervised - Competitive Learning - Dimensionality Reduction - Optimization problems - Clustering analysis | SOM algorithm Problems^a |

^a https://pdfs.semanticscholar.org/c93a/e9ffeda90c9ea4cd951989a00a0afde8845b.pdf.
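What distinguishes the feedback (recurrent) networks mentioned above from feedforward ones is that the hidden state computed at one time step is fed back as an input at the next. The sketch below shows this for a single recurrent unit. The function name `rnn_forward`, the scalar weights and the toy input sequence are assumptions for illustration, and training by backpropagation through time is not shown.

```python
import math


def rnn_forward(inputs, w_x, w_h, b, h0=0.0):
    """Run a one-unit recurrent layer over a sequence.

    Recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # previous state feeds back in
        states.append(h)
    return states


# A single pulse followed by silence (illustrative input).
sequence = [1.0, 0.0, 0.0]

# With w_h = 0 the unit has no memory: the state depends on the current input only.
no_feedback = rnn_forward(sequence, w_x=1.0, w_h=0.0, b=0.0)

# With w_h != 0 the first input still influences the later states.
with_feedback = rnn_forward(sequence, w_x=1.0, w_h=0.5, b=0.0)
```

In the second run the states after the pulse stay positive and decay gradually, which is the simplest form of the temporal memory that LSTM variants are designed to preserve over longer sequences.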
6 Conclusion and Future Works

This work aims to explore the potential of social networking in managing disasters. To that end, we defined and conceptualised catastrophes and crisis management, proposed a catastrophe classification, explored and analyzed recent surveys, proposed a classification of recent disaster management surveys, and explored and analyzed social media-based crisis management packages in the different phases. The work shows the impact of the social networking paradigm on the improvement of the catastrophe management process, where interactions involving communities are discussed. Communities have specific functional needs to act during the different stages of the crisis management process. In addition, the role of communication means in the mitigation, response and recovery phases is also presented. We have explored the potential of P2P networks in managing catastrophes. The adaptability of P2P networks [61] should be exploited to respond to the characteristics of crisis situations. New wireless applications are also possible in mobile networks, especially with Web 3.0. Sensor technology likewise holds great promise for disaster-prone regions, which need comprehensive and effective warning models to protect lives and property. We studied information retrieval techniques for social media, from Support Vector Classification and Random Forest Classification to Neural Learning. We reviewed the forms of neural learning, from simple neural learning to Deep Learning, and proposed a classification of Deep Neural learning architectures. Future work will be devoted to Web 3.0 with Deep Learning, Big Data and even Supercomputing.

Acknowledgments. We acknowledge the support of the “Direction Générale de la Recherche Scientifique et du Développement Technologique (DGRSDT)”, MESRS, Algeria.
References

1. Hui, L.H.D., Tsang, P.K.E.: Everyday knowledge and disaster management: the role of social media. In: Robertson, M., Tsang, P.K.E. (eds.) Everyday Knowledge, Education and Sustainable Futures. EARICP, vol. 30, pp. 107–121. Springer, Singapore (2016). https://doi.org/10.1007/978-981-10-0216-8_8
2. Bouzidi, Z., Boudries, A., Amad, M.: A new efficient alert model for disaster management. In: Proceedings of Conference AIAP 2018. Artificial Intelligence and Its Applications, El-Oued, Algeria (2018)
3. Bouzidi, Z., Amad, M., Boudries, A.: Intelligent and real-time alert model for disaster management based on information retrieval from multiple sources. Int. J. Adv. Media Commun. 7(4), 309–330 (2019). https://doi.org/10.1504/IJAMC.2019.111193
4. Bouzidi, Z., Boudries, A., Amad, M.: Towards a smart interface-based automated learning environment through social media for disaster management and smart disaster education. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) SAI 2020. AISC, vol. 1228, pp. 443–468. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-52249-0_31
5. Bouzidi, Z., Amad, M., Boudries, A.: Deep learning-based automated learning environment using smart data to improve corporate marketing, business strategies, fraud detection in financial services and financial time series forecasting. In: International Conference on “Managing Business Through Web Analytics - (ICMBWA 2020)”. Khemis Miliana University, Algeria (2020). Accepted
6. Leitinger, S.H.: Comparison of GIS-based public safety systems for emergency management. In: Proceedings of 24th Urban Data Management Symposium (2004)
7. Hristidis, V., Chen, S.-C., Li, T., Luis, S., Deng, Y.: Survey of data management and analysis in disaster situations. J. Syst. Softw. 83, 1701–1714 (2016)
8. Benali, M., Ghomari, A.R.: Information and knowledge driven collaborative crisis management: a literature review. In: 3rd International Conference on ‘Information and Communication Technologies for Disaster Management (ICT-DM)’, Vienna, Austria (2016)
9. Ogie, R.I., Rho, J.C., Clarke, R.J.: Artificial intelligence in disaster risk communication: a systematic literature review. In: Proceedings of the 5th International Conference on Information and Communication Technologies for Disaster Management (ICT-DM 2018) (2018). https://doi.org/10.1109/ict-dm.2018.8636380
10. Meissner, A., Luckenbach, T., Risse, T., Kirste, T., Kirchner, H.: Design challenges for an integrated disaster management communication and information system. In: The First IEEE Workshop on Disaster Recovery Networks (DIREN 2002), in Conjunction with IEEE INFOCOM, New York, USA (2002)
11. Imran, M., Ofli, F., Caragea, D., Torralba, A.: Using AI and social media multimodal content for disaster response and management: opportunities, challenges, and future directions. Inf. Process. Manag. 57(5), 1–9 (2020). http://sci-hub.tw/10.1016/j.ipm.2020.102261
12. Nazer, T.H., Xue, G., Ji, Y., Liu, H.: Intelligent disaster response via social media analysis a survey. SIGKDD Explor. Newslett. 19(1), 46–59 (2017)
13. Imran, M., Castillo, C., Diaz, F., Vieweg, S.: Processing social media messages in mass emergency: a survey. ACM Comput. Surv. (CSUR) 47(67), 1–38 (2015). https://doi.org/10.1145/2771588
14. Arinta, R., Emanuel, A.: Natural disaster application on big data and machine learning: a review (2019). https://doi.org/10.1109/ICITISEE48480.2019.9003984
15. Senaratne, H., Mobasheri, A., Ahmed Loai, A.A., Cristina, C., Mordechai, (M.)H.: A review of volunteered geographic information quality assessment methods. Int. J. Geogr. Inf. Sci. 31(1), 139–167 (2017). https://doi.org/10.1080/13658816.2016.1189556
16. Haworth, B., Bruce, E.: A review of volunteered geographic information for disaster management. Geogr. Compass J. 9(5), 237–250 (2015). https://doi.org/10.1111/gec3.12213
17. Klonner, C., Marx, S., Uson, T., Porto de Albuquerque, J., Hofle, B.: Volunteered geographic information in natural hazard analysis: a systematic literature review of current approaches with a focus on preparedness and mitigation. ISPRS Int. J. Geo-Inf. 5(7) (2016). https://doi.org/10.3390/ijgi5070103
18. Haworth, B.T.: Emergency management perspectives on volunteered geographic information: opportunities, challenges and change. Comput. Environ. Urban Syst. 57, 189–198 (2016). https://doi.org/10.1016/j.compenvurbsys.2016.02.009
19. Saroj, A., Pal, S.: Use of social media in crisis management: a survey. Int. J. Disaster Risk Reduction (2020). https://doi.org/10.1016/j.ijdrr.2020.101584
20. Ruggiero, A., Vos, M.: Social media monitoring for crisis communication: process, methods and trends in the scientific literature. Online J. Commun. Media Technol. 4(1), 105–130 (2014)
21. Poddar, S., Mondal, M., Ghosh, S.: A survey on disaster: understanding the after-effects of super-cyclone amphan and helping hand of social media. Computer Science, Computers and Society (2020)
22. Knuth, D., Szymczak, H., Kuecuekbalaban, P., Schmidt, S.: Social media in emergencies, how useful can they be. In: 3rd International Conference on Information and Communication Technologies for Disaster Management (ICT-DM) (2016)
23. Lagmay, A.M., et al.: Street floods in Metro Manila and possible solutions. J. Environ. Sci. 59, 39–47 (2017)
24. Kirschenbaum, A.: Preparing for the inevitable: environmental risk perceptions and disaster preparedness. Int. J. Mass Emerg. Disasters 23(2), 97–127 (2005)
25. Jabareen, Y.: Planning the resilient city: concepts and strategies for coping with climate change and environmental risk. Cities 31, 220–229 (2013). https://doi.org/10.1016/j.cities.2012.05.004
26. Yu, M., Yang, C., Li, Y.: Big data in natural disaster management: a review. Geosciences 8(5) (2018). https://doi.org/10.3390/geosciences8050165
27. Poblet, M., García-Cuesta, E., Casanovas, P.: Crowdsourcing tools for disaster management: a review of platforms and methods. In: Casanovas, P., Pagallo, U., Palmirani, M., Sartor, G. (eds.) AICOL-2013. LNCS (LNAI), vol. 8929, pp. 261–274. Springer, Heidelberg (2014). https://doi.org/10.1007/978-3-662-45960-7_19
28. Middelhoff, M., et al.: Crowdsourcing and crowdtasking in crisis management. In: 3rd International Conference on ‘Information and Communication Technologies for Disaster Management (ICT-DM)’, Vienna, Austria (2016)
29. Torani, S., Majd, P.M., Maroufi, S.S., Dowlati, M., Sheikhi, R.A.: The importance of education on disasters and emergencies: a review article. J. Educ. Health Promot. 8(85) (2019). https://doi.org/10.4103/jehp.jehp_262_18
30. Lin, C.K., Nifa, F.A.A., Musa, S., Shahron, S.A., Anuar, N.A.: Challenges and opportunities of disaster education program among UUM student. In: Proceedings of the 3rd International Conference on Applied Science and Technology (ICAST 2018), Georgetown, Penang, Malaysia (2018). https://doi.org/10.1063/1.5055440
31. Satoh, K., Weiguo, S., Yang, K.T.: A study of forest fire danger prediction system in Japan. In: Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA 2004), Zaragoza, Spain, pp. 598–602 (2004). https://doi.org/10.1109/DEXA.2004.1333540
32. Kohyu, S., Weiguo, S., Yang, K.T.: A study of forest fire danger prediction system in Japan. In: Proceedings of the 15th International Workshop on Database and Expert Systems Applications (DEXA 2004) (2004)
33. Kawai, J., Mitsuhara, H., Shishibori, M.: Tsunami evacuation drill system using motion hazard map and smart devices. In: 3rd International Conference on Information and Communication Technologies for Disaster Management (ICT-DM), pp. 13–15 (2016)
34. Ashir, A.: Use of social media in disaster management. In: ICITE 2012 Conference, Hong Kong, vol. IPEDR, no. 39 (2011)
35. Lamsal, R.: Design and analysis of a large-scale COVID-19 tweets dataset. Appl. Intell. 51(5), 2790–2804 (2020). https://doi.org/10.1007/s10489-020-02029-z
36. Asghar, S., Alahakoon, D., Churilov, L.: A dynamic integrated model for disaster management decision support systems. Int. J. Simul. Syst. Sci. Technol. 6(10) (2005)
37. Alam, F., Imran, M., Ofli, F.: Image4Act: online social media image processing for disaster response. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2017), pp. 601–604 (2017). https://doi.org/10.1145/3110025.3110164
38. Kabir, M.Y., Madria, S.K.: A deep learning approach for tweet classification and rescue scheduling for effective disaster management. In: Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL 2019), pp. 269–278 (2019). https://doi.org/10.1145/3347146.3359097
39. Nguyen, D.T., Al-Mannai, K., Joty, S.R., Sajjad, H., Imran, M., Mitra, P.: Robust classification of crisis-related data on social networks using convolutional neural networks. In: ICWSM, pp. 632–635 (2017)
40. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 770–778. IEEE (2016). https://doi.org/10.1109/CVPR.2016.90
41. Wu, Q., Ding, K., Huang, B.: Approach for fault prognosis using recurrent neural network. J. Intell. Manuf. 31(7), 1621–1633 (2018). https://doi.org/10.1007/s10845-018-1428-5
42. Canon, M.J., Satuito, A., Sy, C.: Determining disaster risk management priorities through a neural network-based text classifier. In: 2018 International Symposium on Computer, Consumer and Control (IS3C), Taichung, Taiwan, pp. 237–241 (2018). https://doi.org/10.1109/IS3C.2018.00067
43. Zhao, J., Deng, F., Cai, Y., Chen, J.: Long short-term memory - fully connected (LSTM-FC) neural network for PM2.5 concentration prediction. Chemosphere 220 (2018). https://doi.org/10.1016/j.chemosphere.2018.12.128
44. Haley, P.J., Soloway, D.: Extrapolation limitations of multilayer feedforward neural networks. In: Proceedings of the 1992 IJCNN International Joint Conference on Neural Networks, Baltimore, MD, USA, vol. 4, pp. 25–30 (1992). https://doi.org/10.1109/IJCNN.1992.227294
45. Chikoto, G.L., Sadiq, A.-A., Fordyce, E.: Disaster mitigation and preparedness comparison of nonprofit, public, and private organizations. Nonprofit Volunt. Sect. Q. 42(2), 391–410 (2013)
46. Waugh Jr, W.L., Streib, G.: Collaboration and leadership for effective emergency management. Public Adm. Rev. 66(s1) (2006). https://doi.org/10.1111/j.1540-6210.2006.00673.x
47. Yates, D., Paquette, S.: Emergency knowledge management and social media technologies: a case study of the 2010 Haitian earthquake. Int. J. Inf. Manage. 31, 6–13 (2011)
48. Coletti, P.G.S., Mays, R.E., Widera, A.: Bringing technology and humanitarian values together: a framework to design and assess humanitarian information systems. In: International Conference on Information and Communication Technologies for Disaster Management, Munster, Germany, vol. 4 (2017). https://doi.org/10.1109/ICT-DM.2017.8275687
49. Iguchi, K., Mitsuhara, H., Shishibori, M.: Evacuation instruction training system using augmented reality and a smartphone-based head-mounted display. In: 3rd International Conference on Information and Communication Technologies for Disaster Management (ICT-DM), Vienna, Austria (2016)
50. Mitsuhara, H., et al.: Penumbral tourism: place-based disaster education via real-world disaster simulation. In: 3rd International Conference on ‘Information and Communication Technologies for Disaster Management (ICT-DM)’, Vienna, Austria (2016)
51. Tobita, J., Fukuwa, H., Mori, M.: Integrated disaster simulator using WebGIS and its application to community disaster mitigation activities. J. Nat. Disaster Sci. 30(2), 71–82 (2009)
52. Kobayashi, K., Narita, A., Hirano, M., Tanaka, K., Katada, T., Kuwasawa, K.: DIGTable: a tabletop simulation system for disaster education. In: Proceedings of Sixth International Conference on Pervasive Computing (Pervasive 2008), pp. 57–60 (2008)
53. Tsai, M.-H., Chang, Y.-L., Kao, C., Kang, S.-C.: The effectiveness of a flood protection computer game for disaster education. Vis. Eng. 3(1), 1–13 (2015). https://doi.org/10.1186/s40327-015-0021-7
54. Dunwell, I., Petridis, P., Arnab, S., Protopsaltis, A., Hendrix, M., Freitas, S.: Blended game-based learning environments: extending a serious game into a learning content management system. In: Proceedings of Third International Conference on Intelligent Networking and Collaborative Systems (INCoS), pp. 830–835 (2011)
55. Smith, S., Ericson, E.: Using immersive game-based virtual reality to teach fire-safety skills to children. Virtual Reality 13(2), 87–99 (2009). https://doi.org/10.1007/s10055-009-0113-6
56. Wang, B., Li, H., Rezgui, Y., Bradley, A., Ong, H.N.: BIM based virtual environment for fire emergency evacuation. Sci. World J. 2014 (2014)
57. Fischer, J.E., Jiang, W., Moran, S.: AtomicOrchid: a mixed reality game to investigate coordination in disaster response. In: Herrlich, M., Malaka, R., Masuch, M. (eds.) ICEC 2012. LNCS, vol. 7522, pp. 572–577. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33542-6_75
58. Zaini, N.A., Noor, S.F.M., Zailani, S.Z.M.: Design and development of flood disaster game-based learning based on learning domain. Int. J. Eng. Adv. Technol. (IJEAT) 9(4), 679–685 (2020). https://doi.org/10.35940/ijeat.C6216.049420
59. Vivakaran, M.V., Neelamalar, M.: Utilization of social media platforms for educational purposes among the faculty of higher education with special reference to Tamil Nadu. High. Educ. Future 5(1), 4–19 (2018). https://doi.org/10.1177/2347631117738638
60. Qiu, L., Du, Z., Zhu, Q., Fan, Y.: An integrated flood management system based on linking environmental models and disaster-related data. Environ. Model. Softw. 91, 111–126 (2017). https://doi.org/10.1016/j.envsoft.2017.01.025
61. Androutsellis-Theotokis, S., Spinellis, D.: A survey of peer-to-peer content distribution technologies. ACM Comput. Surv. 36(4), 335–371 (2004)
62. Bortenschlager, M., Leitinger, S., Rieser, H., Steinmann, R.: Towards a P2P-based geocollaboration system for disaster management. In: Probst, F., Keler, C. (eds.) GI-Days (2007)
63. Sonawane, R., Doge, S., Vatti, R.: WiFi peer-to-peer communication in disaster management. Int. J. Electr. Electron. Comput. Sci. Eng. (IJEECSE) 4(6) (2017)
64. Geibig, J.: Peer-to-peer algorithms in wireless ad-hoc networks for Disaster Management. Fach Informatik eingereicht an der Mathematisch-Naturwissenschaftlichen Fakultat der Humboldt-Universitat zu Berlin, Berlin (2015)
65. Nojavan, M., Salehi, E., Omidvar, B.: Conceptual change of disaster management models: a thematic analysis. Jamba J. Disaster Risk Stud. 10 (2018). https://doi.org/10.4102/jamba.v10i1.451
66. Rogstadius, J., Vukovic, M., Teixeira, C.A., Kostakos, V., Karapanos, E., Laredo, J.A.: CrisisTracker: crowdsourced social media curation for disaster awareness. IBM J. Res. Dev. 57(5) (2013). https://doi.org/10.1147/jrd.2013.2260692
67. Clerveaux, V., Spence, B., Katada, T.: Promoting disaster awareness in multicultural societies: the DAG approach. Disaster Prev. Manag. Int. J. 19(2), 199–218 (2010). https://doi.org/10.1108/09653561011038002
68. He, R., Liu, Y., Yu, G., Tang, J., Hu, Q., Dang, J.: Twitter summarization with social-temporal context. World Wide Web 20(2), 267–290 (2016). https://doi.org/10.1007/s11280-016-0386-0
69. Dussart, A., Pinel-Sauvagnat, K., Hubert, G.: Capitalizing on a TREC track to build a tweet summarization dataset. In: Text Retrieval Conference (TREC 2020) (2020)
70. Lamsal, R., Kumar, T.V.V.: Classifying emergency tweets for disaster response. Int. J. Disaster Response Emerg. Manag. (IJDREM) 3(1), 14–29 (2020). https://doi.org/10.4018/IJDREM.2020010102
71. Kakooei, M., Baleghi, Y.: Fusion of satellite, aircraft, and UAV data for automatic disaster damage assessment. Int. J. Remote Sens. 38(8–10) (2017). https://doi.org/10.1080/01431161.2017.1294780
72. Rudra, K., Goyal, P., Ganguly, N., Imran, M., Mitra, P.: Summarizing situational tweets in crisis scenarios: an extractive-abstractive approach. IEEE Trans. Comput. Soc. Syst. 6(5), 981–993 (2019). https://doi.org/10.1109/tcss.2019.2937899
73. Curtin, R.R., et al.: MLPACK: a scalable C++ machine learning library. J. Mach. Learn. Res. 14, 801–805 (2013)
74. Liaw, A., Wiener, M.: Classification and regression by randomForest. R News 2(3), 18–22 (2002)
75. Abiodun, O.I., Jantan, A., Omolara, A.E., Dada, K.V., Mohamed, N.A., Arshad, H.: State-of-the-art in artificial neural network applications: a survey. Heliyon 4(11) (2018). https://doi.org/10.1016/j.heliyon.2018.e00938
76. Nguyen, N.K., Le, A.-C., Pham, H.T.: Deep bi-directional long short-term memory neural networks for sentiment analysis of social data. In: Huynh, V.-N., Inuiguchi, M., Le, B., Le, B.N., Denoeux, T. (eds.) IUKM 2016. LNCS (LNAI), vol. 9978, pp. 255–268. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-49046-5_22
77. Sainath, T., Vinyals, O., Senior, A., Sak, H.: Convolutional, long short-term memory, fully connected deep neural networks, pp. 4580–4584 (2015). https://doi.org/10.1109/ICASSP.2015.7178838
78. Roshan, S., Srivathsan, G., Deepak, K., Chandrakala, S.: Violence detection in automated video surveillance: recent trends and comparative studies, pp. 157–171 (2020). https://doi.org/10.1016/B978-0-12-816385-6.00011-8
79. Berglund, M., Raiko, T., Honkala, M., Karkkainen, L., Vetek, A., Karhunen, J.: Bidirectional Recurrent Neural Networks as Generative Models. MIT Press, Cambridge (2015)
80. Chen, W., Sidky, H., Ferguson, A.: Capabilities and limitations of time-lagged autoencoders for slow mode discovery in dynamical systems. J. Chem. Phys. 151(6) (2019). https://doi.org/10.1063/1.5112048
81. Pouyanfar, S., Tao, Y., Tian, H., Chen, S.-C., Shyu, M.-L.: Multimodal deep learning based on multiple correspondence analysis for disaster management. World Wide Web 22(5), 1893–1911 (2018). https://doi.org/10.1007/s11280-018-0636-4
A Framework for Adaptive Mobile Ecological Momentary Assessments Using Reinforcement Learning

Lihua Cai, Laura E. Barnes, and Mehdi Boukhechba

University of Virginia, Charlottesville, VA 22904, USA
{lc3cp,lb3dp,mob3f}@virginia.edu
Abstract. Mobile ecological momentary assessments (mEMAs) require substantial user effort to complete, resulting in low user compliance. One major source of noncompliance is triggering mEMAs at inopportune moments. In this work, we propose a framework for implementing adaptive mEMAs using reinforcement learning (RL) to address the timing and context challenge, aiming to improve long-term response compliance. To effectively model user state, we also propose a two-level user model with both momentary and routine state features. A novel k-routine mining algorithm is developed to extract routine state from passive sensing data. Using real mobile sensing data collected from 220 participants for over two weeks, we show that our proposed RL strategies consistently outperform the baseline methods, including a random strategy and a supervised strategy, in user compliance.

Keywords: EMA · Ecological momentary assessment · Mobile sensing · Reinforcement learning

1 Introduction
Mobile ecological momentary assessment (mEMA) is a digital surveying method that attempts to collect critical measurements of user behaviors and mental states in situ on mobile devices, most popularly on personal smartphones. Most recently, mEMA has also been implemented on wearable devices such as smartwatches [12]. Unlike traditional retrospective survey methods (e.g., telephone, paper, web surveys), mEMA frequently collects self-reports to capture the dynamics of human experiences, while reducing recall bias and enhancing ecological validity [33]. mEMA has become the typical choice of data collection in areas such as clinical assessment [8], psychological/cognitive processes and their mechanisms [32, 37], and mobile health [14], owing to the increasing ownership of smartphones and accessibility of wireless networks in the past decade [5]. Many EMA studies also captured passive sensing data while collecting EMAs, thereby enabling context-aware mEMA [3]. Although becoming more convenient, active participation in mEMAs still demands substantial effort from users, and poses significant compliance challenges over time [33].

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 31–50, 2022. https://doi.org/10.1007/978-3-030-82193-7_3

Low response compliance in mEMAs can be attributed to declining user motivation over time. Existing research has applied human behavior theories to engage and motivate users in mobile sensing applications (e.g., substance use logging [28] and weight management [36]). While motivation has been an important challenge to address in mEMAs, low compliance can also result from another significant challenge, inopportune timings and contexts, which could be caused by 1) unavailability at the moment of sensing requests, and 2) interruptions that distract the user's attention from their current, higher-priority task(s) [3]. Underlying these causes are the different user contexts and cognitive states (e.g., activity, location, time, and stress level). At a given data collection moment, the user may not be available and interruptible in certain contexts, and so fails to attend and respond to the sensing request (e.g., a student in a class). Our goal is to identify opportune moments to trigger EMAs to the users, while not interrupting them at unsuitable moments, thereby achieving higher compliance in the long term.

Adaptive mEMAs leverage passive sensing to understand the user's context and, based on this understanding, adapt the trigger timings to those moments that are more likely free of interruption and convenient for the user to respond. In addition to being context-aware, adaptive mEMAs also need to avoid the bias in the collected data that comes from being selective in trigger timings [15].

In this work, we design adaptive mEMA strategies using the reinforcement learning (RL) framework under a formulation that reduces bias in the collected mEMAs. Our contributions are threefold:
1) We propose a generalizable framework for the implementation of adaptive mEMAs using RL.
2) We propose a two-level user state model to capture both momentary user state and higher level routine state of the user. A concept called k-routine and its associated learning algorithm are developed to represent user’s high level contextual state. And 3) we demonstrate the feasibility of our proposed approach in a set of RL algorithms using real world mEMA data, and compare their performance with two baseline strategies.
2 Related Work
The response compliance problem in mEMA has been studied by different groups within the human-computer interaction (HCI) community. Most of the existing works focus on understanding different sets of factors that may influence mEMA response compliance. Serre et al. [29], Sokolovsky et al. [30], and Broda et al. [6] studied the impacts of demographic and self-reported contextual factors on mEMA response compliance. Vhaduri et al. systematically investigated the impacts of various design factors on response compliance and on the quality of the collected data in mEMAs [34]. Compared with our current work, these studies did not leverage passive sensing capabilities to understand users’ contextual states, but relied on self-reports and pre-specified triggering schedules. In addition, they did not intervene with any strategies to improve user response compliance.
Adaptive Mobile EMAs
Vhaduri et al. [35], Markopoulos et al. [18], and Hofmann et al. [10] investigated the impacts of delivery timing and reminders on EMA response compliance. Their strategies, which use user-chosen delivery times and regularly dispersed reminders, are not adaptive to users’ changing contexts. Intille et al. proposed a microinteraction-based EMA method called μEMA using smartwatches. They leveraged the quick and convenient interface interaction on smartwatches, traded off higher intensity against more frequent interruptions in EMAs, and found a significantly higher compliance rate with this new approach [12]. However, smartwatches are still far less pervasive than smartphones among consumers, which limits large-scale EMA deployments for data collection and in various applications. Rabbi et al. proposed a context-assisted evening recall approach as an alternative to mEMA [27]. They showed a 5.6% increase in recall accuracy and a 27.8% increase in overall recall completion rate. In that work, context is applied to provide hints to users to reduce recall bias, not as delivery conditions to improve response compliance. We consider this approach complementary to, rather than a replacement for, our proposed adaptive mEMA approaches. Of relevance to response compliance in mEMAs is interruption management, which aims to identify opportune moments in users’ routine lives to avoid disrupting their ongoing tasks. A number of studies have investigated when to deliver emails [11], text messages [25], and phone calls [2]. For mobile notifications, researchers found that content, social relationship, and physical activity level [19], location and time [23], current task [24], current activity [7,9,22], and psychological traits [20], obtained from both passive sensing and self-reports, can be leveraged to predict opportune moments for interruptions.
Similar to our current work, these works leveraged both passive sensing data and self-reports to learn users’ contextual states and predict whether a moment is interruptible. In contrast to their supervised and rule-based methods, however, we propose to leverage the reinforcement learning framework to implement adaptive mEMAs.
3 Adaptive Mobile EMA
3.1 An Unbiased Formulation for Mobile EMA
Adaptive mEMA leverages passive sensing data to understand users’ context, and interacts with them to collect subjective data. The main goal of adaptive mEMA is to improve users’ response compliance in the long term. To achieve this goal, adaptive mEMA can be formulated as the selection of EMA timings within a given trigger budget to obtain maximum user compliance. The trigger budget is the allowable number of mEMAs that can be triggered in a given time frame. For example, the trigger budget is three when three mEMAs are delivered daily. Imposing a trigger budget is important to avoid overburdening users and to maintain user compliance [16,17]. We also need to spread the mEMAs as evenly as possible across a given time window to avoid ‘contextual dissonance’, which biases the collected data due to context selection [15]. We follow a classical approach that splits each day into a number of blocks, as shown in Fig. 1, and within each block randomly selects a time for the mEMA delivery decision [17]. Figure 1 illustrates an example with a daily budget of 3 mEMAs and six 2-h windows from 9am to 9pm. To guarantee triggering exactly 3 mEMAs daily, we take the opportunity cost into consideration and incorporate it into the decision process. For example, if we decide not to trigger mEMAs in the first three opportunities, we have no choice but to trigger them in the remaining three opportunities in order to meet the budget (scenario 1 in Fig. 1). When three mEMAs have been triggered before the end of the daily cycle, later assessment moments are no longer considered (scenario 2 in Fig. 1). When the assessment time in the last window is reached, the decision is always to trigger (scenario 3 in Fig. 1).
Fig. 1. An unbiased formulation of random time mobile EMAs with fixed trigger budget. Scenario 1: when the number of remaining mEMA windows is equal to the remaining trigger budget, an EMA must be delivered regardless of the user’s context. Scenario 2: early termination when the trigger budget is met before the end of the episode. Scenario 3: if the last mEMA window is reached, the action decision is always ‘Trigger’.
3.2 Using Reinforcement Learning Framework for Adaptive Mobile EMA
We propose to use reinforcement learning to implement adaptation at each randomly selected time, as shown in Fig. 2. RL is a natural fit for implementing adaptive mEMAs because it learns to make optimal action decisions through interactions with the application environment. It has been proposed as a framework for a special type of digital behavior change intervention, namely the Just-in-Time Adaptive Intervention (JITAI), which adapts the timing and the content of intervention delivery [21]. Adaptive mEMAs can be formulated as a discrete-time episodic sequential control problem. An episode is often chosen to be a targeted time frame (e.g., from 9am to 9pm) within a day. In each episode, we follow the above formulation, and apply the RL framework to develop sensing policies that assess the value of each randomly selected moment for the mEMA trigger decision.
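The budget-constrained episode described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `wants_trigger` policy callback, and the uniform draw of the decision time inside each window are hypothetical and not taken from the study software.

```python
import random


def sample_decision_times(rng, start_hour=9.0, n_windows=6, window_hours=2.0):
    """Draw one random decision time (in hours) inside each window of the day."""
    return [start_hour + w * window_hours + rng.uniform(0.0, window_hours)
            for w in range(n_windows)]


def run_episode(n_windows, budget, wants_trigger):
    """Return the indices of the windows in which an mEMA is triggered.

    wants_trigger(window) is a hypothetical policy callback that returns True
    when the policy would like to trigger in that window.
    """
    triggered = []
    for window in range(n_windows):
        remaining_budget = budget - len(triggered)
        remaining_windows = n_windows - window
        if remaining_budget == 0:
            break  # scenario 2: budget met, the episode terminates early
        if remaining_windows == remaining_budget:
            triggered.append(window)  # scenarios 1 and 3: forced trigger
        elif wants_trigger(window):
            triggered.append(window)  # free choice made by the policy
    return triggered


print(run_episode(6, 3, lambda w: False))  # policy never triggers voluntarily
print(run_episode(6, 3, lambda w: True))   # policy always wants to trigger
```

With a budget of 3 and 6 windows, a policy that always declines is forced to trigger in the last three windows, and a policy that always accepts exhausts the budget in the first three, so exactly 3 mEMAs are delivered in both cases.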
Fig. 2. A framework for adaptive mobile EMA using reinforcement learning. Designing RL strategies for adaptive mEMA follows these steps: 1) Design the RL algorithm; 2) Design the state space; 3) Design the action space; and 4) Design the reward signal.
The RL framework for adaptive mEMAs requires the design of a state space that captures the critical user states in mEMA response compliance, an action space that controls how mEMAs are delivered, and a reward signal that provides feedback for learning the EMA delivery policy. All the sensing data can be uploaded to the cloud for storage and post-processing, followed by policy updates using the chosen RL algorithm. On a daily cycle, the updated policy is shared with each participant’s sensing app using a push notification service.
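To make the three design elements concrete, the following Python sketch shows how a state encoding, a two-action set, and a scalar reward could plug into a value-based learner. It is a simplified illustration, not the method of Algorithm 2: it uses a plain one-step Q-learning update on a linear action-value function, omits eligibility traces and the deferred ‘not trigger’ updates, and all names (`one_hot`, `q_update`, the feature vocabulary) are assumptions made for this example.

```python
import random

ACTIONS = ("trigger", "not_trigger")


def one_hot(state, vocab):
    """Encode a dict of categorical state features as a 0/1 feature list.

    vocab is a list of (feature_name, value) pairs fixing the vector layout.
    """
    return [1.0 if state.get(name) == value else 0.0 for name, value in vocab]


def q_value(weights, phi, action):
    """Linear action value: dot product of the action's weights and features."""
    return sum(w * x for w, x in zip(weights[action], phi))


def epsilon_greedy(weights, phi, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_value(weights, phi, a))


def q_update(weights, phi, action, reward, phi_next, alpha, gamma):
    """One-step Q-learning update; phi_next is None at the end of an episode."""
    best_next = 0.0
    if phi_next is not None:
        best_next = max(q_value(weights, phi_next, a) for a in ACTIONS)
    delta = reward + gamma * best_next - q_value(weights, phi, action)
    weights[action] = [w + alpha * delta * x
                       for w, x in zip(weights[action], phi)]
    return delta


vocab = [("time", "noon"), ("location", "gym")]
weights = {a: [0.0] * len(vocab) for a in ACTIONS}
phi = one_hot({"time": "noon", "location": "home"}, vocab)
q_update(weights, phi, "trigger", 1.0, None, alpha=0.1, gamma=0.1)
print(weights["trigger"])
```

In a deployment along the lines described above, the weight vectors would be the policy that is updated in the cloud and pushed to the sensing app each day.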
4 A Two-Level User State Model
The main challenge in applying the RL framework for adaptive mEMAs is to design a state space that captures the key contextual determinants of users’ response compliance to mEMAs. Most existing approaches look at momentary features such as the current time, location, and activity of a user. Features regarding the status of the user’s mobile device (e.g., a phone call that has just ended) are also applied to make trigger decisions on mEMAs. We hypothesize that adding higher-level routine contexts to trigger decisions on mEMAs can further enhance response compliance. In this work, we define higher-level routine contexts to be the frequent patterns or arrangements that people live by each day. For example, a person goes to the gym for a workout every day at 5pm and spends roughly 2 h there. Using this example, the momentary features at 5:26pm, while the person is exercising in the gym, could be “late afternoon, gym, highly active”, and the routine feature is “daily exercise in the gym”. There are certainly more complex routines to which we have become so accustomed that we do not even notice them ourselves, and both our physical and mental states are highly affected by living through them.
Table 1. Momentary state features at each mEMA trigger decision time.
Feature | Description
Time | Early morning (8–10am), morning (10am–12pm), noon (12pm–2pm), early afternoon (2–4pm), late afternoon (4–6pm), early evening (6–8pm)
Location | Unique place labels that are learnt by a tempo-spatial clustering algorithm [13]
Speed | Being still, walking, running, being in vehicle, using average speed cutoffs (0.1, 1, 5) m/s. Speed is calculated based on the average distance between consecutive GPS coordinates within the 10 min time window divided by their corresponding time spans
Hourly activeness | Proportion of time average acceleration in 5 min windows within the past hour is beyond 0.2
Momentary activeness | Proportion of time average acceleration in 1 min windows within the past 10 min is beyond 0.2
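As a rough illustration of Table 1, the sketch below maps raw quantities to the categorical momentary features. The function names are hypothetical, the handling of values that fall exactly on a cutoff is an assumption, and the unit of the 0.2 acceleration threshold is not stated in the table, so it is passed through unchanged.

```python
TIME_LABELS = ["early morning", "morning", "noon",
               "early afternoon", "late afternoon", "early evening"]


def time_label(hour):
    """Map an hour of day (e.g. 17.5 for 5:30pm) to a Table 1 time label.

    Returns None outside the 8am-8pm range covered by the table.
    """
    if 8 <= hour < 20:
        return TIME_LABELS[int((hour - 8) // 2)]
    return None


def speed_label(avg_speed_mps):
    """Categorize average speed in m/s with the cutoffs (0.1, 1, 5)."""
    if avg_speed_mps < 0.1:
        return "still"
    if avg_speed_mps < 1.0:
        return "walking"
    if avg_speed_mps < 5.0:
        return "running"
    return "in vehicle"


def activeness(window_means, threshold=0.2):
    """Fraction of windows whose mean acceleration exceeds the threshold.

    Pass twelve 5-min means for hourly activeness, or ten 1-min means for
    momentary activeness.
    """
    if not window_means:
        return 0.0
    return sum(m > threshold for m in window_means) / len(window_means)


# 5:26pm falls in the late-afternoon bin of Table 1.
print(time_label(17 + 26 / 60), speed_label(0.05), activeness([0.1, 0.3, 0.25, 0.0]))
```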
Based on this rationale, we propose a two-level user model to characterize users’ states, with the low level being momentary features and the high level being the routine state. Based on the data available to us for the current study, we use time, location, speed, hourly activeness level, and momentary activeness level as the momentary features. Table 1 defines these momentary state features. To obtain the high-level routine state, we propose an algorithm called k-routine mining, which is inspired by the association rule mining algorithm. Before introducing the k-routine mining algorithm, we first introduce its two basic concepts: life-block and k-routine. A life-block is the basic information unit that describes the whereabouts and activities of a user at a given time. A life-block generally consists of time, location, physical activity based on speed, duration, and any other available contexts that can be extracted from passive sensing data and other mobile phone usage logs. K life-blocks form a k-routine, which is analogous to a k-itemset in the classic association rule mining algorithm [1]. Without loss of generality, we denote a life-block as (t, loc, act, d) using time (t), location (loc), physical activity based on speed (act), and duration (d) in our examples below. Figure 3 shows an example of a 3-routine with three life-blocks. Note that the life-blocks that form a k-routine need not be adjacent in time, as long as they do not overlap and are ordered in time. We assign each learned unique routine a unique code for reference. After being mapped to the momentary state, the routine state is represented by this assigned code, making the routine state feature categorical.
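The life-block tuple (t, loc, act, d) and the ordering constraint on k-routines can be written down as a small data structure. The sketch below is illustrative only: the class and function names are assumptions, times are expressed in minutes since midnight for convenience, and the three example blocks are invented rather than taken from Fig. 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifeBlock:
    t: int      # arrival time, minutes since midnight
    loc: str    # place label
    act: str    # speed-based activity label
    d: int      # duration in minutes

    @property
    def end(self):
        return self.t + self.d


def is_valid_routine(blocks):
    """True if the life-blocks are ordered in time and do not overlap.

    The blocks need not be adjacent: gaps between them are allowed.
    """
    return all(a.end <= b.t for a, b in zip(blocks, blocks[1:]))


# A hypothetical 3-routine: home, then class, then the gym at 5pm for 2 h.
home = LifeBlock(t=7 * 60, loc="home", act="still", d=60)
lecture = LifeBlock(t=9 * 60, loc="classroom", act="still", d=90)
gym = LifeBlock(t=17 * 60, loc="gym", act="running", d=120)
routine = (home, lecture, gym)
print(len(routine), is_valid_routine(routine))
```

Because `LifeBlock` is frozen, instances are hashable, so a tuple of life-blocks can serve directly as a dictionary key when routines are assigned codes.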
Fig. 3. Illustration of an example 3-routine.
5 K-Routine Mining Algorithm
In this section, we provide the details on how we generate these daily k-routines, and map them to the momentary state to provide the high-level state for our two-level user model.
5.1 Mining K-Routines
The process of constructing life-blocks is similar to that of extracting the momentary state features. First, we process the incoming data in ten-minute segments, and extract the location and speed from each segment. We choose ten-minute segments because they are reasonably long to provide sufficient data for feature extraction, while short enough to capture people’s fine-grained status. In cases where the user has been in more than one location or speed category within one segment, we adopt the place or speed category with the most data points. If the user is in transition from one place to another, the place label is denoted as ‘in-transition’. Consecutive segments in which the user has the same place and speed values are concatenated into a life-block, with the time being the arrival time at the place and the duration being the number of segments multiplied by 10 min. With this procedure, an entire day of mobile sensing data is converted into a trajectory of life-blocks. Focusing on the daily level, we treat life-blocks as items, and all life-blocks within a day as an ordered transaction, in analogy to items and transactions in the classic association rule mining algorithm. However, we cannot directly apply the frequent itemset generation algorithms of existing association rule mining methods to obtain k-routines, due to two key differences. The first difference lies in the temporal order of life-blocks within a day: life-blocks are sorted by time to form k-routines. The second difference is that data become available incrementally, in an online fashion. Data are made available throughout each day, and the algorithm processes the data at 10-min increments to generate daily life-block sequences. At the same time, whenever a new life-block is constructed, the k-routine database is updated to reflect the changes. The k-routine mining algorithm is given in Algorithm 1.
K-routines and Places are the accumulated learned k-routines and visited unique places up to time t, respectively. LBs are the life-blocks of the same day up to time t, and plb is the pending life-block that is being generated and maintained at time t.
Algorithm 1. K-Routine Mining Algorithm.
Input: K-routines, Places, LBs, plb, GPSs, t.
Output: K-routines, Places, LBs, plb, t.
act = extractAct(GPSs)
Places.update(GPSs)
loc = extractLoc(GPSs)
if plb.loc == loc and plb.act == act then
    plb.update()
else
    LBs.append(plb)
    K-routines.update(LBs)
    Clear plb.
    plb = (t, loc, act, 10 mins)
end if
if t + 10 mins remains in the same day then
    t = t + 10 mins
else
    t = t + 10 mins
    Clear LBs and plb.
end if
return K-routines, Places, LBs, plb, t.
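One way to read the core of Algorithm 1 is sketched below in Python. It is a minimal interpretation under stated assumptions, not the authors' implementation: the location and activity of each segment are taken as already extracted, the place database and the end-of-day reset are left out, k-routines are stored as exact tuples with occurrence counts, and the cap `max_k` on the routine order is a hypothetical parameter.

```python
from itertools import combinations

SEGMENT_MIN = 10  # segment length in minutes


def process_segment(routines, day_blocks, pending, t, loc, act, max_k=3):
    """Process one 10-minute segment and return the new pending life-block.

    routines:   dict mapping a k-routine (tuple of life-blocks) to a count
    day_blocks: list of completed life-blocks of the current day (mutated)
    pending:    life-block under construction as (t, loc, act, d), or None
    """
    if pending is not None and pending[1] == loc and pending[2] == act:
        # Same place and activity: extend the pending life-block.
        return (pending[0], loc, act, pending[3] + SEGMENT_MIN)
    if pending is not None:
        earlier = list(day_blocks)
        day_blocks.append(pending)
        # Every new k-routine ends with the life-block that was just closed
        # and is preceded by k - 1 earlier blocks, kept in time order.
        for k in range(1, max_k + 1):
            for prefix in combinations(earlier, k - 1):
                routine = prefix + (pending,)
                routines[routine] = routines.get(routine, 0) + 1
    return (t, loc, act, SEGMENT_MIN)


routines, day_blocks, pending = {}, [], None
segments = [(540, "home", "still"), (550, "home", "still"),
            (560, "gym", "running"), (570, "cafe", "still"),
            (580, "office", "still"), (590, "home", "still")]
for t, loc, act in segments:
    pending = process_segment(routines, day_blocks, pending, t, loc, act)
print(len(day_blocks), len(routines))
```

`itertools.combinations` yields tuples in input order, so the temporal order of life-blocks inside each k-routine is preserved without extra sorting.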
GPSs are the newly available GPS points in a ten-minute segment starting at time t. Algorithm 1 is an online algorithm that is called repeatedly every 10 min. The number of life-blocks on each day depends on the number of context features used to define them and the number of unique values of each context feature. However, due to the variation in arrival times, the uncertainty in visited places (i.e., new places are visited over time), and the duration of stay at each place, we cannot reliably estimate the per-day computational complexity. Assuming a day has $K$ life-blocks, without limiting the order of k-routines, this will result in $2^K - 1$ k-routines with $k = 1, 2, \ldots, K$. If we limit $k$ to be at most $\hat{k}$, then the total number of unique k-routines on the day will be $\sum_{i=1}^{\hat{k}} C_K^i$, where $C_K^i$ is the number of combinations of choosing $i$ life-blocks from $K$ life-blocks. For example, if we limit $k$ to be 3, then we will have $C_K^1 + C_K^2 + C_K^3$ unique k-routines.
5.2 Merging K-Routines
After obtaining these unique k-routines on a new day, we need to merge them with those learned in the past days if they are similar to each other. We define similarity using the following rules:
1. A $k_1$-routine and a $k_2$-routine are similar only if $k_1 = k_2$.
2. If condition 1 is met, the $k_1$-routine and the $k_2$-routine are similar only if each pair of life-blocks with the same order is similar.
3. Two life-blocks are similar if their place and speed (or activity) are the same, and their arrival time and visiting duration are similar.
4. Let $(t, d)$ denote the values of arrival time and visiting duration. $(t_1, d_1)$ and $(t_2, d_2)$ are similar if the Euclidean distance between them is smaller than a chosen threshold.
5.3 Mapping K-Routines
At each mEMA trigger decision moment t, we map the learned k-routines to the momentary state to obtain the high-level routine state. We take the following steps:
1. Let t.arrival and d denote the arrival time and stay duration of the last life-block in a k-routine. Existing k-routines are filtered out if t does not fall in [t.arrival, t.arrival + d].
2. The remaining k-routines are filtered out if the momentary location and activity are not the same as those of their last life-block.
3. For k-routines with k > 1, we apply the same similarity procedure used for merging newly mined k-routines with existing ones to all life-blocks other than the last, comparing them against the life-blocks of the day that precede t. We choose the longest k-routine that survives the above filtering conditions as the routine state associated with moment t.
4. When no k-routines survive the above tests, we assign ‘new routine’ as the routine state.
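The similarity rules of Sect. 5.2 and the mapping steps above can be sketched together as follows. This is one possible reading rather than the authors' code: life-blocks are plain (t, loc, act, d) tuples with times in minutes, the 30-minute threshold is an arbitrary placeholder, and step 3 is interpreted as requiring the earlier life-blocks of a routine to match today's completed life-blocks as an ordered subsequence.

```python
import math


def blocks_similar(a, b, threshold=30.0):
    """Rules 3-4: same place and activity, and (arrival, duration) close."""
    return (a[1] == b[1] and a[2] == b[2]
            and math.hypot(a[0] - b[0], a[3] - b[3]) < threshold)


def routines_similar(r1, r2, threshold=30.0):
    """Rules 1-2: equal order k and pairwise similar life-blocks."""
    return (len(r1) == len(r2)
            and all(blocks_similar(a, b, threshold) for a, b in zip(r1, r2)))


def prefix_matches(prefix, today_blocks, threshold=30.0):
    """True if each prefix block has a similar block today, in time order."""
    remaining = iter(today_blocks)
    return all(any(blocks_similar(p, b, threshold) for b in remaining)
               for p in prefix)


def map_routine(routines, t, loc, act, today_blocks, threshold=30.0):
    """Return the longest stored k-routine matching moment t, if any."""
    best = None
    for routine in routines:
        last = routine[-1]
        if not last[0] <= t <= last[0] + last[3]:
            continue  # step 1: t outside the last life-block's time span
        if last[1] != loc or last[2] != act:
            continue  # step 2: momentary location or activity differs
        if not prefix_matches(routine[:-1], today_blocks, threshold):
            continue  # step 3: earlier life-blocks not observed today
        if best is None or len(routine) > len(best):
            best = routine
    return best if best is not None else "new routine"  # step 4


home = (420, "home", "still", 60)
gym = (1020, "gym", "running", 120)
stored = [(gym,), (home, gym)]
today = [(425, "home", "still", 55)]
print(map_routine(stored, 1046, "gym", "running", today))
```

At 5:26pm (minute 1046) in the gym, both stored routines pass steps 1 and 2, and the 2-routine is preferred because its first life-block is similar to the life-block already observed that morning.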
6 Designing Adaptive mEMA Method Using RL
In this section, we provide a concrete design of the various components required in an RL framework to implement adaptive mEMAs.
6.1 RL Algorithm
We propose an RL algorithm (see Algorithm 2) based on the Q-learning algorithm, which is an off-policy temporal difference (TD) RL algorithm. It can be combined with eligibility traces, most exploration strategies (e.g., ε-greedy), and planning (e.g., Dyna-Q) to enhance learning speed and sample efficiency. It can also be easily generalized to continuous state spaces using function approximation. Specifically, we denote the state space by S, the action space by A, the initial exploration rate by $\epsilon_0$, and the step-size (learning rate) parameter, the discount rate, and the eligibility trace decay parameter by α, γ, and λ, respectively. We face three major challenges in designing our proposed algorithm. First, when we decide not to trigger an mEMA, we do not have any feedback on response compliance. We handle this by designing a reward signal that discounts the associated coefficients or adds to them a proportion of their magnitude (see Eq. 1). Second, because no feedback is available when the action is ‘not trigger’, an immediate update of the policy is also not possible. Lines 32 to 36 in Algorithm 2 are designed to tackle this challenge at the end of each episode. Lastly, real mEMA data are scarce and expensive to collect over the long term, which leads to limited samples for learning an optimized policy. We therefore need to improve sample and learning efficiency with limited data. In the current work, we experiment with a simple strategy called Dyna-Q [31] to improve sample efficiency.
6.2 State Space for Adaptive mEMA
We propose several state feature sets: the momentary state features described in Table 1, a first-order routine feature, a second-order routine feature in two different encoding schemes, and a motivation feature using rolling compliance based on response data from the past three days. To compare the marginal effectiveness of each feature set, we combine them incrementally to create five RL strategies with different state spaces: 1) RL with Momentary state features (RL-M); 2) RL with Momentary and First order routine state features (RL-MF); 3) RL with Momentary and Second order routine state features (RL-MS); 4) RL with Momentary and a more Compactly encoded Second order routine state features (RL-MCS); 5) RL with Momentary and a more Compactly encoded Second order routine state features with Motivation (RL-MCS-M). The difference between the compact and non-compact second order routine representations lies in how k-routines are encoded. In the non-compact encoding, a k-routine is represented by its routine ID, while in the compact representation, a k-routine is represented by the routine IDs of all the 1-routines that form it. The more compact encoding is potentially more efficient in learning due to partial overlaps among different k-routines.
6.3 Action Space for Adaptive mEMA
The action space in adaptive mEMA can include only two actions, ‘trigger’ and ‘not trigger’ the EMAs, or more than two actions that expand the ‘trigger’ action into ‘trigger’ with different modalities such as sound, vibration, and flash lights. In this study, we consider only two actions: ‘trigger’ and ‘not trigger’.
6.4 Reward Signal for Adaptive mEMA
We design the reward signal in the following way. When we trigger an mEMA, the reward takes a binary value: if the EMA is responded to, it receives a positive value of 1; if the EMA is not responded to, it receives a negative value of −1. When we do not trigger an mEMA, the reward signal is more involved because we do not directly receive the feedback that we would have received had we triggered it. To address this challenge, we estimate whether the ‘not trigger’ decision is beneficial at the end of each episode based on how many responded mEMAs we have received for the day. If all triggered mEMAs are responded to, we want to reinforce these decisions in their associated states. In contrast, if we end up having fewer responded mEMAs than triggered ones, we want to weaken these decisions in their associated states. Let $s_i^{nt}$, $i = 1, \ldots, m$, denote the states associated with the ‘not trigger’ actions on a given day, and $w_i^{nt}$, $i = 1, \ldots, m$, denote the associated coefficients. We simply reinforce or weaken the coefficients associated with each state feature in $s_i^{nt}$ by $\beta|w_i^t|$, a proportion of the weight coefficients corresponding to the ‘trigger’ action. The overall reward function is given below:
$$
r_t = \begin{cases}
1, & a_t = \text{trigger and task = completed} \\
-1, & a_t = \text{trigger and task = not completed} \\
\beta|w_i^t|, & a_t = \text{not trigger and all tasks are completed} \\
-\beta|w_i^t|, & a_t = \text{not trigger and not all tasks are completed}
\end{cases} \tag{1}
$$
In Algorithm 2, Lines 15 to 20 keep track of all required components for updating the ‘not trigger’ action value function at the end of the episode, and Lines 32 to 36 update the ‘not trigger’ action value function after the episode ends.
Algorithm 2. Adaptive mEMA using Q-Learning with Linear Approximation and Decaying Exploration.
Input: $S$, $A$, $\gamma$, $\lambda$, $\alpha$, $\epsilon_0$, $d$.
Output: $w^a$, $a \in A$.
1: Initialize $w^a$ and $e^a$ for each $a \in A$
2: Set $S^{nt}$, $E^{nt}$, $W^{nt}$, $W^{i+1}$, $S^{i+1}$, $A^{i+1}$ to be $\emptyset$.
3: for all $t = 1, 2, \ldots$ until termination within an episode do
4:   Observe $s_t$.
5:   if $s_t$ is not terminal state then
6:     Take $a_t \sim \epsilon$-greedy with $\arg\max_{a \in A} \Phi(s_t, a)^\top w^a$.
7:     Transition to $s_{t+1}$, and take $a_{t+1} \sim \epsilon$-greedy with $\arg\max_{a \in A} \Phi(s_{t+1}, a)^\top w^a$.
8:     $e^{a_t} = e^{a_t} + \Phi(s_t, a_t)$
9:     if $a_t \neq$ Not Trigger then
10:      $\delta_t = r_t + \gamma \Phi(s_{t+1}, a_{t+1})^\top w^{a_{t+1}} - \Phi(s_t, a_t)^\top w^{a_t}$
11:      for all $a \in A$ do
12:        $w^a \leftarrow w^a + \alpha \delta_t e^a$
13:        $e^a \leftarrow \gamma \lambda e^a$
14:      end for
15:    else
16:      Append $s_t$ to $S^i$, $e^{nt}$ to $E^i$, $w^{nt}$ to $W^i$, $w^{a_{t+1}}$ to $W^{i+1}$, $s_{t+1}$ to $S^{i+1}$, and $a_{t+1}$ to $A^{i+1}$.
17:      for all $a \in A$ do
18:        $e^a \leftarrow \gamma \lambda e^a$
19:      end for
20:    end if
21:  else
22:    Take $a_t \sim \epsilon$-greedy with $\arg\max_{a \in A \setminus nt} \Phi(s_t, a)^\top w^a$.
23:    Observe $r_t$, transition to $s_{t+1}$.
24:    Take $a_{t+1} \sim \epsilon$-greedy with $\arg\max_{a \in A} \Phi(s_{t+1}, a)^\top w^a$.
25:    $e^{a_t} = e^{a_t} + \Phi(s_t, a_t)$
26:    $\delta_t = r_t + \gamma \Phi(s_{t+1}, a_{t+1})^\top w^{a_{t+1}} - \Phi(s_t, a_t)^\top w^{a_t}$
27:    for all $a \in A$ do
28:      $w^a \leftarrow w^a + \alpha \delta_t e^a$
29:      $e^a \leftarrow \gamma \lambda e^a$
30:    end for
31:  end if
32:  for $j \in$ range($|S^i|$) do
33:    Set $s_j = S^i[j]$, $a_j = nt$, $s_{j+1} = S^{i+1}[j]$, $a_{j+1} = A^{i+1}[j]$, $w^{a_j} = W^i[j]$, $w^{a_{j+1}} = W^{i+1}[j]$, $e^{a_j} = E^i[j]$.
34:    $\delta_j = r_j + \gamma \Phi(s_{j+1}, a_{j+1})^\top w^{a_{j+1}} - \Phi(s_j, a_j)^\top w^{a_j}$
35:    $w^{a_j} \leftarrow w^{a_j} + \alpha \delta_j e^{a_j}$
36:  end for
37:  if $\epsilon d < 0.1$ then
38:    $\epsilon \leftarrow 0.1$
39:  else
40:    $\epsilon \leftarrow \epsilon d$
41:  end if
42: end for
43: return $w^a$, for each $a \in A$
6.5 Experience Replay for Sample Efficiency Using Dyna-Q
Because EMA studies usually last only a few weeks, we may not have sufficient data to train RL policies that can effectively guide mEMA delivery. To address this challenge, we apply an RL framework called Dyna-Q [31], which integrates planning with learning. In Dyna-Q, an environmental model is created and applied to generate simulated samples for policy updates. This environmental model does not need to be a full model of the environment, but requires only a sample model [31]. We simply adopt a bootstrapping sampler, in which all past episodes including the current one are randomly drawn and replayed to update the policy. In our implementation, we replay ten randomly drawn episodes, including the current day, at the end of each day to improve sample efficiency. We combine all available state features with Dyna-Q to form a sixth RL strategy (RL-MCS-M-D).
6.6 Performance Evaluation
We measure mEMA compliance using the following metrics:
– Daily Compliance (DC). DC is the number of responded triggered mEMAs divided by the number of all triggered mEMAs on each day.
– Time Constrained Daily Compliance (TCDC). TCDC is the number of mEMAs that are responded to within a 10 min window divided by the number of all triggered mEMAs on each day.
– Overall Compliance (OC). OC is the final compliance, calculated as the number of all responded triggered mEMAs divided by the number of all triggered mEMAs.
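The three metrics can be computed from a simple log of triggered mEMAs, as in the sketch below. The record layout (day, response delay in minutes, with `None` for no response) and the function name are assumptions for illustration, as is measuring the 10 min window from the trigger time.

```python
from collections import defaultdict


def compliance_metrics(records, window_min=10.0):
    """Compute per-day DC, per-day TCDC, and OC from triggered-mEMA records.

    records: iterable of (day, delay_min) pairs, one per triggered mEMA;
             delay_min is None when the mEMA was not responded to.
    """
    triggered = defaultdict(int)
    responded = defaultdict(int)
    timely = defaultdict(int)
    for day, delay in records:
        triggered[day] += 1
        if delay is not None:
            responded[day] += 1
            if delay <= window_min:
                timely[day] += 1
    dc = {day: responded[day] / n for day, n in triggered.items()}
    tcdc = {day: timely[day] / n for day, n in triggered.items()}
    total = sum(triggered.values())
    oc = sum(responded.values()) / total if total else 0.0
    return dc, tcdc, oc


log = [(1, 3.0), (1, None), (1, 25.0), (2, 1.0), (2, 8.0)]
dc, tcdc, oc = compliance_metrics(log)
print(dc, tcdc, oc)
```

In this toy log, day 1 has 2 of 3 mEMAs answered but only 1 within 10 minutes, which shows how TCDC can fall well below DC on the same day.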
These metrics are not mutually exclusive. Specifically, the overall compliance reflects the ultimate compliance rate, while ignoring the daily differences. However, it is also important to maintain acceptable daily compliance level as the data can be more representative across time during the data collection. In some application scenarios, when the mEMAs are time sensitive, the time-constrained daily compliance is more critical.
Fig. 4. Study data and EMA statistics: (a) Distribution of number of days in the study for each participant. (b) Distribution of number of EMAs being delivered to each participant. (c) Distribution of actual overall compliance in the triggered EMAs of each participant.
7 Experiments
7.1 Data
To test the feasibility of our proposed adaptive mEMA framework using RL, we use real mEMA data from a mobile sensing study that aimed to understand students’ emotions and social anxiety over a two-week window [4]. Data in this study include accelerometer, GPS, communication (e.g., text messages and phone calls), and mobile EMA data from 220 college students using the Sensus mobile application [38]. In particular, accelerometer data were collected at 1 Hz for up to two weeks and GPS coordinates every two and a half minutes. Six random-time mobile EMAs were delivered in six two-hour windows from 9 am to 9 pm every day. Figure 4 summarizes this mEMA data.
7.2 Baseline Methods
We use two baseline strategies as comparisons to measure the performance of our proposed adaptive mEMA approaches. The first baseline method is a random strategy that randomly selects 3 out of the six 2-h windows each day for EMA delivery. The second baseline method creates a supervised model with all cumulative data available up to the prior day, and applies this model to the mEMA trigger decision at each decision moment. At the end of each day, this model is retrained with all available data and deployed for the next day. We apply XGBoost, which is a boosting algorithm that can gracefully handle missing data. The setting for the second baseline method is the same as for the proposed RL methods. We use the same context features learned from our two-level user model approach, including both the momentary and routine features.
Fig. 5. Performance by strategies. (a) Average overall compliance; (b) Average daily compliance; (c) Average time-constrained daily compliance across 6 RL strategies and 2 baseline strategies.
7.3 Experimental Settings and Research Questions
Four parameters in the proposed RL algorithm are the initial exploration rate $\epsilon_0$, the step-size (or learning rate) α, the discount rate γ, and the eligibility trace-decay parameter λ. All four parameters fall within the range of 0 to 1. In addition, the reward signal has a discounting parameter β associated with the ‘not trigger’ action. Instead of tuning these parameters, for every participant in each method, we randomly choose values for them from the following options: 1) α ∈ {0.01, 0.05, 0.1}; 2) γ ∈ {0.05, 0.1, 0.2}; 3) λ ∈ {0.05, 0.1, 0.2, 0.5, 0.8}; 4) $\epsilon_0$ ∈ {0.1, 0.2, 0.5}; and 5) β ∈ {0.05, 0.1, 0.15, 0.2}. The exploration decaying rate is fixed at d = 0.8. We try to encompass a reasonable range of values for each parameter to be more conservative in our final performance comparisons against the baseline methods. With the above settings, we want to find answers to the following research questions: 1) How does the design of the state features impact the performance on various compliance metrics? 2) How is the performance of the proposed RL methods compared to the baseline methods?
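The random parameter selection and the exploration decay can be sketched as follows. The option lists and d = 0.8 come from the text above and the 0.1 floor from Algorithm 2; the dictionary layout, the function names, and the use of a seeded generator per participant are assumptions made for this example.

```python
import random

PARAMETER_OPTIONS = {
    "alpha": [0.01, 0.05, 0.1],           # step size
    "gamma": [0.05, 0.1, 0.2],            # discount rate
    "lam": [0.05, 0.1, 0.2, 0.5, 0.8],    # eligibility trace decay
    "epsilon0": [0.1, 0.2, 0.5],          # initial exploration rate
    "beta": [0.05, 0.1, 0.15, 0.2],       # 'not trigger' reward proportion
}
DECAY_RATE = 0.8
EPSILON_FLOOR = 0.1


def sample_parameters(rng):
    """Pick one value per parameter uniformly at random from its options."""
    return {name: rng.choice(options)
            for name, options in PARAMETER_OPTIONS.items()}


def decay_epsilon(epsilon, d=DECAY_RATE, floor=EPSILON_FLOOR):
    """Multiply epsilon by d, but never let it drop below the floor."""
    return max(floor, epsilon * d)


params = sample_parameters(random.Random(42))
epsilon = params["epsilon0"]
schedule = []
for _ in range(5):
    schedule.append(epsilon)
    epsilon = decay_epsilon(epsilon)
print(params, schedule)
```

Note that with an initial rate of 0.1 the exploration rate starts at the floor and stays there, so only the larger initial values actually decay.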
Table 2. Momentary state features on the selected active sensing times. The cutoffs for the low, median, and high levels in 1) Number of days in study are 7 and 14 days; 2) Total number of triggered EMAs are 30 and 60; 3) Average daily EMAs are 2 and 3.
Segment | Strategy | Number of days OC | Number of days DC | Number of days TCDC | Number of EMAs OC | Number of EMAs DC | Number of EMAs TCDC | Average daily EMAs OC | Average daily EMAs DC | Average daily EMAs TCDC
Low | RL-M | 0.796 | 0.793 | 0.686 | 0.802 | 0.803 | 0.712 | 0.803 | 0.803 | 0.723
Low | RL-MF | 0.795 | 0.798 | 0.692 | 0.805 | 0.806 | 0.714 | 0.816 | 0.813 | 0.728
Low | RL-MS | 0.807 | 0.800 | 0.701 | 0.817 | 0.813 | 0.725 | 0.833 | 0.826 | 0.742
Low | RL-MCS | 0.803 | 0.801 | 0.703 | 0.809 | 0.808 | 0.720 | 0.824 | 0.819 | 0.736
Low | RL-MCS-M | 0.766 | 0.769 | 0.676 | 0.791 | 0.794 | 0.710 | 0.792 | 0.792 | 0.719
Low | RL-MCS-M-D | 0.800 | 0.795 | 0.687 | 0.802 | 0.798 | 0.708 | 0.801 | 0.792 | 0.711
Low | Random | 0.642 | 0.651 | 0.584 | 0.640 | 0.649 | 0.585 | 0.636 | 0.642 | 0.595
Low | Supervised | 0.728 | 0.730 | 0.647 | 0.770 | 0.773 | 0.694 | 0.766 | 0.765 | 0.704
Median | RL-M | 0.827 | 0.831 | 0.737 | 0.843 | 0.839 | 0.760 | 0.825 | 0.822 | 0.763
Median | RL-MF | 0.827 | 0.826 | 0.728 | 0.839 | 0.841 | 0.760 | 0.816 | 0.820 | 0.761
Median | RL-MS | 0.837 | 0.837 | 0.739 | 0.841 | 0.842 | 0.760 | 0.824 | 0.825 | 0.767
Median | RL-MCS | 0.828 | 0.828 | 0.731 | 0.831 | 0.830 | 0.750 | 0.810 | 0.812 | 0.753
Median | RL-MCS-M | 0.831 | 0.833 | 0.734 | 0.837 | 0.838 | 0.754 | 0.816 | 0.819 | 0.760
Median | RL-MCS-M-D | 0.823 | 0.822 | 0.727 | 0.844 | 0.845 | 0.763 | 0.822 | 0.825 | 0.764
Median | Random | 0.655 | 0.660 | 0.577 | 0.646 | 0.640 | 0.573 | 0.637 | 0.640 | 0.588
Median | Supervised | 0.836 | 0.839 | 0.743 | 0.846 | 0.849 | 0.767 | 0.818 | 0.825 | 0.762
High | RL-M | 0.746 | 0.739 | 0.665 | 0.657 | 0.652 | 0.540 | 0.744 | 0.742 | 0.614
High | RL-MF | 0.750 | 0.749 | 0.669 | 0.662 | 0.658 | 0.537 | 0.746 | 0.745 | 0.612
High | RL-MS | 0.752 | 0.752 | 0.671 | 0.670 | 0.667 | 0.545 | 0.746 | 0.744 | 0.616
High | RL-MCS | 0.740 | 0.737 | 0.657 | 0.660 | 0.657 | 0.538 | 0.741 | 0.740 | 0.615
High | RL-MCS-M | 0.745 | 0.743 | 0.661 | 0.654 | 0.651 | 0.531 | 0.739 | 0.738 | 0.606
High | RL-MCS-M-D | 0.746 | 0.744 | 0.667 | 0.655 | 0.651 | 0.537 | 0.747 | 0.745 | 0.617
High | Random | 0.594 | 0.587 | 0.515 | 0.567 | 0.562 | 0.449 | 0.616 | 0.613 | 0.498
High | Supervised | 0.750 | 0.751 | 0.666 | 0.658 | 0.656 | 0.531 | 0.737 | 0.736 | 0.605
8 Results
8.1 Comparisons Within RL Strategies
Figure 5 shows the average overall compliance, daily compliance, and time-constraint daily compliance across the 6 RL strategies with different state features and the Dyna-Q method. Since all RL strategies use momentary state features, we do not mention them unless necessary. The RL strategy without any k-routine state feature performs the same as the one with the 1-routine state feature, but the strategy with 2-routines outperforms both. Using the compact representation yields no improvement, likely because only 2-routines are used in our experiments. Adding the motivation feature does not improve performance either, and neither does the Dyna-Q framework. Note that the ordering of the RL strategies is almost the same in all three metrics. The RL strategy with momentary and 2-routine state features outperforms all other strategies by a small margin.
L. Cai et al.
8.2 Comparisons Between RL Strategies and Baseline Methods
From Fig. 5, we can see that all RL strategies perform as well as or better than the two baseline methods in all three performance metrics. In particular, the best RL strategy attains an average overall compliance of 0.80, an average daily compliance of 0.80, and an average time-constraint daily compliance of 0.70, compared with 0.77, 0.77, and 0.69 for the supervised method. To put the improvement over the supervised approach in perspective, consider a 2-week study with 3 mEMAs daily for 220 participants: these participants receive 220 × 14 × 3 = 9240 surveys in total, so a 3% improvement in overall compliance translates into about 277 additional surveys answered during the study window. Given that our current dataset had many missing triggers due to technical issues, we expect to see much higher compliance improvements in future real-world deployments.
8.3 Performance by Data Segments
The performance of the proposed RL strategies in adaptive mEMAs depends heavily on constraints in the real data we used. To better understand these factors, we examine the performance of all strategies in different segments of the data based on number of days in study, total number of triggered EMAs, and average number of daily triggered EMAs (see Table 2). Segments by Number of Days in Study. In the low (

0: element = x break

sum
  nonidiomatic:
      sum = 0
      for loop_ind in range(len(arr)):
          if arr[loop_ind] > 0:
              sum = sum + arr[loop_ind]
  idiomatic:
      sum = sum(a for a in arr if a > 0)

all
  nonidiomatic:
      all = arr[0] > 0
      loop_ind = 0
      while loop_ind < len(arr)-1 and all:
          if not arr[loop_ind+1] > 0:
              all = False
          loop_ind = loop_ind + 1
  idiomatic:
      all = all(a > 0 for a in arr)

any
  nonidiomatic:
      any = arr[0] > 0
      loop_ind = 0
      while loop_ind < len(arr[1:]) and not any:
          loop_ind += 1
          any = arr[loop_ind] > 0
  idiomatic:
      any = any(a > 0 for a in arr)

count
  idiomatic:
      count = sum(1 for a in arr if a > 0)

3.1 Formal Approach
Detecting and Fixing Nonidiomatic Snippets in Python
In this section, we formalize the process of finding and correcting nonidiomatic code snippets. The procedure of refactoring idioms is represented by the function $\mathrm{REFACTOR} : Ch^* \to Ch^*$. This function expects source code to be analyzed and fixed, and outputs the corrected version of the original fragment. Figure 1 shows a visual overview of this function and its parts. For easier understanding, we first consider exactly one snippet to be substituted per source code. The definition of REFACTOR is supported by the function $\mathrm{SUBSTITUTE} : Ch^* \times (\mathbb{N} \times \mathbb{N}) \times Ch^* \to Ch^*$. As its arguments it expects the source code to be analyzed, an index pair that represents the location of the nonidiomatic snippet, and a generated alternative of the snippet. Using these arguments, the snippet is replaced in the original code at the given location with the given alternative. The fixed source code is the return value of SUBSTITUTE. Using this function, REFACTOR is defined as:

$$\mathrm{REFACTOR}(SC) := \mathrm{SUBSTITUTE}(SC, \mathrm{LOCATE}(SC), \mathrm{GEN\_IDIOM}(\mathrm{SNIPPET\_TYPE}(\mathrm{SNIPPET}(SC)), \mathrm{VARIABLES}(\mathrm{SNIPPET}(SC)), \mathrm{FEATURES}(\mathrm{SNIPPET}(SC), \mathrm{SNIPPET\_TYPE}(\mathrm{SNIPPET}(SC)), \mathrm{VARIABLES}(\mathrm{SNIPPET}(SC))))),$$
where $\mathrm{LOCATE} : Ch^* \to \mathbb{N} \times \mathbb{N}$ is the function that returns the location of the snippet, represented by an index pair, in the full-length source code, while $\mathrm{SNIPPET} : Ch^* \to Ch^*$ returns the snippet itself. LOCATE is used in the substitution process, whereas SNIPPET provides the input for the procedure of generating the improved code. These functions are implemented as recurrent neural networks, detailed in Subsect. 3.2.1.

A substitute for a nonidiomatic snippet is composed of the following components: a frame, such as “target = sum(1 for elem in list if condition)”; a dictionary that maps kinds of variables to object names, such as {target var → “cnt”, list var → “arr”, loop index → “i”}; and some additional features, for example the condition arr[i] > 0. Given these examples, the following idiomatic pattern would be generated: cnt = sum(1 for elem in arr if elem > 0). Consequently, the function GEN_IDIOM expects three parameters: the first is the type of frame, the second is the dictionary of variables, and the third contains the additional features (such as the condition). These three parameters are constructed by the following functions, respectively: $\mathrm{SNIPPET\_TYPE} : Ch^* \to [0..5]$ returns the index of the frame (0: count, 1: max, 2: search, 3: sum, 4: all, 5: any), $\mathrm{VARIABLES} : Ch^* \to (Type, Ch^*)^*$ returns the dictionary, and $\mathrm{FEATURES} : Ch^* \times [0..5] \times (Type, Ch^*)^* \to Ch^* \times Ch^* \times \{0, 1\} \times \{0, 1\}$ gives the extra features. SNIPPET_TYPE and VARIABLES are implemented by different neural networks; their definitions can be found in Subsects. 3.2.2 and 3.2.3. Given the snippet, the type, and the identifier names, a set of four features can be determined:
– the condition (if the snippet contains any)
– if the snippet iterates over one row/column in a matrix/tuple, the indices being used
– if the type of snippet is max, whether the minimum or maximum value is being calculated
– if the type of snippet is max, whether the theoretical indexing starts from one instead of zero.
These features are crucial for generation. Determining each of them separately is fairly trivial given the snippet with its type and the variables, so they can be implemented easily using explicit programming. Utilizing these functions, we define the function which returns the generated alternative: $\mathrm{GEN\_IDIOM} : [0..5] \times (Type, Ch^*)^* \times (Ch^* \times Ch^* \times \{0, 1\} \times \{0, 1\}) \to Ch^*$. GEN_IDIOM is used in the following way in REFACTOR:

$$\mathrm{GEN\_IDIOM}(\mathrm{SNIPPET\_TYPE}(\mathrm{SNIPPET}(SC)), \mathrm{VARIABLES}(\mathrm{SNIPPET}(SC)), \mathrm{FEATURES}(\mathrm{SNIPPET}(SC), \mathrm{SNIPPET\_TYPE}(\mathrm{SNIPPET}(SC)), \mathrm{VARIABLES}(\mathrm{SNIPPET}(SC))))$$

B. Szalontai et al.

Input:
    fnd = False
    arr = list(range(0,10))
    cnt = 0
    for i in range(0,len(arr[j])):
        if arr[j][i] > 0:
            cnt += 1
    print(arr[j])
    while fnd:
        for elem in arr[0]:
SNIPPET (found snippet):
    cnt = 0
    for i in range(0,len(arr[j])):
        if arr[j][i] > 0:
            cnt += 1
LOCATE (location of snippet): (2, 5)
SNIPPET_TYPE (type of pattern): count (0)
VARIABLES (variables): target: cnt; list: arr; index: i
FEATURES (extra information): condition: arr[i]>0; matrix: second idx is main; max or min: no; maxindex shift: no
GEN_IDIOM (generated alternative):
    cnt = sum(1 for elem in arr[j] if elem > 0)
SUBSTITUTE (output):
    fnd = False
    arr = list(range(0,10))
    cnt = sum(1 for elem in arr[j] if elem > 0)
    print(arr[j])
    while fnd:
        for elem in arr[0]:
Fig. 1. Visual overview of the refactoring process.
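As a rough illustration of the generation and substitution steps described above, the sketch below fills a frame with extracted names and replaces a located snippet. It is not the authors' implementation: the `FRAMES` table, the dictionary keys, the line-based location pair, and the plain string replacement of `arr[i]` by `elem` are all assumptions made for the example, and the max and search frames are omitted because they depend on the extra features.

```python
# Hypothetical frames keyed by the frame index from the text
# (0: count, 3: sum, 4: all, 5: any).
FRAMES = {
    0: "{target} = sum(1 for elem in {lst} if {cond})",
    3: "{target} = sum(elem for elem in {lst} if {cond})",
    4: "{target} = all({cond} for elem in {lst})",
    5: "{target} = any({cond} for elem in {lst})",
}


def gen_idiom(snippet_type, variables, condition):
    """Fill the frame for `snippet_type` with the extracted names."""
    # Rewrite the indexed access (e.g. "arr[i]") to the loop element.
    indexed = "{}[{}]".format(variables["list"], variables["index"])
    cond = condition.replace(indexed, "elem")
    return FRAMES[snippet_type].format(
        target=variables["target"], lst=variables["list"], cond=cond
    )


def substitute(source_lines, location, alternative):
    """Replace lines first..last (inclusive, 0-based) with one new line."""
    first, last = location
    return source_lines[:first] + [alternative] + source_lines[last + 1:]


source = [
    "fnd = False",
    "arr = list(range(0, 10))",
    "cnt = 0",
    "for i in range(0, len(arr)):",
    "    if arr[i] > 0:",
    "        cnt += 1",
    "print(arr)",
]
variables = {"target": "cnt", "list": "arr", "index": "i"}
idiom = gen_idiom(0, variables, "arr[i] > 0")
fixed = substitute(source, (2, 5), idiom)
print("\n".join(fixed))
```

A real generator would also have to handle the matrix row/column case and the two max-specific flags listed among the features.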
As mentioned before, the functions LOCATE, SNIPPET, SNIPPET_TYPE, and VARIABLES are implemented by neural networks, denoted by $M_1$, $M_2$, and $M_3$. Based on the model predictions (the return values of these functions), we obtain the desired output using the other functions in the definition of REFACTOR. A summary of these functions can be seen in Table 2, and a visual overview in Fig. 1. The above method was explained with one snippet per code, but in fact it extends to multiple snippets by having LOCATE return a sequence of index pairs and SNIPPET a sequence of snippets instead of exactly one index pair and one snippet. Hence, if LOCATE and SNIPPET return a sequence with n elements, GEN_IDIOM is evaluated n times, once for each snippet returned by SNIPPET. Even though the method works for multiple snippets in one program, the training data is built in a simplified way where each fragment contains exactly one nonidiomatic snippet. In spite of this simplification of the dataset, the model learns to generalize well and is able to handle fragments with no snippet or with multiple snippets to be substituted.
3.2 Neural Architectures
As mentioned in the previous section, the three key functions of the method are implemented as neural networks. In the implementation, two different architectures are used: one solves a classification problem ($M_2$), while the other is generally used for sequence tagging tasks ($M_1$, $M_3$). These three neural networks are created to solve the following subtasks:
1. Given the full source code, the nonidiomatic snippets need to be located.
2. Given a nonidiomatic snippet, the type of refactoring is to be determined.
3. The variables that the coder used in the algorithm need to be extracted.

Table 2. A summary table about all the above defined functions including their types, the expected input value and their output.

Name | Type | Expects | Returns
REFACTOR | $Ch^* \to Ch^*$ | Source code to be analyzed | Analyzed and fixed source code
SUBSTITUTE | $Ch^* \times (\mathbb{N} \times \mathbb{N}) \times Ch^* \to Ch^*$ | Source code, location of snippet, alternative for it | Source code with a replacement at the given location
LOCATE | $Ch^* \to \mathbb{N} \times \mathbb{N}$ | Source code to be analyzed | Location of the snippet to be replaced represented by an index pair
SNIPPET | $Ch^* \to Ch^*$ | Source code to be analyzed | The found snippet
GEN_IDIOM | $[0..5] \times (Type, Ch^*)^* \times (Ch^* \times Ch^* \times \{0, 1\} \times \{0, 1\}) \to Ch^*$ | Type of frame, variables (along with their types) to be utilized and some extracted features | Variables inserted into the frame to their correct position and incorporating the features resulting in the replacement snippet
SNIPPET_TYPE | $Ch^* \to [0..5]$ | Found snippet | Type of frame represented by an index
VARIABLES | $Ch^* \to (Type, Ch^*)^*$ | Found snippet | Variables along with their types
FEATURES | $Ch^* \times [0..5] \times (Type, Ch^*)^* \to Ch^* \times Ch^* \times \{0, 1\} \times \{0, 1\}$ | Found snippet, type of snippet, variables (along with their types) | Used condition, matrix indexing info, max or min, maxindex shift

For each model, the input needs to be an index sequence that represents tokenized source code. We incorporate two approaches to tokenizing. The function used for tokenizing and converting an entire source code to an index sequence will be abbreviated as TOKENIZE_CODE, and the one used for tokenizing the found snippets as TOKENIZE_SNIPPET. Both take a $Ch^*$ as the argument and return the tokenized code represented by an index sequence. There are special tokens (e.g. names of variables, objects, classes, functions, etc.) that occur in one code but not in any other. Including only some of these tokens in the tokenization process can lead to problems due to bad representations, as Karampatsis et al. [10] suggest. Since the ability to differentiate tokens is crucial in their work, they use an open-vocabulary neural language model to overcome this problem. We found that it is sufficient to simply unify all of these tokens for our classification and tagging problems. The representation returned by TOKENIZE_CODE substitutes each of these tokens with a special NAME token. When TOKENIZE_SNIPPET tokenizes a localized nonidiomatic snippet, it differentiates special NAME tokens further into one of the following six categories:
– ARR - name of the list variable
– IDX - name of the loop index variable
– PRF - predicate or function
– FLD - field of an object
– VAR - variable of other type
– UNK - indeterminable from the snippet without extra information, or not included in any of the categories defined above.
For basic tokenizing we use the tokenize package from the Python Standard Library, with the modification of splitting up tokens of multiple tabs. The tokenized source code is given as an input to the neural networks. In the following, these neural networks are explained.

    fnd = False                      OOO
    arr = list(range(0,10))          OOOOOOOOOOO
    cnt = 0                          III
    for i in range(0,len(arr)):      IIIIIIIIIIIII
        if pr(arr[i]):               IIIIIIIII
            cnt += 1                 III
    print(arr)                       OOOO
    while fnd:                       OOO
Fig. 2. An example of how the first model tags (on the right) a certain token sequence in order to determine the location of a nonidiomatic snippet (on the left). The model is represented by $M_1$, and the tags I and O stand for IN and OUT.

3.2.1 Snippet Location ($M_1$)
The first task is represented by the functions LOCATE and SNIPPET. As described above, LOCATE returns a sequence of index pairs representing the locations of the snippets, and SNIPPET returns a sequence of the snippets to be substituted. The task is formulated as a sequence tagging problem, solved by a recurrent neural network ($M_1$). It tags each element in an index sequence (which represents the tokens of a source code) with an index (0 or 1) representing either IN or OUT. The input layer expects inputs of consistent length, which we achieve by padding each training example with the special PAD token beforehand. Next is an embedding layer, which embeds tokens into a 256-dimensional vector space. The second layer is a bidirectional LSTM [8] with 256 units that returns the whole sequence of outputs. The last layer is a dense layer with 2 units (the number of possible tags), which is applied to each element of the sequence returned by the BiLSTM. We use softmax as the activation function, categorical crossentropy as the loss function, and Adam [11] (with a learning rate of 0.001) as the optimizer. After around 15 epochs the model learns to tag a given tokenized code quite well. With the correct tagging of a program, we can extract the desired information using the TAGS2LOC function: it takes the result of $M_1$ (the tags) as its argument and returns the sequence of beginning and ending indices. As described above, the found snippet itself is also needed for other subtasks. For this purpose, the TAGS2SNP function is used: it takes the source code and the result of $M_1$ as its arguments, and returns only those tokens that were labeled with the IN tag. Based on these, the definition of LOCATE is

$$\mathrm{LOCATE}(SC) := \mathrm{TAGS2LOC}(M_1(\mathrm{TOKENIZE\_CODE}(SC)))$$

and the definition of SNIPPET is

$$\mathrm{SNIPPET}(SC) := \mathrm{TAGS2SNP}(SC, M_1(\mathrm{TOKENIZE\_CODE}(SC))).$$

3.2.2 Determining the Type of Refactor ($M_2$)
The second task, represented by SNIPPET_TYPE, is to determine the kind of pattern the code snippet implements. This is a classification problem, which we solve with a feedforward neural network ($M_2$). The first layer of the network is dense with 512 units, and the input shape is the maximum length (number of tokens) of the training examples. We use the ReLU activation function, then a Dropout [14] with a rate of 20%. A dense layer with 6 units is the last layer. We use the softmax activation function, categorical crossentropy as the loss function, and Adam [11] as the optimizer. After around 20 epochs the model learns to distinguish between the nonidiomatic pattern types. Using $M_2$ we define

$$\mathrm{SNIPPET\_TYPE}(SCS) := M_2(\mathrm{TOKENIZE\_SNIPPET}(SCS)).$$

3.2.3 Determining the Variables to Utilize for the Substitution ($M_3$)
The function VARIABLES is used to determine the variables that the programmer used in the algorithm. We need to know what list was being iterated over, what functions the programmer used, etc. We need to find the tokens representing the variables that are used for storing the result, and any others used when determining it. This subtask is also formulated as a sequence tagging problem. The goal of the model $M_3$ is to tag an index sequence that represents a localized snippet tokenized with the function TOKENIZE_SNIPPET. The following tags are used:
– LIST - the iterated list
– TARGET and TARGET2 - the variable(s) used for storing the results
– FIELD - field of an object
– PRED - predicate function
– FUNC - any function used on every element of the list
– UNK - any other unknown token.
The architecture of the model that we train to tag a sequence is almost the same as that of $M_1$; the main difference comes from the different labeling. This difference implies that the dense output layer of $M_3$ has 8 units (the number of possible tags, including PAD as a special tag) instead of 2 (as in $M_1$). Assuming that we have a snippet that is tagged correctly with these tags, the information needed can be easily extracted. This process is represented by the function TAGS2VARS. It expects a code snippet and the tags of its tokens, and returns a tuple of token-type pairs describing the variables that need to be utilized.
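The tag-decoding helpers named in this section (TAGS2LOC, TAGS2SNP, TAGS2VARS) are not spelled out in the text; a minimal sketch of what they might look like is given below. The function bodies, the use of token indices for locations, and the string tag names are assumptions for illustration only. The example data follow the taggings shown in Figs. 2 and 3.

```python
def tags2loc(tags):
    """Return (first, last) index pairs of maximal runs of IN ("I") tags."""
    locations, start = [], None
    for i, tag in enumerate(tags):
        if tag == "I" and start is None:
            start = i
        elif tag != "I" and start is not None:
            locations.append((start, i - 1))
            start = None
    if start is not None:
        locations.append((start, len(tags) - 1))
    return locations


def tags2snp(tokens, tags):
    """Return the token runs tagged IN, one list per located snippet."""
    return [tokens[a:b + 1] for a, b in tags2loc(tags)]


def tags2vars(tokens, tags):
    """Collect distinct (tag, token) pairs, skipping UNK and PAD."""
    pairs = []
    for token, tag in zip(tokens, tags):
        if tag not in ("UNK", "PAD") and (tag, token) not in pairs:
            pairs.append((tag, token))
    return tuple(pairs)


# Tagging of Fig. 2: 14 OUT tokens, 28 IN tokens, 7 OUT tokens.
m1_tags = ["O"] * 14 + ["I"] * 28 + ["O"] * 7
print(tags2loc(m1_tags))

# Tagging of Fig. 3 for the 28 tokens of the located snippet.
snippet_tokens = (
    "cnt = 0 for i in range ( 0 , len ( arr ) ) : "
    "if pr ( arr [ i ] ) : cnt += 1"
).split()
m3_tags = (
    ["TARGET"] + ["UNK"] * 11 + ["LIST"] + ["UNK"] * 4
    + ["PRED", "UNK", "LIST"] + ["UNK"] * 5 + ["TARGET"] + ["UNK"] * 2
)
print(tags2vars(snippet_tokens, m3_tags))
```

Collecting maximal runs of IN tags is what lets the same helper return one location or several, matching the extension to multiple snippets described in Sect. 3.1.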
    cnt = 0                          TARGET U U
    for i in range(0,len(arr)):      U U U U U U U U U LIST U U U
        if pr(arr[i]):               U PRED U LIST U U U U U
            cnt += 1                 TARGET U U
Fig. 3. An example of how the third model tags (on the right) a token sequence (on the left) in order to determine the variables to be utilized for generating an alternative for the original nonidiomatic snippet. The model is represented by $M_3$, and the tag U stands for UNK.

Table 3. A summary table about the functions defined in this section including their names, the expected input value and their output.

Name | Expects | Returns
TOKENIZE | Python source code | Tokenized code represented by an index sequence
$M_1$ | Code represented by an index sequence | Predicted tags (In, Out)
TAGS2LOC | Tags (In, Out) | Locations of nonidiomatic snippet(s)
TAGS2SNP | Source code and tags (In, Out) | Nonidiomatic snippet(s)
$M_2$ | Code represented by an index sequence | Predicted type of substitution
$M_3$ | Code represented by an index sequence | Predicted tags (Target, List, Func, Pred, ...)
TAGS2VARS | Source code and tags (Target, List, Func, Pred, ...) | Main variables of a nonidiomatic snippet
Based on the functions defined above, VARIABLES is defined in the following way:

$$\mathrm{VARIABLES}(SCS) := \mathrm{TAGS2VARS}(SCS, M_3(\mathrm{TOKENIZE\_SNIPPET}(SCS))).$$

Table 3 summarizes the functions defined in this section. In the next section, the dataset generation is explained for the three neural networks defined above.
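To make the name-unifying tokenization of Sect. 3.2 more concrete, here is a small sketch built on the standard-library `tokenize` module. It is an assumption-laden illustration, not the paper's TOKENIZE_CODE: the helper names, the decision to keep keywords and builtins such as `range` and `len` as they are, the layout token strings, and the vocabulary with PAD at index 0 are all hypothetical, and the tab-splitting modification mentioned in the text is not reproduced.

```python
import builtins
import io
import keyword
import tokenize

# Token types that carry no information for this sketch.
SKIPPED = {tokenize.ENCODING, tokenize.ENDMARKER, tokenize.NL, tokenize.COMMENT}
# Layout tokens are mapped to readable placeholder strings.
LAYOUT = {
    tokenize.NEWLINE: "NEWLINE",
    tokenize.INDENT: "INDENT",
    tokenize.DEDENT: "DEDENT",
}


def unify_names(source):
    """Tokenize `source`, replacing user-chosen identifiers with NAME."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type in LAYOUT:
            tokens.append(LAYOUT[tok.type])
        elif (
            tok.type == tokenize.NAME
            and not keyword.iskeyword(tok.string)
            and not hasattr(builtins, tok.string)
        ):
            tokens.append("NAME")
        else:
            tokens.append(tok.string)
    return tokens


def to_indices(tokens, vocab):
    """Map tokens to integer indices, growing `vocab` as needed."""
    return [vocab.setdefault(token, len(vocab)) for token in tokens]


code = "cnt = 0\nfor i in range(len(arr)):\n    cnt += 1\n"
tokens = unify_names(code)
indices = to_indices(tokens, {"PAD": 0})
print(tokens)
print(indices)
```

Unifying identifiers this way keeps the vocabulary small and closed, which is the property the text relies on when it contrasts its approach with open-vocabulary models.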
4 Dataset Generation
As mentioned in the previous section, three neural networks are used to locate and replace nonidiomatic snippets in source code. The neural networks are trained with two different generated datasets. The generation process consists of three steps. First, templates are generated for each code pattern using generic names for variables and functions. The second step, augmentation, includes randomly renaming the identifiers and creating random conditions for the code patterns. The last step is to combine these modified patterns with real-world code from GitHub. The results of the second and third steps are used as training sets. These are:
1. A collection of nonidiomatic code patterns with randomly renamed variables and functions, and randomly created conditions
2. A collection of Python scripts from GitHub projects containing randomly inserted nonidiomatic code patterns.
4.1 Template Generation
The templates of the possible nonidiomatic patterns are created via a context-free grammar (Table 4). The variable names and conditions are generic so that they can be replaced easily. In the next step, we replace the identifiers with random strings and create conditions with random numerical operators. The generator is written in Python using the Natural Language Toolkit (NLTK) [12]. The examples in Table 4 implement the nonidiomatic pattern count, which counts the elements in a list for which a predicate is satisfied. More examples can be found in Tables 5, 6, 7, 8 and 9 in the Appendix.

Table 4. Two code snippets generated by the grammar of the count pattern. The columns are: the class index of the code (count: 0), the snippet, the name of the list, and the name of the result.

Class index: 0; list: arr; result: count
    count = 0
    for loopInd in range(len(arr)):
        if Pred(arr[loopInd]):
            count += 1

Class index: 0; list: arr; result: count
    count = 0
    for loopInd in range(len(arr)):
        if Pred(arr[loopInd]):
            count = count+1
With these rules, about 400 templates can be generated. Figure 4 shows the exact numbers for each kind of pattern.
4.2 Augmentation of Templates
Augmenting the examples by replacing the generic variable and function names with random strings creates more data for the neural networks and is also a good way to create equal numbers of patterns. The conditions (generally predicate functions) are also changed at this stage. In most cases, the predicate function is an inline boolean expression (such as 1 == arr[i]). In some of the snippets, the list which we iterate over is replaced by a fixed row/column of a matrix/tuple (such as arr[i][1]). In some snippets we replace all the occurrences of the elements of the list with an indexed version (using the [] operator). With these modifications applied, we generate 120 000 snippets with equal distribution of the six classes.
These snippets are processed further to train $M_2$ and $M_3$, the models that determine the type of the pattern in a snippet and extract the variables from it. To help generalization, the tokens denoting variable names and function names are converted to one of the special tokens (ARR, IDX, PRF, FLD, VAR, UNK) according to the TOKENIZE_SNIPPET tokenization procedure described in Sect. 3.2. The processed snippets (each with its corresponding type of pattern) are used to train $M_2$. In order to create the dataset for training $M_3$, each token in the snippets is tagged according to its type (LIST, TARGET, TARGET2, FIELD, PRED or FUNC, see Sect. 3.2.3), while the rest are tagged UNK. Figure 3 shows an example of how the tagging works. In order to achieve consistent snippet length, the special PAD token and tag are used for padding the end of the sequences.
To train $M_1$ to locate nonidiomatic snippets, programs are downloaded from GitHub in large amounts and nonidiomatic snippets are inserted into them. These are stored in a table where each row contains the type of the snippet, the larger code with the inserted snippet, the location of the insertion, and the important identifier names. Only the contexts of the snippets are kept, to get rid of the large amount of unnecessary code. After tokenizing, each training example is approximately 1500 tokens long, of which the snippet is 30–60 tokens. We tag each token with IN or OUT depending on whether it is in the snippet or not. Figure 2 shows an example of how the tagging works.
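The renaming and insertion steps described above can be sketched as follows. This is a simplified illustration under several assumptions: the random names carry a hypothetical `v_` prefix so they never collide with keywords, tagging is done per line rather than per token, and the helper names (`augment`, `insert_snippet`) do not come from the paper.

```python
import random
import re
import string

# A count template with generic names, as in Table 4.
TEMPLATE = (
    "count = 0\n"
    "for loopInd in range(len(arr)):\n"
    "    if Pred(arr[loopInd]):\n"
    "        count += 1\n"
)
GENERIC_NAMES = ["count", "loopInd", "arr", "Pred"]


def random_name(rng, length=5):
    """Return a random identifier such as 'v_qhzke'."""
    letters = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    return "v_" + letters


def augment(template, rng):
    """Replace every generic identifier with a distinct random name."""
    mapping = {}
    for name in GENERIC_NAMES:
        new_name = random_name(rng)
        while new_name in mapping.values():
            new_name = random_name(rng)
        mapping[name] = new_name
    pattern = re.compile(r"\b(" + "|".join(GENERIC_NAMES) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], template), mapping


def insert_snippet(context_lines, snippet_lines, position):
    """Insert the snippet at `position` and tag every line IN or OUT."""
    lines = context_lines[:position] + snippet_lines + context_lines[position:]
    tags = (
        ["OUT"] * position
        + ["IN"] * len(snippet_lines)
        + ["OUT"] * (len(context_lines) - position)
    )
    return lines, tags


rng = random.Random(42)
snippet, mapping = augment(TEMPLATE, rng)
lines, tags = insert_snippet(["a = 1", "print(a)"], snippet.splitlines(), 1)
for line, tag in zip(lines, tags):
    print(tag, line)
```

Because the tags are derived from the insertion position, every generated example comes with its ground-truth location for free, which is what makes this kind of synthetic dataset cheap to produce.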
We use TOKENIZE_CODE for tokenizing full source codes (as described in Subsect. 3.2.2). To achieve consistent input length, we use the special PAD token for padding with the OUT tag.

[Bar chart; y-axis: Amount (0 to 200); x-axis: count, max, search, sum, all, any; bar labels: 192, 90, 64, 24, 24, 24]
Fig. 4. The distribution of the number of different patterns before renaming.
5 Evaluation
In order to demonstrate the validity of our approach, we tested the method on a real dataset containing scripts written mostly by students with no prior Python experience. The programs had originally been submitted as homework to the Mester assignment system operated by Eötvös Loránd University. We received 13373 Python files, some of which presumably contain nonidiomatic snippets. The assignment tasks are of varying difficulty. The easier ones can be solved with simple program constructions, while the more difficult ones require a deeper understanding of patterns. The tasks expect the input from the console and print the output to it as well. The inputs generally consist of multiple lines, where in most cases the first line describes the number of upcoming rows and/or columns. An example of an assignment task can be seen in Fig. 6 in Appendix A.
5.1 Automated and Manual Evaluation
In order to measure the precision, we combine two complementary testing approaches: automated testing and manual tagging. Automated testing is done with a testing tool that compares the original and the refactored programs. It generates inputs for the programs, runs them, then compares their outputs. The inputs are created randomly with the following structure. An input is 4 to 7 lines long (inputs that are too short or too long do not work well with the programs in the dataset). Each line contains 1 to 6 integers separated by spaces, with values ranging from 1 to 5. The values are bounded because they represent index values in several cases. It should be noted that some of the solutions require string handling. Since our method is not designed for such cases, string inputs are not generated in the automated evaluation process. For each code, 1000 inputs are generated in order to find some on which the original version terminates without error. If the outputs of both versions match on all of these inputs, we consider the fix successful. The second approach is applied if no appropriate inputs are found for the original version. These examples are manually tagged as successful or unsuccessful fixes. The tagging is done by a thorough examination of the original and refactored versions of each code. In order to filter out incorrect tags, the tagging procedure was performed twice, and the differences were then compared and resolved.
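The random input structure described above is simple enough to sketch directly. The function names and the seeded generator are assumptions; only the ranges (4 to 7 lines, 1 to 6 integers per line, values 1 to 5, 1000 inputs per program) come from the text.

```python
import random


def generate_input(rng):
    """Build one random console input: 4-7 lines of 1-6 integers in 1..5."""
    lines = []
    for _ in range(rng.randint(4, 7)):
        values = [str(rng.randint(1, 5)) for _ in range(rng.randint(1, 6))]
        lines.append(" ".join(values))
    return "\n".join(lines) + "\n"


def generate_inputs(count=1000, seed=0):
    """Generate `count` inputs for one program, as in the evaluation."""
    rng = random.Random(seed)
    return [generate_input(rng) for _ in range(count)]


inputs = generate_inputs()
print(inputs[0])
```

Each generated input would then be fed to both the original and the refactored program, keeping only the inputs on which the original terminates without error.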
5.2 Precision
We ran our method on each program in the dataset. Out of the 13373 programs, changes were made to 1303. The automated testing approach could be applied in 1054 cases, of which 506 were identified as correct. For the remaining 249 codes, the manual tagging procedure found 85 correct modifications. In total, this gives 591 correct localizations and substitutions out of the 1303 cases. Thus the precision of our method is 45.35% (591/1303). It should be noted that the automated testing process pointed to 300 trivially recognizable corrupted codes that crash, produce runtime errors, or do not terminate. These can be easily filtered out with suitable tools, so it is worth observing the precision when neglecting these 300 codes: 58.92% (591/(1303-300)). As we do not run the programs during the manual testing process, there might be more than 300 such files, making the aforementioned percentage a lower estimate.
5.3 Recall
In order to estimate the prevalence of nonidiomatic snippets in the test dataset, a sample of 300 programs was taken. The programs were selected randomly and tagged according to whether they should be fixed or not. As in the previous tagging procedure, all of the programs were tagged twice, and the tags were then compared. 28 programs were found to contain snippet(s) to be replaced. In the vast majority of these cases, the program contains exactly one snippet. There are a few exceptions, resulting in 35 snippets in total. The pattern distribution of the found snippets is the following:
– count: 17/35 (48.57%)
– max: 7/35 (20.0%)
– search: 1/35 (2.85%)
– sum: 7/35 (20.0%)
– any: 1/35 (2.85%)
– all: 2/35 (5.71%).
According to the sample, about 9.33% of the programs contain one or more snippets to be refactored (28/300). Our method correctly localized and substituted 591 snippets, which is 4.41% of all of the codes (591/13373). This gives the estimated recall of the whole system: 47.27% (4.41%/9.33%). Furthermore, the number of codes with correctly localized snippets (ignoring the correctness of the fix) is 706, which is 5.27% of all of the codes (706/13373), giving the estimated recall of localization: 56.27% (5.27%/9.33%).
5.4 Precision of Subsystems
In order to determine the precision of the localization capability alone (correct snippet localizations to all localizations), the refactored programs were tagged as follows. A program was marked with “correct localization” if all of the found snippets in the code were nonidiomatic. As in the previous cases, the programs were tagged twice in order to reduce the chance of incorrect taggings. Since 591 of the files were already identified as correct fixes, it was only necessary to inspect the rest. 115 programs were found to contain correctly localized, but not appropriately substituted snippets. Altogether there were 706 correct localizations, giving 54.18% (706/1303) as the precision of the finder algorithm. With this knowledge, it is natural to also determine the precision of the fixing process (correct substitutions to correct localizations). Out of the programs containing correctly localized snippet(s), 591 were correctly fixed, so the precision of the substitution process is 83.71% (591/706). A visualization of the performance of the subsystems can be seen in Fig. 5.
Fig. 5. Visual overview of the precision of the subsystems.
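The precision and recall figures reported in Sects. 5.2 to 5.4 all follow from a handful of counts, which the short script below recomputes. The variable names are hypothetical, and the recomputed ratios may differ in the last digits from the quoted percentages, since the text appears to truncate and to divide already-rounded percentages.

```python
# Counts taken from the evaluation section.
total_programs = 13373
changed = 1303
auto_tested, auto_correct = 1054, 506
manual_tagged, manual_correct = 249, 85
trivially_corrupted = 300
sample_size, sample_with_snippets = 300, 28
localized_not_fixed = 115

correct_fixes = auto_correct + manual_correct                 # 591
correct_localizations = correct_fixes + localized_not_fixed   # 706


def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


prevalence = sample_with_snippets / sample_size  # estimated share of programs to fix

metrics = {
    "precision": pct(correct_fixes, changed),
    "precision_without_corrupted": pct(correct_fixes, changed - trivially_corrupted),
    "localization_precision": pct(correct_localizations, changed),
    "substitution_precision": pct(correct_fixes, correct_localizations),
    "estimated_recall": pct(correct_fixes / total_programs, prevalence),
    "estimated_localization_recall": pct(correct_localizations / total_programs, prevalence),
}

for name, value in metrics.items():
    print(f"{name}: {value:.2f}%")
```

The two recall values depend on the prevalence estimated from only 300 sampled programs, so they carry more uncertainty than the precision values.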
6 Conclusion
We presented a method for locating and fixing nonidiomatic snippets by substituting them with Pythonic alternatives. We introduced a novel approach that uses one feedforward and two recurrent neural networks along with explicit programming in order to locate the snippets and generate an equivalent alternative for them. The approach was validated by testing on more than 13 000 Python programs written by students. According to the evaluation results, given a source code containing nonidiomatic snippets, our algorithm localizes and correctly fixes them in about half of the cases, making the code more Pythonic. The precision of a practical system would go up to about 60%, as many corrupted programs can be trivially recognized. Such a system could be used to idiomatize large Python projects with human supervision. It could also be used for educational purposes, since it can aid the learning process by offering an improved version of the code. In future work, we would like to apply a more general approach to the refactoring process in order to increase the number of substitutable nonidiomatic patterns. We also plan to experiment with different neural network architectures to improve the overall precision and recall of the system.
Acknowledgments. EFOP-3.6.3-VEKOP-16-2017-00001: Talent Management in Autonomous Vehicle Control Technologies – The Project is supported by the Hungarian Government and co-financed by the European Social Fund. We would like to express our great appreciation to László Zsakó and Gyula Horváth for providing an enormous amount of Python code to test our algorithm on. The data is from Eötvös Loránd University’s programming exercise bank and submission website.
A Appendix
Table 5. Pattern of maximum search with the name of the list, and the name of the maximum value.
Pattern 1; list name: arr; maximum value name: MAx.

    maXind = 0; MAx = arr[0]
    for loopInd in range(len(arr[1:])):
        if MAx < arr[loopInd + 1]:
            MAx = arr[loopInd + 1]
            maXind = loopInd + 1

Table 6. Pattern of linear search with the name of the list, and the name of the boolean return value.
Pattern 2; list name: arr; return value name: found.

    found = Pred(arr[0])
    loopInd = 0
    while loopInd < len(arr) - 1 and not found:
        if Pred(arr[loopInd + 1]):
            found = True
        loopInd = loopInd + 1
B. Szalontai et al.
Table 7. Pattern of summation with the name of the list, and the name of the sum value.
Pattern 3; list name: arr; sum value name: SUm.

    SUm = 0
    for loopInd in range(len(arr)):
        if Pred(arr[loopInd]):
            SUm += arr[loopInd]

Table 8. Pattern of all with the name of the list, and the name of the boolean return value.
Pattern 4; list name: arr; return value name: All.

    All = Pred(arr[0])
    loopInd = 0
    while loopInd < len(arr) - 1 and All:
        if not Pred(arr[loopInd + 1]):
            All = False
        loopInd = loopInd + 1
Table 9. Pattern of any with the name of the list, and the name of the boolean return value.
Pattern 5; list name: arr; return value name: Any.

    Any = Pred(arr[0])
    loopInd = 0
    while loopInd < len(arr) - 1 and not Any:
        if Pred(arr[loopInd + 1]):
            Any = True
        loopInd = loopInd + 1

Most expensive house
A real estate firm stores the area and price of the houses for sale. Write a program which finds the index of the most expensive house.
Input: The first line of the standard input contains the number of houses (1 ≤ N ≤ 100), the following N lines each contain the area of a house (in m², 1 ≤ A ≤ 500) and the price (in thousands of USD, 1 ≤ P ≤ 1000).
Output: The first line of the output should contain the index of the most expensive house. If there are multiple solutions then the smallest index should be written.
Example:

    Input       Output
    6           4
    42 15
    110 20
    125 160
    166 180
    42 10
    110 39
Fig. 6. An example exercise.
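For orientation, the loop patterns of Tables 5 to 9 correspond to Python built-ins. The sketch below shows such idiomatic counterparts and a compact solution to the exercise in Fig. 6. It is an illustration only: `is_even` is a hypothetical stand-in for `Pred`, `most_expensive` is an invented helper, and the exact substitutions emitted by the system may differ.

```python
def is_even(x):
    """Hypothetical stand-in for the predicate Pred used in the patterns."""
    return x % 2 == 0

arr = [15, 20, 160, 180, 10, 39]

# Table 5 (maximum search): index and value of the first maximum.
max_index = max(range(len(arr)), key=arr.__getitem__)
max_value = arr[max_index]

# Tables 6 and 9 (linear search / any): does some element satisfy Pred?
found = any(is_even(x) for x in arr)

# Table 7 (summation): sum of the elements that satisfy Pred.
total = sum(x for x in arr if is_even(x))

# Table 8 (all): do all elements satisfy Pred?
all_even = all(is_even(x) for x in arr)

def most_expensive(lines):
    """1-based index of the first house with the highest price (Fig. 6)."""
    n = int(lines[0])
    prices = [int(line.split()[1]) for line in lines[1:n + 1]]
    return prices.index(max(prices)) + 1

example = ["6", "42 15", "110 20", "125 160", "166 180", "42 10", "110 39"]
print(most_expensive(example))
```

Both `max` and `list.index` return the first match, which gives the smallest index when several houses share the highest price, as the exercise requires.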
References
1. Aftandilian, E., Sauciuc, R., Priya, S., Krishnan, S.: Building useful program analysis tools using an extensible Java compiler. In: 2012 IEEE 12th International Working Conference on Source Code Analysis and Manipulation, pp. 14–23. IEEE (2012)
2. Ahmed, T., Hellendoorn, V., Devanbu, P.T.: Learning lenient parsing & typing via indirect supervision. CoRR, abs/1910.05879 (2019)
3. Allamanis, M., Barr, E.T., Devanbu, P., Sutton, C.: A survey of machine learning for big code and naturalness. ACM Comput. Surv. (CSUR) 51(4), 1–37 (2018)
4. Danish, M., Allamanis, M., Brockschmidt, M., Rice, A., Orchard, D.: Learning units-of-measure from scientific code. In: 2019 IEEE/ACM 14th International Workshop on Software Engineering for Science (SE4Science), pp. 43–46. IEEE (2019)
5. Gupta, R., Pal, S., Kanade, A., Shevade, S.: DeepFix: fixing common C language errors by deep learning. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31 (2017)
6. Habib, A., Pradel, M.: Neural bug finding: a study of opportunities and challenges. CoRR, abs/1906.00307 (2019)
7. Hellendoorn, V.J., Bird, C., Barr, E.T., Allamanis, M.: Deep type inference. In: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 152–162 (2018)
8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
9. Kaleeswaran, S., Santhiar, A., Kanade, A., Gulwani, S.: Semi-supervised verified feedback generation. In: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 739–750 (2016)
10. Karampatsis, R.-M., Sutton, C.: Maybe deep neural networks are the best choice for modeling source code. CoRR, abs/1903.05734 (2019)
11. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: ICLR (Poster) (2015)
12. Loper, E., Bird, S.: NLTK: the natural language toolkit. In: Proceedings of the ACL-2002 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics, ETMTNLP 2002, USA, vol. 1, pp. 63–70. Association for Computational Linguistics (2002)
13. Pradel, M., Sen, K.: DeepBugs: a learning approach to name-based bug detection. Proc. ACM Program. Lang. 2(OOPSLA), 1–25 (2018)
14. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
15. Vasic, M., Kanade, A., Maniatis, P., Bieber, D., Singh, R.: Neural program repair by jointly learning to localize and repair. CoRR, abs/1904.01720 (2019)
16. Wong, W.E., Gao, R., Li, Y., Abreu, R., Wotawa, F.: A survey on software fault localization. IEEE Trans. Softw. Eng. 42(8), 707–740 (2016)
BreakingBED: Breaking Binary and Efficient Deep Neural Networks by Adversarial Attacks

Manoj-Rohit Vemparala¹(✉), Alexander Frickenstein¹, Nael Fasfous², Lukas Frickenstein¹, Qi Zhao¹, Sabine Kuhn¹, Daniel Ehrhardt¹, Yuankai Wu¹, Christian Unger¹, Naveen-Shankar Nagaraja¹, and Walter Stechele²

¹ BMW Autonomous Driving, Unterschleißheim, Germany
{manoj-rohit.vemparala,alexander.frickenstein,lukas.frickenstein,qi.zhao,sabine.kuhn,daniel.ehrhardt,yuankai.wu,christian.unger,naveen-shankar.nagaraja}@bmw.de
² Technical University of Munich, Munich, Germany
{nael.fasfous,walter.stechele}@tum.de
Abstract. Deploying convolutional neural networks (CNNs) for embedded applications presents many challenges in balancing resource-efficiency and task-related accuracy. These two aspects have been well-researched in the field of CNN compression. In real-world applications, a third important aspect comes into play, namely the robustness of the CNN. In this paper, we thoroughly study the robustness of uncompressed, distilled, pruned and binarized neural networks against white-box and black-box adversarial attacks (FGSM, PGD, C&W, DeepFool, LocalSearch and GenAttack). These new insights facilitate defensive training schemes or reactive filtering methods, where the attack is detected and the input is discarded and/or cleaned. Experimental results are shown for distilled CNNs, agent-based state-of-the-art pruned models, and binarized neural networks (BNNs) such as XNOR-Net and ABC-Net, trained on the CIFAR-10 and ImageNet datasets. We present evaluation methods to simplify the comparison between CNNs under different attack schemes using loss/accuracy levels, stress-strain graphs, box-plots and class activation mapping (CAM). Our analysis reveals susceptible behavior of uncompressed and pruned CNNs against all kinds of attacks. The distilled models exhibit their strength against all white-box attacks, with the exception of C&W. Furthermore, binary neural networks exhibit resilient behavior compared to their baselines and other compressed variants.
Keywords: Convolutional neural networks · Model compression · Model robustness · Adversarial attacks

M.-R. Vemparala, A. Frickenstein, N. Fasfous and L. Frickenstein: Equal contributions.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 148–167, 2022. https://doi.org/10.1007/978-3-030-82193-7_10
1 Introduction
Neural network compression is an extensively studied topic for reducing the computational complexity [21,27,36], the memory demand [15,19,25] and/or the energy consumption [42] of deep neural networks (DNN) deployed on embedded systems. These aspects widen the potential for DNN applications in real-world scenarios. Particularly in the field of robotics and autonomous driving, increasingly deeper and larger convolutional neural networks (CNNs) are deployed on resource-constrained hardware platforms, enabling computer vision-based applications, such as pedestrian detection or free-space detection. Systems in autonomous vehicles are safety critical, maintaining zero-tolerance for potential threats to functional safety. Attacking (breaking) neural networks can be done by injecting small perturbations to their inputs, referred to as adversarial attacks [39]. Under the assumption of varying degrees of information on the CNN and the accessibility of its internal parameters, several black-box (GenAttack [2], LocalSearch [31]) and white-box (FGSM [22], DeepFool [30] and Carlini & Wagner [5]) attacks are potential threats. Understanding these threats helps to develop pro-active [11] and re-active [33] methods to defend against adversarial examples and thereby improve CNN robustness.
Fig. 1. Experimental setup of BreakingBED for breaking binary (C) and efficient (A and B) DNNs, attacked with white-box (FGSM, PGD and C&W) and black-box (LocalSearch and GenAttack) adversarial attacks and evaluated using loss/accuracy levels, stress-strain graphs, box-plots and class activation mapping (CAM).
Recent works investigated the mitigation of such threats through robust training of neural networks [14] and robust neural architecture search (NAS) techniques [12]. In [26], the authors compress neural networks through robust quantization, lowering the computational complexity while maintaining good performance against potential attacks. Further investigations on the robustness of binary neural networks (BNNs) were carried out in [10], where BNNs were attacked with white-box (FGSM [22] and C&W [5]) and black-box [34] techniques. The robustness of BNNs was concluded, albeit on basic adversarially trained networks from [34] and a small set of attacks. In order to get a deeper understanding of the effectiveness of adversarial attacks (Sect. 3) applied to binary and efficient DNNs (Sect. 2), we perform an extensive set of robustness evaluation experiments. In detail, we expose vanilla full-precision, distilled, pruned and binary DNNs to a variety of adversarial attacks in Sect. 4.
2 Compression of Deep Neural Networks
Many works in the literature have focused on reducing the redundancy emerging from training deeper and wider neural networks, aiming to mitigate the challenges of their deployment on edge devices. Compression techniques such as knowledge distillation, pruning, and binarization can potentially make CNNs more efficient in embedded settings.

2.1 Knowledge Distillation
Knowledge distillation (KD) is the transfer of knowledge from a teacher to a student network [20,40]. The student can be a smaller DNN, which is trained on the soft labels of the larger teacher network, achieving an improvement in the accuracy-efficiency trade-off. The student represents a compressed version of the teacher, condensing its knowledge. This paper focuses on KD training using the Kullback-Leibler (KL) divergence between the teacher and the student output distributions, formulated as the loss function in Eq. (1). Here, $\sigma(f_t(I))$ and $\sigma(f_s(I))$ represent the softmax outputs of the teacher and student network, respectively, computed for a sample image $I$ in a mini-batch of $N$ samples.

$$\mathcal{L}_{KD}^{KL}(f_t, f_s, T) = \sum_{n=1}^{N} \sigma\!\left(f_t(I_n)/T\right) \log \frac{\sigma\!\left(f_t(I_n)/T\right)}{\sigma\!\left(f_s(I_n)/T\right)} \quad (1)$$
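Eq. (1) can be made concrete with a few lines of code. The following is a minimal pure-Python sketch, not the authors' training code: the function names `softmax` and `kd_kl_loss` are illustrative, the logits are passed as plain lists, and the sum over classes inside the KL divergence is written out explicitly.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax sigma(z / T), computed in a numerically stable way."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_kl_loss(teacher_logits, student_logits, temperature):
    """Sum over the mini-batch of KL(teacher || student) at the given temperature."""
    loss = 0.0
    for t_logits, s_logits in zip(teacher_logits, student_logits):
        p = softmax(t_logits, temperature)   # teacher distribution
        q = softmax(s_logits, temperature)   # student distribution
        loss += sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return loss

# Hypothetical logits for a mini-batch of one image and three classes.
teacher = [[5.0, 1.0, 0.0]]
student = [[1.0, 5.0, 0.0]]
print(kd_kl_loss(teacher, student, temperature=1.0))
print(kd_kl_loss(teacher, student, temperature=30.0))
```

Raising the temperature flattens both distributions, so the same pair of logits yields a much smaller loss value at T = 30 than at T = 1.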
During the knowledge transfer using the teacher's logits, a softmax temperature T > 1 is used. During the evaluation, we use T = 1 to obtain the softmax cross-entropy loss.

2.2 Pruning
Pruning aims to eliminate redundancies in DNNs and produce smaller, more efficient neural networks. Pruning has been investigated in many works, over a wide range of DNN models, achieving high compression rates while maintaining high prediction accuracy [15,18,19]. Guo et al. [13] present an irregular pruning method, which can significantly reduce the parameter redundancy by integrating connection pruning with the retraining process. Recently, structured pruning techniques, which remove larger, regular parts of the network, have achieved a tangible improvement in hardware acceleration with negligible accuracy loss [9,15,17,43]. More recently, He et al. proposed a learning-based compression method in AMC-AutoML [19]. The authors leverage a reinforcement learning (RL) agent, which learns the possible sparsities in each layer and prunes them based on an ℓ2-norm heuristic. We adapt the RL-agent of AMC-AutoML to support different pruning regularities such as filter-wise (F. Prune), channel-wise (Ch. Prune), kernel-wise (K. Prune) and weight-wise (W. Prune) pruning (shown in Fig. 1). Pruning input channels from a layer also discards the corresponding output filters from previous layers. Thus, Ch./F. Prune result in a similar compression ratio and CNN structure. The pruning regularity has a direct impact on the hardware implementation complexity and throughput benefits. In this paper, the pruning rate is set at a constant value of 50% over all experiments and pruning regularities.
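To illustrate the ℓ2-norm heuristic at a fixed 50% pruning rate, the sketch below ranks the filters of one layer by their norm and returns the weakest half. It is a simplified stand-in under stated assumptions: the RL agent that selects per-layer sparsities is not modeled, each filter is a flat list of weights, and `filters_to_prune` is a hypothetical helper name.

```python
import math

def l2_norm(weights):
    """Euclidean norm of one flattened filter."""
    return math.sqrt(sum(w * w for w in weights))

def filters_to_prune(filters, pruning_rate=0.5):
    """Indices of the filters with the smallest l2-norm, at the given rate."""
    n_prune = int(len(filters) * pruning_rate)
    ranked = sorted(range(len(filters)), key=lambda i: l2_norm(filters[i]))
    return sorted(ranked[:n_prune])

# Four hypothetical filters with two weights each.
layer = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.5], [-0.03, 0.0]]
print(filters_to_prune(layer))
```

For filter-wise pruning, the returned indices would be removed from the layer; the other regularities (channel-, kernel- and weight-wise) apply the same ranking idea to differently shaped groups of weights.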
2.3 Binarization
Binarization represents the most aggressive form of quantization, where the network weights W and activations are constrained to ±1 values. This greatly reduces the memory requirements of DNNs. In theory, binarizing a single-precision floating-point DNN reduces its memory footprint by up to 32×. Different schemes for the binarization of a DNN have been proposed. Courbariaux et al. [7] introduced the concept of training neural networks with binary weights B during inference and maintaining a latent representation during back-propagation. The authors later augmented this approach with binarized activations [21]. Rastegari et al. [36] introduced XNOR-Net, where the convolution of an input feature map $A^{l-1}$ and weight tensor $W$ is approximated by a combination of XNOR operations and popcounts $\oplus$, followed by a multiplication with a scaling factor $\alpha$, such that $\mathrm{Conv}(A^{l-1}, W) \approx (\mathrm{sign}(A^{l-1}) \oplus B) \cdot \alpha$ (shown in Fig. 1). Binary neural networks (BNNs) typically suffer from accuracy degradation. To mitigate this problem, Lin et al. [27] proposed a scheme for Accurate Binary CNNs (ABC-Net). The authors approximated the convolution by using a linear combination of multiple binary bases for weights and activations, shrinking the accuracy gap to full-precision counterparts. In this paper, we implement the ABC-Net and XNOR-Net binarization techniques to evaluate the effect of adversarial attacks on accurate BNNs.
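The XNOR-and-popcount approximation can be illustrated for a single dot product, which is the building block of a convolution. The sketch below is an illustration under assumptions, not the authors' implementation: ±1 vectors are packed into Python integers, zero is mapped to +1, and the scaling factor α is taken as the mean absolute weight, which is the choice made in the XNOR-Net paper but is not stated in the text above.

```python
def sign(x):
    """Binarize to +1 / -1 (zero is mapped to +1 by convention here)."""
    return 1 if x >= 0 else -1

def pack(values):
    """Encode a +1/-1 vector as an integer bit mask (+1 -> bit 1, -1 -> bit 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits

def xnor_dot(a_bits, b_bits, n):
    """Dot product of two +1/-1 vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    agreements = bin(~(a_bits ^ b_bits) & mask).count("1")
    return 2 * agreements - n

def binary_dot(activations, weights):
    """Approximate sum(a * w) as (sign(a) XNOR-popcount sign(w)) * alpha."""
    n = len(weights)
    alpha = sum(abs(w) for w in weights) / n   # assumed scaling factor
    a_bits = pack([sign(a) for a in activations])
    b_bits = pack([sign(w) for w in weights])
    return xnor_dot(a_bits, b_bits, n) * alpha

print(binary_dot([0.5, -2.0, 1.0], [0.2, -0.4, -0.6]))
```

Each agreeing bit pair contributes +1 and each disagreeing pair −1, which is why the popcount of the XNOR result is rescaled as 2 · agreements − n.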
3 Adversarial Attacks
One option to attack (break) neural networks is to inject small perturbations (adversarial biases) into their inputs, called adversarial attacks. An adversarial example $I^{Adv}$ that forces a given classifier with parameters $\theta$ to misclassify an image $I$ with true label $L$ renders a successful non-target attack: $A = \{I^{Adv} \mid \theta(I^{Adv}) \neq L\}$. A successful target attack, in contrast, can be defined as $A = \{I^{Adv} \mid \theta(I^{Adv}) = L_t\}$ for some target class $t$. The capability of the adversary can be described by a set of allowed perturbations $S: D(I, I^{Adv}) \leq \epsilon$, restricting the maximum possible perturbation distance $D(I, I^{Adv})$ to a given image $I$ by some adversarial manipulation budget $\epsilon$. Finding $I^{Adv}$ can be formulated as a maximization problem as defined in Eq. (2), whereby various attacks are designed to be effective using different distance metrics ($\ell_1$, $\ell_2$, $\ell_\infty$) [4].

$$\max_{I^{Adv} \in S} \mathcal{L}(I^{Adv}, L, \theta) \quad (2)$$
Attacks can be categorized by the degree of accessibility to a model's internal parameters θ. White-box attacks [3,5,22,24,29,39] assume complete model transparency, allowing full control of and access to the target CNN. In most real-world scenarios, a model's fine internal details are not easily accessible, rendering white-box attacks less practical [6]. Black-box attacks [2,31], on the other hand, assume no such information. The adversary can be a standard user with access only to the inputs and the outputs of a targeted model. Such attacks are more practical and can have severe consequences in real-time critical applications. Different models learn similar features when they are trained for the same task, and adversarial perturbations are highly aligned with the weight vectors of a model. This results in the generalization of adversarial examples over different models [22], making it possible to transfer a white-box attack on one model as a black-box attack to another [24].

3.1 White-Box Attacks
Fast Gradient Sign Method: The most commonly used attack to verify the robustness of neural networks against input perturbations is the fast gradient sign method (FGSM) [22]. FGSM linearizes the loss function of a neural network around θ by calculating its gradient $\nabla \mathcal{L}(I, L, \theta)$ to generate adversarial examples $I^{Adv}$, resulting in an efficient solution to Eq. (2). The input variation parameter $\epsilon$ controls the perturbation's amplitude [24], as expressed in Eq. (3).

$$I^{Adv} = I + \epsilon \cdot \mathrm{sign}\left(\nabla \mathcal{L}(I, L, \theta)\right) \quad (3)$$
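A single FGSM step from Eq. (3) can be sketched on a toy model. The example below assumes a linear classifier with a sigmoid output and binary cross-entropy loss, chosen only because its input gradient has a closed form; the CNNs studied in the paper would need an automatic-differentiation framework instead. All names and values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(x, label, w, b):
    """Binary cross-entropy of a linear classifier and its gradient w.r.t. x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -math.log(p if label == 1 else 1.0 - p)
    grad = [(p - label) * wi for wi in w]   # d loss / d x for this model
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, label, w, b, eps):
    """One FGSM step: x + eps * sign(gradient of the loss w.r.t. x)."""
    _, grad = loss_and_input_grad(x, label, w, b)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

# Hypothetical two-feature input and model parameters.
w, b = [2.0, -1.0], 0.0
x, label = [0.5, 0.2], 1
x_adv = fgsm(x, label, w, b, eps=0.1)
print(loss_and_input_grad(x, label, w, b)[0])
print(loss_and_input_grad(x_adv, label, w, b)[0])
```

Every input component moves by exactly ϵ in the direction that increases the loss, so the perturbation has an ℓ∞ distance of ϵ from the original input.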
The attack is strengthened when performed iteratively. This can be considered as an extension of FGSM, generating adversarial samples using a small step-size [24].

Projected Gradient Descent: An even more effective variant is iterative projected gradient descent (PGD) on the loss function with uniform random noise initialization [38], expressed in Eq. (4).

$$I^{Adv}_{i+1} = \pi_S\left(I^{Adv}_i + \alpha \cdot \nabla \mathcal{L}(I^{Adv}_i, L, \theta)\right) \quad (4)$$

Here, adversarial examples $I^{Adv}_{i+1}$ are generated by taking one step in the ascent direction of the loss gradient $\nabla \mathcal{L}(I^{Adv}_i, L, \theta)$ with respect to the previous image $I^{Adv}_i$ at iteration $i$, where the step-size is scaled by $\alpha$, followed by a potential projection $\pi$ onto the legal set $S$. Legal adversaries are ensured by a projection $\pi$ onto the legal set $I + S$ with $S = \{\delta : \|\delta\|_p \leq \epsilon\}$. If not mentioned otherwise, PGD attacks focus on the $\ell_\infty$-norm as the distance metric for $D(I, I^{Adv})$, representing an $\ell_\infty$-ball around natural images $I$. The iterative multi-step optimization method is able to converge to local maxima of the non-concave and constrained maximization problem defined in Eq. (2), representing possible worst-case adversaries for the underlying model. By considering random uniform initialization, arbitrary starting points on the corresponding loss surface are ensured, thus resulting in an exploration of potentially varying local maxima and lastly giving rise to the structural behavior of the corresponding loss surface. This renders the PGD attack the “ultimate” first-order adversary, as stated by Madry et al. [28].

Carlini and Wagner: Carlini and Wagner (C&W) [5] presented a targeted attack to refute the promising defensive approach of defensive distillation [35]. The proposed C&W attack emerged as one of the strongest attacks in the literature [1]. C&W finds perturbations $\delta$ with minimal distance $D(I, I + \delta)$ that change the classification of image $I$ to the target class $t$. This is a challenging non-linear optimization problem, and therefore the authors introduce a function $g$ such that $g(I + \delta) = 0$ when the classifier gets fooled towards the target class. The attack constructs adversarial examples which try to minimize the objective in Eq. (5).

$$\min_{\delta}\left(\|\delta\|_p + \epsilon \cdot g(I + \delta)\right), \quad \text{where } g(I) = \left(\max_{j \neq t} Z(I)_j - Z(I)_t\right)^{+} \quad (5)$$

$Z(I)_j$ indicates the output of the CNN for class $j$ before the softmax layer. The minimum condition $g(I) = 0$ occurs when $Z(I)_t \geq Z(I)_j \ \forall j \neq t$. The choice of $\epsilon$ maintains a trade-off between the attacked image similarity and the success rate of the target class. Using the $\ell_2$ distance metric, the objective function is minimized through gradient descent.

DeepFool: With the DeepFool [30] attack, the authors propose a method to generate adversarial examples that fool classifiers on large-scale datasets by estimating the distance of an input instance $I$ to the closest decision boundary. The iterative method estimates the perturbation $\delta_i$ at each iteration $i$ until the classifier $f(I_i)$ changes its prediction ($f(I_i) \neq L$). In practice, once an adversarial perturbation $\delta$ is found, the adversarial example is pushed further beyond the decision boundary. The algorithm is not guaranteed to converge to the optimal perturbation; nevertheless, it generates adversarial examples with good approximations of the minimal perturbation. The size of the calculated perturbation can also be interpreted as a metric for the model's robustness against adversarial attacks [41].
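The PGD update of Eq. (4) and the C&W margin g of Eq. (5) can be sketched in a few lines. This is a minimal illustration under assumptions: the model is abstracted as a user-supplied `grad_fn` returning the loss gradient for an input, the legal set is an ℓ∞-ball realized by element-wise clipping, and the raw gradient is used as written in Eq. (4) (many ℓ∞ implementations step along the gradient sign instead). The function names are hypothetical.

```python
import random

def project_linf(x_adv, x, eps):
    """Projection onto the l_inf ball of radius eps around the clean input x."""
    return [min(max(a, xi - eps), xi + eps) for a, xi in zip(x_adv, x)]

def pgd(x, grad_fn, eps, alpha, steps, seed=0):
    """Iterated gradient ascent with projection, started from uniform noise."""
    rng = random.Random(seed)
    x_adv = [xi + rng.uniform(-eps, eps) for xi in x]
    for _ in range(steps):
        grad = grad_fn(x_adv)
        x_adv = [a + alpha * g for a, g in zip(x_adv, grad)]
        x_adv = project_linf(x_adv, x, eps)
    return x_adv

def cw_margin(logits, target):
    """g(I): positive part of (largest non-target logit - target logit)."""
    best_other = max(z for j, z in enumerate(logits) if j != target)
    return max(best_other - logits[target], 0.0)

# Toy loss x0 - x1 with constant gradient [1, -1]; the maximizer inside the
# eps-ball around the origin is the corner [eps, -eps].
print(pgd([0.0, 0.0], lambda x: [1.0, -1.0], eps=0.3, alpha=0.1, steps=50))
print(cw_margin([2.0, 5.0, 1.0], target=0))
```

The margin is zero exactly when the target logit is at least as large as every other logit, which matches the minimum condition stated for Eq. (5).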
3.2 Black-Box Attacks
LocalSearch: LocalSearch [31] is a simple gradient-free adversarial black-box attack, which is based on random perturbation and a greedy search algorithm around the perturbed pixels. The LocalSearch procedure works in iterations, where each iteration consists of two steps. The first step is to select and evaluate a small subset of points Pi , referred to as the local neighborhood. In the second step, a new solution Pi+1 is selected by taking the evaluation of the previous solution Pi into account. LocalSearch is simple to implement, but is computationally expensive, similar to most greedy search algorithms.
GenAttack: GenAttack [2] is a gradient-free optimization strategy based on a genetic algorithm. The initial population of perturbed image examples is generated by adding uniform random noise. The best individuals survive the generation based on their fitness evaluation, the selection strategy and the crossover and mutation probabilities. Fitness evaluation reflects the optimization objective, while the selection strategy allows elite individuals in the population to generate new children perturbations through crossover and mutation mechanisms. GenAttack is a faster search algorithm when compared to LocalSearch [31], and generates perturbations which are imperceptible to the human eye.
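The genetic search described above can be sketched generically. The code below is a simplified illustration and not the GenAttack reference implementation: it uses truncation selection with elitism instead of fitness-proportional sampling, the black-box model is abstracted as a `fitness` callable that scores a perturbation, and all parameter names are assumptions. The defaults `pop_size=16` and `eps=8` mirror the population size and amplitude reported for the GenAttack experiments in this paper.

```python
import random

def gen_attack(fitness, dim, eps=8.0, pop_size=16, generations=40,
               mutation_prob=0.1, seed=0):
    """Gradient-free search for a perturbation in [-eps, eps]^dim."""
    rng = random.Random(seed)
    # Initial population: uniform random noise within the allowed amplitude.
    population = [[rng.uniform(-eps, eps) for _ in range(dim)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:pop_size // 2]
        children = [ranked[0]]                 # elitism: keep the best as is
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            # Uniform crossover between the two parents.
            child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
            # Mutation: small bounded noise on a random subset of entries.
            for k in range(dim):
                if rng.random() < mutation_prob:
                    noisy = child[k] + 0.1 * rng.uniform(-eps, eps)
                    child[k] = min(max(noisy, -eps), eps)
            children.append(child)
        population = children
    return max(population, key=fitness)

# Toy fitness standing in for "loss of the attacked model".
toy_fitness = lambda delta: sum(delta)
best = gen_attack(toy_fitness, dim=4)
print(toy_fitness(best))
```

Only fitness evaluations are needed, so the search works with query access to a model's outputs; in an actual attack the fitness would be derived from those outputs.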
4 Breaking Binary and Efficient DNNs
Although a successful attack could easily be carried out by adding large perturbations, finding the minimum necessary perturbation is typically desirable in order to perform the attack inconspicuously. This motivates making CNNs particularly robust against the adversarial attacks that are relevant or expected in practice. However, despite the requirement to keep the perturbation as small as possible, the target for training against an attack structure can be to maximize a corresponding loss function. A prior analysis of the robustness of real-world compressed CNNs provides insights which facilitate the realization of strong adversarial defense methods. We evaluate the robustness of CNNs which are trained and evaluated on the CIFAR-10 [23] or ImageNet [37] datasets. The 50K train and 10K test images (32 × 32 pixels) of CIFAR-10 are used to train and evaluate compressed variants of ResNet20/56 [16,19,27,36,40]. The ImageNet dataset consists of ∼1.28M train and 50K validation images (256 × 256 pixels). Compressed variants of ResNet18/50 are trained and evaluated for the ImageNet experiments. If not otherwise mentioned, all hyper-parameters specifying the training and the attacks were adopted from the reference implementation. The robustness evaluation covers various white-box (FGSM, PGD, C&W, DeepFool) and black-box (LocalSearch, GenAttack) attacks on the CIFAR-10-trained ResNet20/56 compressed variants, as well as ImageNet-trained CNNs. We perform all the experiments using the trained statistics for the batch normalization layers.

4.1 CNN Compressed Variants
Table 1 summarizes the compressed CNNs and their full-precision counterparts analyzed in this paper. It shows that the neural networks drastically vary in their memory demand and their compute complexity. Deep learning inference accelerators such as the NVIDIA-T4 GPU [32] or Xilinx FPGAs with DSP48 blocks support SIMD-based bit-wise operations [8]. In particular, a single DSP48 block can perform two 16-bit fixed-point multiplications or 48 XNOR operations at once. The normalized compute complexity (NCC) is defined as the optimal utilization of MAC and XNOR operations in one compute unit. The DSP48 block serves as a reference implementation to compute NCC in Table 1.
Table 1. Accuracy top1 [%], memory demand [MB] and the normalized compute complexity (NCC) of compressed CNNs and their full-precision counterparts.

    Dataset    Model                      Acc. [%]   Memory demand [MB]   NCC [10^6]
    CIFAR-10   ResNet20 [16]              92.46      1.07                 41
               KD-KL [40]                 93.25      1.07                 41
               Ch.Prune [19]              89.76      0.70                 19
               K.Prune [19]               90.73      0.61                 20
               W.Prune [19]               91.98      0.59                 20
               XNOR [36]                  82.71      0.04                 1.3
               ABC(1 × 1) [27]            83.42      0.04                 1.3
               ABC(3 × 3) [27]            88.94      0.12                 8.0
               ABC(5 × 5) [27]            90.64      0.20                 21.3
               ResNet56 [16]              93.88      3.40                 125
               KD-KL [40]                 94.24      3.40                 125
               Ch.Prune [19]              92.86      2.50                 62
               K.Prune [19]               93.04      2.19                 63
               W.Prune [19]               93.54      2.02                 62
               XNOR [36]                  83.24      0.11                 3.0
               ABC(1 × 1) [27]            86.29      0.11                 3.0
               ABC(3 × 3) [27]            92.48      0.33                 24
               ABC(5 × 5) [27]            92.82      0.55                 66
    ImageNet   ResNet50 [16]              75.43      102.01               10216
               ResNet18 [16]              69.00      46.72                1814
               ResNet18-Ch.Prune [19]     67.62      34.52                884
               ResNet18-XNOR [36]         49.10      4.14                 173
               ResNet18-ABC(1 × 1) [27]   51.07      3.48                 153
               ResNet18-ABC(3 × 3) [27]   59.83      6.28                 417

4.2 Evaluation of Robustness
PGD-Evaluation: Considering the PGD attack as the “ultimate” first-order attack, this section experimentally explores the structure of the loss surfaces and the corresponding accuracy deterioration of the proposed efficient DNNs, while exposing the models to PGD adversaries similar to those proposed by Madry et al. [28]. Investigating the resulting structural behavior, especially the loss level to which the PGD attack converges and the speed at which accuracy deteriorates, helps in understanding the adversarial robustness of the underlying models with respect to a defined PGD threat model $\tau_{PGD} = \{\epsilon, \alpha, i\}$. All models are pre-trained on CIFAR-10 without any adversarial examples, to distinguish the influence of varying compression techniques on adversarial robustness. Subsequently, each model is exposed to PGD attacks from $\tau_{PGD} = \{\epsilon = 2, \alpha = 0.5, i = 1000\}$. Following the method of Carlini et al. [4], $i$ was increased to verify convergence, ensuring local maxima that represent potentially worst-case adversarial examples for the underlying model with respect to the applied threat model $\tau_{PGD}$. However, results are only shown up to $i = 100$, since $\tau_{PGD}$ showed convergence for all investigated models in this range. The loss value and the corresponding accuracy of the models under the adversary were tracked every 5th iteration.

Fig. 2. PGD attack accuracy (solid) and loss (dashed) over PGD iterations for compressed variants of ResNet20 (left) and ResNet56 (right) averaged over five reruns of PGD attack. Additionally, the horizontal breaking line (BL - dashed black) visualizes the deterioration of model accuracy below random guessing (≤ 10%) for CIFAR-10. Visual markings are added to categorize models above and below the BL at i = 10.

In the following, the adversarial robustness of a model against PGD attacks is evaluated by (1) the overall loss level the PGD attack converges to and, as a consequence, the resulting accuracy, and (2) the number of iterations a model can sustain until it breaks. We consider a CNN model broken if its accuracy indicates that the classification is random (10% for the CIFAR-10 dataset), represented by model accuracy graphs dropping below the breaking line (BL). Figure 2 shows the mean over five reruns of the PGD attack for all models to exploit random initialization, which ensures random exploration of the underlying non-concave maximization problem as described in Sect. 3.

Consistently, all investigated pruning techniques harm adversarial robustness against the PGD attack relative to the vanilla versions of ResNet20/56, when considering (1) the loss and accuracy after a converged attack and (2) the speed of breaking. Vanilla and pruned versions of ResNet20 break within five iterations, whereas the respective ResNet56 versions break within ten iterations. KD shows greater resilience to the PGD attack since (1) its accuracy after the converged attack is higher compared to both the ResNet20/56 vanilla variants and (2) it breaks at a higher number of iterations. KD-KL breaks at i = 15 for its ResNet20 variant and at i = 40 for its ResNet56 variant.

Binarization can improve the robustness against the defined PGD attack, materializing in (1) the higher loss and accuracy after a converged attack and (2) the greater resilience for a longer period of PGD iterations. XNOR-Net and ABC(5 × 5) break at i = 20, while ABC(3 × 3) and ABC(1 × 1) break at around i = 60 for their ResNet20 variants. For the ResNet56 variants, ABC(1 × 1) and ABC(5 × 5) break at i = 20, whereas ABC(3 × 3) sustains up to i = 40. The ResNet56 variant of XNOR-Net outperforms all other models in (1) accuracy after the converged attack (∼14%) and (2) being the only model that never breaks throughout this experiment (see Fig. 2, right).

Stress-Strain Evaluation: To facilitate the interpretation of the data generated from the experiments, we propose a method for evaluating robustness. Different models such as ResNet20 and ResNet56 have different baseline accuracies, making it difficult to directly compare the robustness of different training or compression schemes. Existing metrics, such as attack success-rate [2] or accuracy degradation, fail to capture the differences in the baseline accuracy of a network. Taking inspiration from the field of mechanics, we use formulas of stress and strain to make an analogy with the robustness of networks before they break. Applying a certain amount of stress on an object causes a certain measure of deformity or strain. We adapt the strain formula to our problem as $\varepsilon = \frac{A - A^{*}}{A}$, where $\varepsilon$ is the strain, $A$ is the accuracy before the attack and $A^{*}$ the deteriorated accuracy. Note that we use $\epsilon$ and $\varepsilon$ to represent perturbation amplitude and strain, respectively. A network which sustains higher strain $\varepsilon$ w.r.t. an attack is less robust. The rate of change in $\varepsilon$ with increased stress indicates the resilience or fragility of the CNN under heavier forms of the same attack. Similar to the different types of mechanical stress (compressive, tensile or shear), iterative and amplitude-based attacks can represent different types of attack-stress $\sigma$. Using $\sigma$ and $\varepsilon$, we can compare the degree of robustness between networks, relative to their base accuracies. We can use inverted stress-strain graphs to better visualize the robustness of networks accordingly. Given the behavior of a network under a certain attack, we can classify its robustness in terms of material properties.
A network that sustains a high attack stress before breaking is a strong network. On the other hand, a network which gradually degrades with increased attack stress is a ductile network. Lastly, a network which breaks before it deforms can be considered a brittle network. Figure 3 shows a set of stress-strain graphs for all the networks and attacks investigated.

Fast Gradient Sign Method: For FGSM attacks, the results show that the KD-KL variant is more resilient than the other compression techniques, as its strain ε increases at a slower rate with intensified attack stress. During training, the distillation is performed using a higher temperature (T = 30). The attack perturbations are generated using the cross-entropy loss with T = 1, resulting in saturated gradients and therefore weakening the attack. Figure 3 shows an interesting effect of increased FGSM stress on the XNOR-Net variant. The robustness of ResNet56-XNOR is higher than that of the other variants under low stress of up to σ = 4. Beyond that point, further attack stress severely harms the robustness of the network, making it the second-worst variant, following ABC(1 × 1). Generally, a boost in robustness is observed when the base CNN is the larger ResNet56 model. This increases their ductility, as they sustain more attack stress before breaking, when compared to the more brittle ResNet20 models. Interestingly, the same does not apply to the binarized ABC models, as they show similar robustness irrespective of being ResNet20 or ResNet56 variants.
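The strain definition ε = (A − A*)/A and the breaking line introduced above translate directly into code. The sketch below is illustrative only: the function names are hypothetical, and the accuracy curve is made-up example data rather than a measurement from the paper (only the 92.46% baseline is taken from Table 1).

```python
def strain(base_acc, attacked_acc):
    """Relative accuracy loss (A - A*) / A caused by an attack."""
    return (base_acc - attacked_acc) / base_acc

def breaking_stress(stresses, accuracies, breaking_line=10.0):
    """First attack stress at which accuracy is at or below the breaking line.

    Returns None if the network never breaks over the tested range.
    """
    for stress, acc in zip(stresses, accuracies):
        if acc <= breaking_line:
            return stress
    return None

# Hypothetical accuracy curve of a CIFAR-10 model under increasing stress.
base_accuracy = 92.46
stresses = [0, 5, 10, 15]
accuracies = [92.46, 46.23, 9.5, 3.0]

curve = [strain(base_accuracy, acc) for acc in accuracies]
print(curve)
print(breaking_stress(stresses, accuracies))
```

Because the strain is normalized by the baseline accuracy, curves of networks with different clean accuracies, such as ResNet20 and ResNet56 variants, can be compared on the same axis.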
[Figure 3: twelve stress-strain panels, one per attack for ResNet20 and for ResNet56: FGSM; PGD (fixed i = 3); DeepFool; LocalSearch (fixed ϵ = 16); GenAttack (ϵ = 8, N = 16); CW (fixed strength 1). Horizontal axis: attack stress (σ); vertical axis: strain (ε), from 0 to 1. Curves: Vanilla, Ch.Prune, K.Prune, W.Prune, KD-KL, XNOR, ABC(1 × 1), ABC(3 × 3), ABC(5 × 5). Annotated behaviors: strong, ductile, brittle.]
Fig. 3. Stress-strain graphs for various attacks on compressed variants of ResNet20 (top) and ResNet56 (bottom).
[Figure 4: box-plot panels for FGSM, PGD, CW, DeepFool, LocalSearch and GenAttack. Axis: accuracy after attack, from 0 to 100. One box per variant: Vanilla, Ch.Prune, K.Prune, W.Prune, KD-KL, XNOR, ABC(1 × 1), ABC(3 × 3), ABC(5 × 5). Annotations: strong, brittle, ductile.]
Fig. 4. Box-plots for attacks on compressed variants of ResNet20 and ResNet56.
Projected Gradient Descent: For PGD, increased attack stress can be interpreted as a higher perturbation amplitude ϵ or more iterations i. Figure 3 shows the case where the attack stress is the amplitude, σ = ϵ, with the iterations fixed to i = 3. The CNNs show various characteristics for this attack hyper-parameter setting. We observe that the KD-KL and XNOR variants of ResNet56 have a lower slope than the other compressed CNNs, indicating ductile behavior.

Carlini & Wagner: For the C&W method, we set the attack stress σ to the number of search iterations, with the attack strength fixed to 1 (see Eq. 5). The results show the strength of this method, rendering all our networks brittle. This is characterized by the steep ascent in strain, breaking all CNNs with minimal attack stress.

DeepFool: Similar to the C&W attack, DeepFool renders most of the considered CNNs brittle. One exception is ResNet56-XNOR, which can sustain some amount of stress before completely breaking. It is worth noting that the other binary CNNs do not perform as well as ResNet56-XNOR in this case.

LocalSearch: The LocalSearch attack can also offer two types of stress: amplitude and iterations. Fig. 3 shows the stress-strain curves for a fixed amplitude of ϵ = 16. For this amplitude, none of the networks completely break, even after 200 iterations of the attack. However, it is worth noting that binarized CNNs outperform the full-precision variants for both ResNet20 and ResNet56 experiments.

GenAttack: For GenAttack, we take the number of generations i as the measure of attack stress, and fix the amplitude ϵ = 8 and the population N = 16. In Fig. 3, a clear difference between the robustness of BNNs and other variants is observed. We can classify BNNs as strong against GenAttack, and all other variants as brittle.
Box-Plots: In Fig. 4, we present box-plots from data collected over a range of experiments. For each attack, we sweep over the respective strength and iterations mentioned in Table 2. The exact definition of strength and iteration for each attack can be recalled from Sect. 3. The data includes both models, ResNet20 and ResNet56.

Table 2. All strength and iteration combinations tested for ResNet20 and ResNet56 variants (vanilla, pruned, binary, and distilled). Strength and iteration definitions for each attack are explained in Sect. 3.

Attack         Strength                     Iterations i
FGSM           2, 4, 8, 16                  N/A
PGD            0.1, 0.5, 1.0, 2.0           2, 3, 4, 5
CW             0.01, 0.1, 1.0, 5.0, 10.0    1, 10, 20, 50
DeepFool       N/A                          1, 5, 10, 20
Local search   8, 16, 32                    50, 100, 150, 200
GenAttack      8, 12                        50, 100, 150, 200 (popsize = 6, 16)
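The sweep behind the box-plots can be sketched as follows. This is a minimal illustration that assumes the full cross-product of the Table 2 values is evaluated for each attack; the dictionary layout and the helper names `configurations` and `box_stats` are hypothetical, and the quartile interpolation is one common convention rather than the one confirmed by the authors.

```python
from itertools import product
from statistics import quantiles

# Hyper-parameter values copied from Table 2.
SWEEP = {
    "FGSM": {"strength": [2, 4, 8, 16]},
    "PGD": {"strength": [0.1, 0.5, 1.0, 2.0], "iterations": [2, 3, 4, 5]},
    "CW": {"strength": [0.01, 0.1, 1.0, 5.0, 10.0],
           "iterations": [1, 10, 20, 50]},
    "DeepFool": {"iterations": [1, 5, 10, 20]},
    "LocalSearch": {"strength": [8, 16, 32],
                    "iterations": [50, 100, 150, 200]},
    "GenAttack": {"strength": [8, 12], "iterations": [50, 100, 150, 200],
                  "popsize": [6, 16]},
}


def configurations(attack):
    """Enumerate every hyper-parameter combination for one attack."""
    params = SWEEP[attack]
    names = list(params)
    return [dict(zip(names, values))
            for values in product(*(params[n] for n in names))]


def box_stats(accuracies):
    """Five-number summary (min, Q1, median, Q3, max) drawn by a box-plot."""
    q1, median, q3 = quantiles(accuracies, n=4, method="inclusive")
    return min(accuracies), q1, median, q3, max(accuracies)


if __name__ == "__main__":
    for attack in SWEEP:
        print(attack, len(configurations(attack)))
```

Each box in Fig. 4 would then summarize the accuracies one compression variant reaches over all configurations of one attack, collected for both ResNet20 and ResNet56.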
Each plot shows the distribution of all the accuracies achieved by the compression technique, after being attacked by the corresponding method, over all the considered strengths and iterations, as well as their combinations. The box-plots reveal the strength of BNNs against both black-box attacks (GenAttack and LocalSearch), when compared to other variants. Different compression techniques produce different distributions for the PGD attack (marked in Fig. 4). CW proves to be the strongest adversarial attack scheme across all the compressed variants.

4.3 Class Activation Mapping on Attacked CNNs
We use class activation maps (CAM) [44] to determine the region of interest (RoI) for the prediction class using clean and attacked images. The output feature maps of the last convolutional layer and the weight tensor of the fully-connected layer are the inputs to the CAM. The CAM highlights regions of the image that influence the CNN’s prediction of a specific class. Similar to heat maps, red regions indicate those with the highest contribution, while blue indicates the ones with the least. We applied CAM to various compressed variants of ResNet20 and ResNet56, trained on CIFAR-10, which are attacked by DeepFool (Table 3). As mentioned in Sect. 3, DeepFool attempts to find the adversarial perturbation which leads the CNN to the closest decision boundary. Once a perturbation is found, it is reinforced to push the prediction beyond that boundary. Through the CAM visualizations in this section, we attempt to capture this behavior over the attack iterations.
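The CAM computation described above can be sketched in a few lines. Following the formulation of [44], the map for class c is the sum of the last convolutional feature maps weighted by that class's fully-connected weights, which assumes a global average pooling layer between the two. The nested-list data layout, the function names and the min-max scaling to the (0, 255) range are assumptions made for this illustration.

```python
def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weighted sum of the K feature maps (each H x W) for one class.

    feature_maps: K nested lists of shape H x W (last conv layer output).
    fc_weights:   per-class weight vectors of length K (fully-connected layer).
    """
    weights = fc_weights[class_idx]
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for w_k, f_k in zip(weights, feature_maps):
        for y in range(height):
            for x in range(width):
                cam[y][x] += w_k * f_k[y][x]
    return cam


def to_intensity(cam, max_value=255):
    """Min-max scale a CAM to integer intensities (0 = blue, max = red)."""
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    span = hi - lo
    if span == 0:
        return [[0 for _ in row] for row in cam]
    return [[round((v - lo) / span * max_value) for v in row] for row in cam]


if __name__ == "__main__":
    # Two hypothetical 2 x 2 feature maps and weights for two classes.
    maps = [[[1, 0], [0, 0]], [[0, 0], [0, 2]]]
    weights = [[1.0, 0.5], [-1.0, 1.0]]
    print(to_intensity(class_activation_map(maps, weights, class_idx=1)))
```

In practice the low-resolution map is upsampled to the input image size before it is overlaid as a heat map; that step is omitted here.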
Table 3. CAM for ResNet20/56 and their compressed variants, computed on non-attacked and DeepFool-attacked versions of the automobile image from the CIFAR-10 dataset.

[Image table. Column groups: Model; Image → I Adv; Vanilla; Pruned (Ch./F., Kernel, Weight); Distilled (KD-KL); Binary (XNOR, ABC(1×1), ABC(3×3), ABC(5×5)). Rows, for each of ResNet20 - CIFAR10 and ResNet56 - CIFAR10: No AA, i = 1, i = 5.]
All the compression techniques classify the automobile example correctly when given the unattacked raw image in Table 3. Three interpretations can be made from the heat maps, and we support them with a quantitative analysis by measuring the third quartile of the heat map intensity across all pixels. First, observing the CAM output of ResNet56’s vanilla and channel-pruned variants for the unattacked input image, the RoI consists of large, focused interest regions. For an intensity range of (0, 255), blue → red, the third quartile of the heat map intensity across all pixels is 184 and 162 for the vanilla and channel-pruned variants respectively, indicating a large RoI. Second, the intensity of the interest regions decreases after the attack is applied for one iteration. The third quartile decreases to (171, 152), indicating weaker interest regions. Third, after the attack is applied for five iterations, the focus on the attacked region (bonnet) is reinforced to push the prediction towards the nearest class (truck). The third quartile further decreases to (135, 121). Under DeepFool attacks, ResNet56 is more robust than ResNet20, which is illustrated by the more distinct RoIs in the heat maps. The BNN variants have a small RoI compared to their vanilla model for unattacked images; the third quartile for ResNet56-XNOR is 98, which reflects this. As the inherent RoI of BNNs is small and concentrated, the attack may be less likely to find and perturb the smaller set of critical pixels.
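The third-quartile measurement used in the analysis above can be sketched as follows. The heat map is assumed to be a nested list of intensities in the (0, 255) range, and the linear ("inclusive") interpolation between pixels is an assumption; the text does not state which quartile convention was used.

```python
from statistics import quantiles


def third_quartile(heat_map):
    """Third quartile (75th percentile) of heat map intensity over all pixels."""
    pixels = [value for row in heat_map for value in row]
    # quantiles(..., n=4) returns [Q1, median, Q3]; index 2 is Q3.
    return quantiles(pixels, n=4, method="inclusive")[2]


def relative_drop(before, after):
    """Fractional decrease of the third quartile caused by the attack."""
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical 2 x 2 heat map.
    print(third_quartile([[0, 50], [100, 200]]))
    # Vanilla ResNet56 values reported in the text: unattacked vs. i = 5.
    print(relative_drop(184, 135))
```

A single scalar per heat map makes the shrinking of the RoI over the attack iterations comparable across compression variants.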
Table 4. Accuracy (Top1) [%] of CNNs after FGSM adversarial attacks for ImageNet.

FGSM                        Nat. Acc   ϵ = 2   ϵ = 4   ϵ = 8   ϵ = 16
ResNet50 [16]               75.43 %    22.18   16.24   12.08   7.46
ResNet18 [16]               69.00 %    12.82   8.16    5.19    2.95
ResNet18-Ch.Prune [19]      67.62 %    11.18   6.64    3.99    2.34
ResNet18-XNOR [36]          49.10 %    7.57    4.54    2.19    0.93
ResNet18-ABC(1 × 1) [27]    51.07 %    9.11    4.65    2.30    1.13
ResNet18-ABC(3 × 3) [27]    59.83 %    11.33   5.73    2.65    1.43
Table 5. Accuracy [%] of CNNs after PGD adversarial attacks for ImageNet.

PGD                                  ϵ     i = 2   i = 3   i = 4   i = 5
ResNet50 [16] (75.43 %)              0.1   25.77   16.07   9.83    5.91
ResNet50 [16] (75.43 %)              0.5   3.35    0.94    0.43    0.27
ResNet18 [16] (69.00 %)              0.1   17.86   10.32   5.58    3.11
ResNet18 [16] (69.00 %)              0.5   1.33    0.17    0.04    0.01
ResNet18-Ch.Prune [19] (67.62 %)     0.1   17.02   10.23   5.92    3.50
ResNet18-Ch.Prune [19] (67.62 %)     0.5   1.40    0.27    0.06    0.02
ResNet18-XNOR [36] (49.10 %)         0.1   13.16   11.46   10.06   8.84
ResNet18-XNOR [36] (49.10 %)         0.5   5.67    3.07    1.57    0.78
ResNet18-ABC(1 × 1) [27] (51.91)     0.1   18.35   16.22   14.20   12.37
ResNet18-ABC(1 × 1) [27] (51.91)     0.5   7.60    3.64    1.75    0.82
ResNet18-ABC(3 × 3) [27] (59.83)     0.1   23.90   20.81   17.80   15.07
ResNet18-ABC(3 × 3) [27] (59.83)     0.5   8.31    3.70    1.59    0.66

4.4 Robustness Evaluation on ImageNet Dataset
For the robustness evaluation on the ImageNet dataset [37], we use pre-trained ResNet50 and ResNet18 models, and compressed variants of ResNet18. We observe a higher attack search time for ImageNet compared to the CIFAR-10 dataset due to the larger image sizes and model complexity. Therefore, we limit our analysis to two white-box attacks (FGSM and PGD) and one black-box attack (GenAttack). We consider the compressed variants Ch-Prune, XNOR, ABC(1 × 1) and ABC(3 × 3), as specified in Tables 4 to 6.

Fast Gradient Sign Method: In Table 4, we report the natural accuracy and the attacked accuracy for different strengths (ϵ = {2, 4, 8, 16}). ResNet50 achieves the highest natural accuracy and the highest attacked accuracy at every strength. Among the compressed variants, the channel-pruned and ABC(3 × 3) models show slightly higher robustness at the different strengths.
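Because the ImageNet models start from very different natural accuracies, the attacked accuracies in Table 4 can also be read relative to each model's baseline. The sketch below computes the retained fraction A∗/A (equivalently 1 − ε in the stress-strain notation) for the ϵ = 2 column; the values are copied from Table 4, while the helper name `retained` and the choice of column are assumptions made for illustration.

```python
# (natural accuracy, attacked accuracy at strength 2), in percent, from Table 4.
TABLE4_EPS2 = {
    "ResNet50": (75.43, 22.18),
    "ResNet18": (69.00, 12.82),
    "ResNet18-Ch.Prune": (67.62, 11.18),
    "ResNet18-XNOR": (49.10, 7.57),
    "ResNet18-ABC(1 x 1)": (51.07, 9.11),
    "ResNet18-ABC(3 x 3)": (59.83, 11.33),
}


def retained(natural_acc, attacked_acc):
    """Fraction of the natural accuracy that survives the attack (A*/A)."""
    return attacked_acc / natural_acc


if __name__ == "__main__":
    ranking = sorted(TABLE4_EPS2.items(),
                     key=lambda item: retained(*item[1]), reverse=True)
    for name, (nat, att) in ranking:
        print(f"{name:22s} {retained(nat, att):.3f}")
```

Sorting by the retained fraction rather than by the absolute attacked accuracy compensates for the lower starting point of the binarized models.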
Projected Gradient Descent: In Table 5, we report the attacked accuracy for two strengths (ϵ = 0.1 and ϵ = 0.5). The attacked accuracy decreases for all the models as we increase the number of iterations i. At i = 5 and ϵ = 0.1, the binarized ResNet18 using ABC(3 × 3) retains an attacked accuracy that is 9.16 percentage points higher than that of the ResNet50 model. At the higher attack strength ϵ = 0.5, the prediction accuracy degrades for all the compressed variants.

Table 6. Accuracy (Top1) [%] of CNNs after GenAttack adversarial attacks for ImageNet. Pop Size = 6. Each cell reports OA/TA.

GenAttack                            ϵ      i = 200       i = 400       i = 600       i = 800       i = 1000
ResNet50 [16] (75.43 %)              8.0    21.29/12.80   11.64/34.46   6.87/51.94    4.67/64.08    3.06/72.82
ResNet50 [16] (75.43 %)              12.0   13.16/17.45   5.67/41.19    3.55/56.65    2.40/67.29    1.60/74.58
ResNet18 [16] (69.00 %)              8.0    16.41/14.52   8.11/41.83    4.35/62.58    2.36/75.62    1.34/83.29
ResNet18 [16] (69.00 %)              12.0   10.24/22.44   5.13/50.74    2.70/68.85    1.58/80.21    1.04/86.62
ResNet18-Ch.Prune [19] (67.62 %)     8.0    12.34/12.82   6.05/39.02    3.17/60.46    2.00/74.46    1.22/82.79
ResNet18-Ch.Prune [19] (67.62 %)     12.0   7.33/20.25    3.29/49.44    1.84/68.97    1.08/80.11    0.88/86.80
ResNet18-XNOR [36] (49.10 %)         8.0    13.06/0.64    12.86/0.72    12.64/0.84    12.68/0.86    12.68/0.94
ResNet18-XNOR [36] (49.10 %)         12.0   11.56/0.78    11.14/0.92    11.14/1.04    11.04/1.16    10.82/1.22
ResNet18-ABC(1 × 1) [27] (51.07 %)   8.0    17.59/1.48    17.67/1.62    17.37/1.76    17.23/1.88    16.89/1.98
ResNet18-ABC(1 × 1) [27] (51.07 %)   12.0   15.83/1.90    15.40/2.08    15.20/2.26    15.02/2.34    14.86/2.52
ResNet18-ABC(3 × 3) [27] (59.83 %)   8.0    26.00/0.68    25.02/0.82    25.26/0.92    25.46/0.98    25.58/0.96
ResNet18-ABC(3 × 3) [27] (59.83 %)   12.0   22.50/0.74    22.04/0.94    22.36/1.02    21.75/1.08    21.90/1.14

OA/TA = Accuracy to original label/Accuracy to target label.
GenAttack: We set an adaptive mutation rate ρ and mutation range α for GenAttack based on the dataset configuration and set the population size to 6, as in [2]. In Table 6, we report the overall attacked accuracy and the accuracy w.r.t. the fooled target class at several iterations during the attack search (i = {200, 400, 600, 800, 1000}). We also analyze the robustness for two attack strengths (ϵ = 8, 12). Similar to previous observations, ABC models show higher robustness with respect to their unattacked accuracy, when compared to other compressed variants and the vanilla ResNet50 and ResNet18 models.

4.5 Discussion
The robustness of distilled models can be attributed to their soft-label training, which can be more informative than hard labels alone. The student is ideally able to learn both the correct classification and the distribution of closeness among the other classes. Furthermore, the student is distilled using a high temperature factor T, causing the magnitude of the predicted class output to be T times larger than when trained on hard labels [5]. Thus, white-box attacks like FGSM, PGD and DeepFool would require a strong adversarial perturbation to fool the final prediction towards its nearest class. However, the C&W attack is able to fool the distilled model, even at higher temperatures, as the attack is not focused on the cross-entropy loss directly.

The training scheme for BNNs is not as simple as that of vanilla or pruned models. It requires a straight-through estimator, which makes white-box attacks more challenging than for the other variants. Introducing multiple scaling factors in the case of ABC-Net eases the approximation to its full-precision model. Thus, XNOR-Nets appear to be more resilient against white-box attacks (Fig. 3, Fig. 4). Moreover, the PGD loss levels in Fig. 2 demonstrate the robustness of XNOR-Net through lower loss convergence values and breaking speed. The discretization of weights and activations also makes BNNs stronger against black-box attacks. The CAM results support the robustness of BNNs, as they inherently possess a smaller and more concentrated RoI, reducing the chances of finding and perturbing the critical set of pixels. The BNN robustness is also observed for the ImageNet dataset when attacked with PGD and GenAttack (Table 5, Table 6).

Pruning is the process of eliminating unused and/or redundant parameters. Here, balancing the compression rate and the accuracy is a key factor. Due to the reduced learning ability, pruned models are not automatically more robust than their full-precision counterparts. This would call for an extra objective function for improving the robustness. Existing works have shown that it is possible to remove more model parameters when pruning is applied in an unstructured manner [15]. A similar behavior can be expected if the robustness is included in the pruning and fine-tuning process.
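The gradient-saturation argument for distilled models can be illustrated numerically. In this sketch the gradient of the cross-entropy loss with respect to the logits is (softmax(z/T) − y)/T, and the logit values are hypothetical; they only mimic the large magnitudes that training at a high temperature tends to produce. An attack that evaluates the loss at T = 1 then sees an almost vanishing gradient.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_grad(logits, label, temperature=1.0):
    """Gradient of cross-entropy w.r.t. the logits: (softmax(z/T) - y) / T."""
    probs = softmax(logits, temperature)
    return [(p - (1.0 if k == label else 0.0)) / temperature
            for k, p in enumerate(probs)]


if __name__ == "__main__":
    # Hypothetical logits of a student distilled at T = 30.
    logits = [60.0, 0.0, -30.0]
    for t in (1.0, 30.0):
        grad = cross_entropy_grad(logits, label=0, temperature=t)
        print(t, max(abs(g) for g in grad))
```

With a saturated logit gradient, the input gradient used by FGSM and PGD is correspondingly small, which is consistent with the text's observation that the C&W attack, which is not focused on the cross-entropy loss directly, is less affected.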
5 Conclusion
In this paper, we provided a comprehensive analysis on recent white-box and black-box adversarial attacks against state-of-the-art vanilla, distilled, pruned and binary neural networks. We demonstrated that the robustness of CNNs not only depends on the adversarial attack but also on the compression technique at hand. By varying the attacks’ hyper-parameters, strong, ductile and brittle CNNs were identified. Conclusions were made on robustness by analyzing PGD loss/accuracy levels, box-plots, stress-strain graphs and CNN heat maps with CAM. From the presented data, we show that knowledge about the expected adversarial attack or the used compression technique can help the designer or the attacker generate more robust applications or stronger attacks, respectively.
References
1. Akhtar, N., Mian, A.S.: Threat of adversarial attacks on deep learning in computer vision: a survey. IEEE Access 6, 14410–14430 (2018)
2. Alzantot, M., Sharma, Y., Chakraborty, S., Zhang, H., Hsieh, C.J., Srivastava, M.B.: GenAttack: practical black-box attacks with gradient-free optimization. In: ACM Genetic and Evolutionary Computation Conference (GECCO), pp. 1111–1119. Association for Computing Machinery, New York (2019)
3. Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., Frossard, P.: Universal adversarial perturbations. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 86–94, July 2017
4. Carlini, N., et al.: On evaluating adversarial robustness. CoRR, abs/1902.06705 (2019)
5. Carlini, N., Wagner, D.A.: Towards evaluating the robustness of neural networks. In: IEEE Symposium on Security and Privacy (SP), pp. 39–57, May 2017
6. Chen, P.-Y., Zhang, H., Sharma, Y., Yi, J., Hsieh, C.-J.: ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models, pp. 15–26, November 2017
7. Courbariaux, M., Bengio, Y., David, J.P.: BinaryConnect: training deep neural networks with binary weights during propagations. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NeurIPS), pp. 3123–3131. Curran Associates Inc. (2015)
8. Fasfous, N., Vemparala, M.R., Frickenstein, A., Stechele, W.: OrthrusPE: runtime reconfigurable processing elements for binary neural networks. In: 2020 Design, Automation Test in Europe Conference Exhibition (DATE), pp. 1662–1667 (2020)
9. Frickenstein, A., Rohit Vemparala, M., Unger, C., Ayar, F., Stechele, W.: DSC: Dense-sparse convolution for vectorized inference of convolutional neural networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W), June 2019
10. Galloway, A., Taylor, G.W., Moussa, M.: Attacking binarized neural networks. In: International Conference on Learning Representations (2018)
11. Goldblum, M., Fowl, L., Feizi, S., Goldstein, T.: Adversarially robust distillation. In: AAAI (2020)
12. Guo, M., Yang, Y., Xu, R., Liu, Z., Lin, D.: When NAS meets robustness: in search of robust architectures against adversarial attacks (2019)
13. Guo, Y., Yao, A., Chen, Y.: Dynamic network surgery for efficient DNNs. In: Advances in Neural Information Processing Systems (NeurIPS) (2016)
14. Han, B., et al.: Co-teaching: robust training of deep neural networks with extremely noisy labels. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 8527–8537. Curran Associates Inc. (2018)
15. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NeurIPS), pp. 1135–1143. Curran Associates Inc. (2015)
16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, June 2016
17. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: IEEE International Conference on Computer Vision (ICCV), pp. 1398–1406, October 2017
18. He, Y., Liu, P., Wang, Z., et al.: Filter pruning via geometric median for deep convolutional neural networks acceleration. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
19. He, Y., Lin, J., Liu, Z., Wang, H., Li, L.-J., Han, S.: AMC: AutoML for model compression and acceleration on mobile devices. In: The European Conference on Computer Vision (ECCV) (2018)
20. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network (2015)
21. Hubara, I., Courbariaux, M., Soudry, D., El-Yaniv, R., Bengio, Y.: Binarized neural networks. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NeurIPS), pp. 4107–4115. Curran Associates Inc. (2016)
22. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (ICLR) (2015)
23. Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images. University of Toronto (2009)
24. Kurakin, A., Goodfellow, I.J., Bengio, S.: Adversarial Machine Learning at Scale. abs/1611.01236 (2016)
25. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Systems (NeurIPS), pp. 598–605. Morgan-Kaufmann (1990)
26. Lin, J., Gan, C., Han, S.: Defensive quantization: when efficiency meets robustness. In: International Conference on Learning Representations (2019)
27. Lin, X., Zhao, C., Pan, W.: Towards accurate binary convolutional neural network. In: Guyon, I., et al. (eds.) Advances in Neural Information Processing Systems (NeurIPS), pp. 345–353. Curran Associates Inc. (2017)
28. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. ArXiv, abs/1706.06083 (2018)
29. Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I.J.: Adversarial autoencoders. In: International Conference on Learning Representations Workshop (ICLR-W) (2016)
30. Moosavi-Dezfooli, S.-M., Fawzi, A., Frossard, P.: DeepFool: a simple and accurate method to fool deep neural networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582 (2015)
31. Narodytska, N., Kasiviswanathan, S.P.: Simple black-box adversarial attacks on deep neural networks. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR-W), pp. 1310–1318, July 2017
32. NVIDIA. NVIDIA Turing GPU architecture (2017). https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf. Accessed 28 Feb 2020
33. Papernot, N., McDaniel, P.: Extending defensive distillation. ArXiv, abs/1705.05264 (2017)
34. Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z.B., Swami, A.: Practical black-box attacks against machine learning. In: Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519. Association for Computing Machinery, New York (2017)
35. Papernot, N., McDaniel, P.D., Wu, X., Jha, S., Swami, A.: Distillation as a defense to adversarial perturbations against deep neural networks. In: IEEE Symposium on Security and Privacy (SP), pp. 582–597, May 2016
36. Rastegari, M., Ordonez, V., Redmon, J., Farhadi, A.: XNOR-Net: ImageNet classification using binary convolutional neural networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) The European Conference on Computer Vision (ECCV), pp. 525–542. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_32
37. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115(3), 211–252 (2015)
38. Shafahi, A., et al.: Adversarial training for free! In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems (NeurIPS), pp. 3358–3369. Curran Associates Inc. (2019)
39. Szegedy, C., et al.: Intriguing properties of neural networks. Presented at the (2014)
40. Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. Presented at the (2020)
41. Wiyatno, R.R., Xu, A., Dia, O., de Berker, A.: Adversarial examples in modern machine learning: a review, November 2019
42. Yang, T., Chen, Y., Sze, V.: Designing energy-efficient convolutional neural networks using energy-aware pruning. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6071–6079, July 2017
43. Zhang, T., et al.: StructADMM: a systematic, high-efficiency framework of structured weight pruning for DNNs (2018)
44. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929, June 2016
Parallel Dilated CNN for Detecting and Classifying Defects in Surface Steel Strips in Real-Time
Khaled R. Ahmed
School of Computing, Southern Illinois University Carbondale, Illinois, USA
Abstract. Automatic defect inspection and classification is of great importance for improving quality in the steel industry. This paper proposes and develops DSTEELNet, a convolutional neural network (CNN) architecture that improves the detection accuracy and reduces the time required to detect defects in surface steel strips. DSTEELNet includes three parallel stacks of convolution blocks. Each convolution block uses dilated convolution, which expands the receptive fields and increases the feature resolution. The experimental results indicate significant improvements in accuracy and show that DSTEELNet achieves 97% mAP in detecting defects in surface steel strips on the NEU dataset and is able to detect defects in a single image in 22 ms.
Keywords: Computer vision · Defect detection · Defect classification · Parallel processing · Convolutional neural network
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 168–183, 2022. https://doi.org/10.1007/978-3-030-82193-7_11

1 Introduction
Quality is an important competitive factor for the success of the steel industry [1–3]. Surface defect detection is an important part of steel production and has a significant impact on product quality. Manual defect detection methods are time-consuming and subject to human error and hazards. To overcome the shortcomings of manual inspection, traditional automatic surface defect detection methods were proposed. These include eddy current testing, infrared detection, magnetic flux leakage detection, and laser detection. These methods are not able to detect all faults, especially tiny ones [4]. This has motivated many researchers [5–8] to develop computer vision systems capable of classifying and detecting defects in ceramic tiles [5], textile fabrics [9] and the steel industry [7–10]. Structure-based methods extract image structure features such as texture, skeleton and edge, while other methods extract statistical features such as mean, difference and variance from the defect surface and then apply machine learning algorithms to learn these features and recognize defective surfaces [11, 13]. The combination of statistical features and machine learning achieves higher accuracy and robustness than structure-based methods [44]. However, a machine learning classifier such as a Support Vector Machine (SVM) may take about 0.239 s to extract features from a single defect image during testing [12], and therefore fails to meet real-time surface defect detection requirements. In contrast, convolutional neural networks (CNNs) provide automated feature extraction: they take raw defect images, predict surface defects in a short time, and lessen the need to manually extract suitable features.

The main objective of this research is to enhance the accuracy of defect detection in steel strips and to achieve strong generalization. To this end, this paper proposes a modular deep CNN-based architecture, DSTEELNet, for detecting and classifying defects in surface steel strips using traditional convolution and dilated convolution. Dilated convolution is able to capture more distinctive features by expanding the receptive field [36]. The main contributions of this paper are as follows:
• We designed and developed a novel framework called DSTEELNet that detects and classifies defects in surface steel strips. To enhance the detection accuracy significantly, the proposed CNN includes three parallel stacks of convolution blocks, as shown in Fig. 5. They are able to capture and propagate important features in parallel. Each convolution block uses different dilation rates.
• We evaluated the proposed architecture against traditional CNN architectures to highlight the effectiveness of DSTEELNet in detecting and classifying defects in surface steel strips. The trained model improves the product quality of steel strips, since it is able to accurately detect and classify defective regions.
• We developed a deep convolutional generative adversarial network (DCGAN) to extend the size of the NEU dataset.
The rest of this paper is organized as follows. Section 2 reviews the related work. The dataset and the traditional and neural augmentation techniques are described in Sect. 3. Section 4 presents the details of the proposed DSTEELNet architecture. Section 5 discusses the experimental setup and results. Section 6 concludes this paper and provides future research directions.
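To make the effect of dilation on the receptive field concrete, the following sketch uses the standard relation k_eff = k + (k − 1)(d − 1) for a kernel of size k and dilation rate d, together with a one-dimensional dilated convolution. The layer configuration in the example is hypothetical; the dilation rates actually used by DSTEELNet are not stated in this section.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers):
    """Receptive field of stacked layers given as (kernel, dilation, stride)."""
    field, jump = 1, 1
    for kernel_size, dilation, stride in layers:
        field += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return field


def dilated_conv1d(signal, kernel, dilation):
    """Valid 1-D cross-correlation with gaps of `dilation` between taps."""
    span = (len(kernel) - 1) * dilation
    return [sum(kernel[j] * signal[i + j * dilation]
                for j in range(len(kernel)))
            for i in range(len(signal) - span)]


if __name__ == "__main__":
    # Hypothetical stack of three 3-tap layers with dilation rates 1, 2 and 4.
    print(receptive_field([(3, 1, 1), (3, 2, 1), (3, 4, 1)]))
    print(dilated_conv1d([1, 2, 3, 4, 5, 6], [1, 0, -1], dilation=2))
```

The dilated stack covers a wider input span with the same number of weights as an undilated one, which is the property the parallel stacks with different dilation rates rely on.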
2 Related Work
Many studies have investigated machine vision techniques for surface defect detection. They fall mainly into two categories: traditional image processing methods and machine learning methods. Traditional image processing methods detect and segment defects by using the primitive attributes reflected by local anomalies. They detect various defects with feature extraction techniques that are categorized into four approaches [15]: structural methods [16, 17], threshold methods [18–20], spectral methods [21–23], and model-based methods [24, 25]. Traditional image processing methods need multiple thresholds to detect various defects and are very sensitive to background colors and lighting conditions. These thresholds need to be adjusted to handle different defects. Traditional algorithms also rely on handcrafted features, whose manual extraction requires considerable manpower [14].

Machine learning-based methods typically include two stages: feature extraction and pattern classification. The first stage analyzes the characteristics of the input image and produces a feature vector describing the defect information. These features include grayscale statistical features [28], local binary pattern (LBP) features [3, 26], histogram of oriented gradient (HOG) features [27], and the gray level co-occurrence matrix (GLCM) [28]. Some research efforts have sped up the feature extraction process by running it in parallel on GPUs, as in our previous work [13]. The second stage feeds the feature vector into a classifier model that is trained in advance to detect whether the input image has a defect or not [43]. Under complex conditions, handcrafted features or shallow learning techniques are not sufficiently discriminative. Therefore, these machine learning-based methods are typically dedicated to a specific scenario and lack adaptability and robustness.

Recently, neural network methods have achieved excellent results in many computer vision applications, and convolutional neural networks (CNNs) have been used to develop several defect detection methods. Some CNN-based research efforts classify the defects in steel images: the authors of [10] demonstrate that using a sequential CNN to extract features improves the classification accuracy in defect inspection. The authors of [29] developed a multi-scale pyramidal pooling network for the classification of steel defects. The authors of [30] developed a flexible multi-layered deep feature extraction framework. Both research works succeed in classifying defects, but they fail to localize them. Therefore, researchers convert the surface defect detection task into an object detection problem in computer vision in order to localize defects, as in [42]. A simple and direct method first locates the defect and then classifies it. The authors of [42] developed a defect detection network (DDN) that integrates ResNet [44] and a region proposal network (RPN) for precise defect detection and localization. In addition, they proposed a multilevel-feature fusion network that combines low-level and high-level features. In other words, the inspection task classifies regions of defects instead of a whole defect image. The research work in [31] employed a traditional CNN with a sliding window to localize the defect. In [32], the authors developed a structural defect detection method based on Faster R-CNN [33] that succeeds in detecting five types of surface defects: concrete cracks, steel corrosion, steel delamination, and bolt corrosion. In [34], the authors developed a cascaded autoencoder (CASAE). In the first stage, they localize and extract the features of the defect from the input image. In the second stage, they use a compact CNN to accurately classify the defects.

Deep learning techniques facilitate quality assurance in manufacturing, but they require large datasets to avoid overfitting. Annotating the data collected from manufacturing lines can be a time-consuming task, and there has been recent interest in the research community in mitigating this issue. The next section illustrates the use of data augmentation based on traditional techniques and neural networks to enlarge the NEU dataset.
3 Dataset and Augmentation

This section introduces the dataset and the expansion techniques in detail to facilitate the training of the proposed model. In our experiment, the NEU dataset [3] is used. Originally, the NEU dataset has 1,800 grayscale steel images and includes six types of defects, as shown in Fig. 1. The defect types are crazing, inclusion, patches, pitted surface, scratches, and rolled-in scale, with 300 samples for each type. To annotate the dataset, each defect that appears in the defective images is marked by a red bounding box (groundtruth box), as shown in Fig. 1. About 5000 groundtruth boxes have been created. The bounding box
Parallel Dilated CNN for Detecting and Classifying Defects in Real-Time
Fig. 1. Six types of surface steel strips defect
is used to localize defects; it does not represent a defect's borders and cannot describe its shape. To expand the dataset with new samples, a naive approach is simple random oversampling with small geometric transformations, such as an 8° rotation or a horizontal or vertical shift of the image. Other simple image manipulations, such as mixing images, color augmentations, kernel filters, and random erasing, can also be used to oversample data in the same manner as geometric augmentations. This is useful for ease of implementation and for quick experimentation with different class ratios. In this paper, data augmentation is used to increase the size of the dataset by artificially creating different versions of the images from the original training dataset. Table 1 shows the image augmentation parameters used in the training process, such as fill mode, zoom range, and width shift. For example, width shift moves the pixels horizontally, either to the left or to the right at random, to generate transformed images.

Table 1. Image augmentation setting parameters

Parameter         Value
Height Shift      0.08
Width Shift       0.08
Rotation Range    8
Fill mode         Nearest
Zoom Range        0.08
Shear Range       0.3
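The width-shift entry of Table 1 can be illustrated with a small, dependency-free sketch. It assumes that the value 0.08 is a fraction of the image width and that "Nearest" means border pixels are repeated to fill the vacated columns; the function names and the list-of-lists image representation are illustrative choices, not the training pipeline used in this paper.

```python
import random


def width_shift_nearest(image, fraction):
    """Shift every row horizontally by `fraction` of the width.

    Positive fractions move content to the right. Columns that become
    empty are filled with the nearest border pixel ("nearest" fill).
    `image` is a list of rows, each row a list of pixel values.
    """
    width = len(image[0])
    shift = int(round(fraction * width))
    shifted = []
    for row in image:
        new_row = []
        for x in range(width):
            # Clamp the source index so out-of-range reads reuse the border pixel.
            src = min(max(x - shift, 0), width - 1)
            new_row.append(row[src])
        shifted.append(new_row)
    return shifted


def random_width_shift(image, max_fraction=0.08, rng=random):
    """Draw a shift uniformly from [-max_fraction, +max_fraction] and apply it."""
    fraction = rng.uniform(-max_fraction, max_fraction)
    return width_shift_nearest(image, fraction)


if __name__ == "__main__":
    tiny = [[1, 2, 3, 4]]
    print(width_shift_nearest(tiny, 0.25))   # shift right by one pixel
    print(width_shift_nearest(tiny, -0.25))  # shift left by one pixel
```

On a 224 × 224 image, a fraction of 0.08 corresponds to a shift of at most about 18 pixels, so the defect stays inside the frame while its position varies between augmented copies.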
K. R. Ahmed
Fig. 2. Generator adversarial and discriminator loss during training
However, oversampling with basic image transformations may cause overfitting on the minority class that is being oversampled, because the biases present in the minority class become more prevalent after sampling with these techniques. Therefore, this paper also uses neural augmentation networks, namely a Generative Adversarial Network (GAN) [36, 37]. The GAN is able to generate synthetic defect images that are nearly identical to their groundtruth originals. We have developed a deep convolutional GAN, named DCGAN, that includes two CNNs [38]: a generator G (a reversed CNN) and a discriminator D. The generator G takes random input and generates an image by up-sampling the input with transposed convolutions. The discriminator D takes the generated images and the original images and tries to predict whether a given image is generated (fake) or original (real). The GAN plays a two-player min-max game with value function V(D, G) [36]:

$$\min_G \max_D V(D, G), \tag{1}$$

$$V(D, G) = \mathbb{E}_{\omega \sim S_{data}(\omega)}[\log D(\omega)] + \mathbb{E}_{\tau \sim S_{\tau}(\tau)}[\log(1 - D(G(\tau)))] \tag{2}$$

where D(ω) is the probability that ω is a real image, S_data is the distribution of the original data, τ is the random noise used by the generator G to generate the image G(τ), and S_τ is the distribution of the noise. During training, the aim of the discriminator D is to maximize the probability of assigning the correct label to both real and fake images. Since this is a binary classification problem, the model is fit by minimizing the average binary cross entropy. The minimax GAN loss expresses the simultaneous optimization of the discriminator and generator models as the min-max problem in Eq. 1. The discriminator seeks to maximize the average of the log probability for real images and the log of the inverted probabilities for fake images. In other words, it maximizes log D(ω) + log(1 − D(G(τ))). The generator seeks to minimize the log of the inverted probability predicted by the discriminator for fake images. In other words, it minimizes log(1 − D(G(τ))). The training results are shown in Fig. 2, which plots the discriminator loss and the adversarial loss over 600 training iterations. Both the D loss and the G adversarial loss converge. The mean discriminator loss and the mean adversarial loss are 0.031 and 1.617, respectively.
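The relation between the value function in Eq. 2 and the binary cross entropy mentioned above can be checked numerically. The sketch below estimates V(D, G) from a batch of discriminator outputs and compares it with the average binary cross entropy over the same batch; the probabilities are made-up illustrative numbers, not outputs of the DCGAN trained in this paper, and the function names are hypothetical.

```python
import math


def value_function(d_real, d_fake):
    """Batch estimate of V(D, G) in Eq. 2.

    d_real: discriminator outputs D(w) for real images.
    d_fake: discriminator outputs D(G(t)) for generated images.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def binary_cross_entropy(probs, labels):
    """Average binary cross entropy; label 1 = real, label 0 = fake."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)


if __name__ == "__main__":
    d_real = [0.9, 0.8]   # illustrative D outputs on real images
    d_fake = [0.1, 0.3]   # illustrative D outputs on generated images
    v = value_function(d_real, d_fake)
    bce = binary_cross_entropy(d_real + d_fake, [1, 1, 0, 0])
    # With equally sized real and fake batches, BCE equals -V / 2, so
    # minimizing the cross entropy is the same as maximizing V for D.
    print(v, bce)
```

A discriminator that outputs 0.5 for every image gives V = 2 log 0.5 ≈ −1.386, the value of the game when the discriminator cannot tell generated images from real ones.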
Fig. 3. Synthetic images by DCGAN
The training proceeded in six steps. In step 1, we randomly generate a noise vector from a Gaussian distribution and, in step 2, pass it to the generator to produce an image. In step 3, we mix the authentic images from the training dataset with the generated synthetic images. In step 4, we train the discriminator on the mixed dataset, aiming to correctly label each image as fake or real. In step 5, we again generate random noise and label each noise vector as a real image. Finally, in step 6, we train the GAN using these noise vectors and real-image labels, even though they are not actual real images. In summary, each iteration of the GAN algorithm first generates random images and trains the discriminator to distinguish fake from real images, then tries to fool the discriminator by generating more synthetic images, and finally updates the weights of the generator based on the feedback received from the discriminator, which enables us to generate more authentic images.

We developed the generator architecture as follows. First, it includes a dense layer with a ReLU activation function followed by batch normalization to stabilize the GAN, as in [36]. We feed a random noise vector drawn from a Gaussian distribution into this layer. To prepare the number of nodes to be reshaped into a 3D volume, we added another dense layer with the ReLU activation function followed by batch normalization. Then a Reshape layer is added to generate a 3D volume from the input shape. To increase the spatial resolution during training, we add a transposed convolution (Conv2DTranspose) with stride 2 and 32 filters, each of which is 5 × 5, with a ReLU activation function, followed by batch normalization and a dropout rate of 0.3 to avoid overfitting. Finally, we added five up-sampling and transposed convolution (Conv2DTranspose) layers, each of which uses stride 2 and a tanh activation function. They increase the spatial resolution from 14 × 14 to 224 × 224, which is exactly the size of the input images.

Afterward, we developed the discriminator as follows. It includes two convolution layers (Conv2D) with stride 2 and 32 filters, each of which is 5 × 5, with a Leaky ReLU activation function to stabilize training. We added flatten and dense layers with a sigmoid activation function to output the probability that the image is synthetic or real. Figure 3 shows examples of images generated from the NEU dataset. This paper feeds about 1800 images of the NEU dataset to the DCGAN framework, which generates 540 synthetic images that are added to the original NEU dataset to create a new dataset called GNEU. We divide the GNEU dataset into training, validation and testing sets.
The training set includes 1260 real and synthetic images, and the validation set includes 540 real and synthetic images. The test set includes 540 real images.
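The generator's up-sampling path from 14 × 14 to 224 × 224 can be checked with simple arithmetic. The sketch below assumes that every resolution-changing layer uses "same"-style padding, so that a stride-2 transposed convolution exactly doubles the height and width; the padding mode is not stated in the text, and the helper name is hypothetical.

```python
def upsampling_sizes(start, target, stride=2):
    """List the spatial sizes reached by repeated stride-`stride` up-sampling.

    Assumes each step multiplies the size by the stride ("same" padding).
    Stops as soon as the target size is reached or exceeded.
    """
    sizes = [start]
    while sizes[-1] < target:
        sizes.append(sizes[-1] * stride)
    return sizes


if __name__ == "__main__":
    sizes = upsampling_sizes(14, 224)
    print(sizes)                      # sizes after each doubling
    print(len(sizes) - 1, "doublings")
```

Under this assumption, four doublings take 14 to 224. If the generator contains more stride-2 layers than that, some of them would have to use a different padding or cropping scheme, which the description leaves open.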
4 Proposed DSTEELNet Architecture

This section describes the proposed DSTEELNet CNN framework for detecting and classifying defects in surface steel strips. DSTEELNet includes parallel stacks of convolution, activation and max-pooling layers, as shown in Fig. 5. At the feature level, we added parallel layers and then performed convolution with activation on the resulting feature maps. We added a flatten layer to unstack all the tensor values into a 1-D tensor. The flattened features are used as inputs to two dense layers (multi-layer perceptron). To reduce overfitting, we applied dropout. For the classification task, we added a dense layer with a softmax activation function. Finally, the architecture generates a class activation map.

The receptive field RF is the portion of the image from which the filter extracts features; it is defined by the filter size of the layer in the CNN [39]. To generate high-quality training results and capture fine details of the input 2D image, the feature resolution must be increased by expanding the receptive field RF. Therefore, this paper uses dilated convolution [35], adding a dilation rate larger than 1 to the Conv2D kernel, which expands the receptive field while decreasing computational costs. The dilation rate is the spacing between the pixels sampled by the convolution filter. Equation 3 gives the receptive field RF, where k is the size of the kernel and d is the dilation rate.

$$RF = d(k - 1) + 1 \tag{3}$$
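Equation 3 is easy to evaluate for the 3 × 3 kernels and the dilation rates 1, 2 and 3 used by the three parallel branches. The following sketch does only that; the function name is an illustrative choice.

```python
def receptive_field(kernel_size, dilation_rate):
    """Receptive field of one dilated convolution, Eq. 3: RF = d(k - 1) + 1."""
    return dilation_rate * (kernel_size - 1) + 1


if __name__ == "__main__":
    for d in (1, 2, 3):
        rf = receptive_field(3, d)
        print(f"3x3 kernel, dilation {d}: receptive field {rf}x{rf}")
```

With a dilation rate of 1 the receptive field equals the kernel size, and with a dilation rate of 2 a 3 × 3 kernel covers the same 5 × 5 area as a dense 5 × 5 kernel while still using only nine weights.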
For example, with a dilation rate of 2, the kernel skips one pixel between consecutive samples. Figure 4-c shows that a 3 × 3 kernel with a dilation rate of 2 has the same field of view as a 5 × 5 kernel. As a result, the receptive field RF increases and enables the filter to capture more contextual information. In contrast, a 3 × 3 kernel with a dilation rate of 1 has a 3 × 3 receptive field, which is equivalent to standard convolution, as shown in Fig. 4-b. Using Eq. 3, the size of the output can be calculated as follows:

$$\sigma = \frac{g + 2p - RF}{s} + 1 \tag{4}$$

where the input is g × g, and d, p and s are the dilation factor, padding and stride, respectively. By using a number of receptive fields of different sizes, we can capture important features in the scene at different scales. Figure 5 shows the proposed DSTEELNet architecture. It includes five dilated convolution blocks in three parallel stacks. Assume each stack includes m convolution blocks CB^{(i)}, where i ∈ {1, 2, ..., m}, and the corresponding output of each CB^{(i)} is denoted by β_i. The input and output features are denoted by f_in and f_out, respectively, and f_out can be obtained as follows:

$$f_{out} = f_{in} + \sum_{i=1}^{m} \beta_i \tag{5}$$
Fig. 4. Dilated convolution in DSTEELNet
$$\beta_i = \begin{cases} CB^{(i)}(f_{in}), & i = 1 \\ CB^{(i)}\left(f_{in} + \sum_{k=1}^{i-1} \beta_k\right), & 1 < i \le m \end{cases} \tag{6}$$
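The recurrence in Eqs. 5 and 6 can be traced with a toy example. The sketch below reads the inner sum of Eq. 6 as running over the outputs of the earlier blocks, and it uses plain numbers and simple functions as stand-ins for feature maps and convolution blocks; both choices are assumptions made for illustration only.

```python
def aggregate(f_in, blocks):
    """Evaluate Eqs. 5 and 6 for a list of block functions.

    blocks[i] plays the role of CB^(i+1). Each block receives the input
    plus the sum of all earlier block outputs; for the first block that
    sum is empty, so it receives f_in alone.
    Returns (f_out, betas).
    """
    betas = []
    for block in blocks:
        beta = block(f_in + sum(betas))
        betas.append(beta)
    f_out = f_in + sum(betas)
    return f_out, betas


if __name__ == "__main__":
    # Two toy "blocks": doubling, then adding one.
    toy_blocks = [lambda x: 2.0 * x, lambda x: x + 1.0]
    f_out, betas = aggregate(1.0, toy_blocks)
    # beta_1 = CB1(1) = 2; beta_2 = CB2(1 + 2) = 4; f_out = 1 + 2 + 4 = 7
    print(betas, f_out)
```

In the actual network the additions are element-wise over feature maps, so the shapes of f_in and of every β_i would have to match for the sums to be defined.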
Each convolution block CB_{t=j} = conv(n = F) is followed by a max-pooling block to reduce the feature size and the computational complexity for the next layer. For efficient pooling, we used pool_size = (2,2) and strides = (2,2) [41]. Each convolution block CB_{t=j} = conv(n = F) includes two Conv2D layers, each followed by a ReLU activation function, where F is the total number of filters and j is the dilation rate. We used 3 × 3 filters in all convolution blocks. The number of filters in the first convolution block is 64, and in the remaining blocks it is 128, 256 and 512, in order. The three parallel stacks (branches) are identical except that they use different dilation rates, j = 1, 2 and 3 respectively, as shown in Fig. 5. Standard convolution is equivalent to a convolution with a dilation rate of 1. Each parallel branch generates features from images at different CNN layers and therefore produces features with different properties. We concatenated the features generated by these parallel branches and passed the result to the next convolution layer to produce the final low-level features. This convolution layer has 512 filters with a filter size of 3 × 3, a dilation rate of 1 and a stride of 1, and is followed by a ReLU activation function. To convert the square feature map into a one-dimensional feature vector, a flatten layer has been added. Two perceptron (fully connected) layers of size 1024 feed the results of the flatten layer to the dense layer that performs classification. The last dense layer uses a softmax activation function to determine class scores. To reduce overfitting during training, a dropout layer has been added to discard some of the weights produced by the two fully connected layers. In this paper, we used a dropout rate of 0.3.
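The effect of the 3 × 3 dilated convolutions and the 2 × 2 max-pooling described above on the feature-map size can be estimated with the output-size formula of Eq. 4. The sketch below applies the formula with integer (floor) division, which is the usual convention when the division is not exact; the padding values in the example are assumptions, because the padding mode of the convolution layers is not stated in the text.

```python
def receptive_field(kernel_size, dilation_rate=1):
    """Eq. 3: RF = d(k - 1) + 1."""
    return dilation_rate * (kernel_size - 1) + 1


def output_size(g, kernel_size, dilation_rate=1, padding=0, stride=1):
    """Eq. 4 with floor division: (g + 2p - RF) // s + 1."""
    rf = receptive_field(kernel_size, dilation_rate)
    return (g + 2 * padding - rf) // stride + 1


if __name__ == "__main__":
    g = 224
    # 3x3 convolution, dilation 2, no padding: the map shrinks by RF - 1 = 4.
    print(output_size(g, 3, dilation_rate=2))
    # Same convolution with padding 2 (an assumed "same"-style setting): size kept.
    print(output_size(g, 3, dilation_rate=2, padding=2))
    # 2x2 max-pooling with stride 2 halves the size.
    print(output_size(g, 2, stride=2))
```

A pooling window behaves like a kernel with a dilation rate of 1, which is why the same function covers both layer types. With size-preserving convolutions, each convolution block followed by pooling halves the side length of the feature map.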
5 Experiments

The performance of DSTEELNet is evaluated on the generated dataset (GNEU). To demonstrate that DSTEELNet is a reasonable design that achieves significant results, we compare the proposed DSTEELNet with VGG16, VGG19, ResNet50, and MobileNet.
Fig. 5. DSTEELNet architecture
5.1 Experiment Metrics

For the performance evaluation, we use the following metrics:

$$\text{Recall} = \frac{TP}{TP + FN} \tag{7}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{8}$$

$$AP = \frac{\text{Recall} + \text{Precision}}{2} \tag{9}$$

$$F1 = \frac{2TP}{2TP + FN + FP} \tag{10}$$

where TP is the number of true positives, FN is the number of false negatives, and FP is the number of false positives. A true positive is a defective steel image identified as defective. A false positive is a defect-free steel image identified as defective. A false negative is a defective steel image identified as defect-free. The F1 score seeks a balance between Recall and Precision. In addition, the mean average precision (mAP), which is the mean value of AP over all classes, is calculated to evaluate the overall performance.
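The metrics in Eqs. 7 to 10 translate directly into code. The sketch below follows the definitions given in this section, including the paper's own definition of AP as the mean of recall and precision; the counts in the example are illustrative numbers chosen for the demonstration, not values taken from the experiments.

```python
def recall(tp, fn):
    """Eq. 7."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Eq. 8."""
    return tp / (tp + fp)


def average_precision(tp, fp, fn):
    """Eq. 9 as defined in this paper: mean of recall and precision."""
    return (recall(tp, fn) + precision(tp, fp)) / 2.0


def f1_score(tp, fp, fn):
    """Eq. 10."""
    return 2 * tp / (2 * tp + fn + fp)


def mean_average_precision(per_class_counts):
    """Mean of the per-class AP values; each item is a (tp, fp, fn) tuple."""
    aps = [average_precision(tp, fp, fn) for tp, fp, fn in per_class_counts]
    return sum(aps) / len(aps)


if __name__ == "__main__":
    # Illustrative counts for one class with 90 test images.
    tp, fp, fn = 77, 3, 13
    print(round(recall(tp, fn), 3), round(precision(tp, fp), 3))
    print(round(f1_score(tp, fp, fn), 3), round(average_precision(tp, fp, fn), 3))
```

Equation 10 is the harmonic mean of precision and recall written in terms of counts, so both forms give the same F1 value.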
5.2 Setup

The experiment platform in this work is an Intel(R) Core™ i7-9700L with a clock rate of 3.6 GHz, 16 GB of DDR4 RAM, and an NVIDIA GeForce RTX 2080 SUPER graphics card. All experiments were conducted on the Microsoft Windows 10 Enterprise 64-bit operating system, using Keras 2.2.4 with the TensorFlow 1.14.0 backend. We train DSTEELNet, VGG16, VGG19, ResNet50 and MobileNet for about 150 epochs on the GNEU training and validation datasets with a batch size of 32 and an image input size of 224 × 224. We applied the Adam optimizer [40] with a learning rate of 1e-4 and the categorical cross-entropy loss function. The loss is measured between the class probability predicted by the softmax activation function and the true probability of the category. None of the trained models used pretrained weights, such as ImageNet weights, because ImageNet has no steel surface images.

5.3 Results

This section presents the results of the proposed CNN architecture for detecting defects in surface steel strips. Table 3 shows the class-wise classification performance metrics listed in Eqs. 7–10 and compares DSTEELNet with state-of-the-art CNN architectures. Almost all models classify most categories well (such as crazing, patches, rolled-in_scale and scratches). The state-of-the-art models perform poorly on defects such as inclusion and pitted_surface because of similarities in their defect structures. However, DSTEELNet succeeds in detecting all class categories with high accuracy. Table 3 shows that DSTEELNet produces 97.2% mAP, which outperforms the other models: VGG16 (91.2%, 6 percentage points lower), VGG19 (90.0%, 7.2 points lower), ResNet50 (93%, 4.2 points lower) and MobileNet (94%, 3.2 points lower). Table 2 shows the weighted average results.
It shows that, for steel surface defect detection, DSTEELNet achieves the highest precision, recall and F1 scores, as shown in Table 2.

Table 2. Weighted average results

Model        Precision   Recall   F1-score
DSTEELNet    0.97        0.97     0.97
Vgg16        0.89        0.89     0.92
Vgg19        0.92        0.90     0.90
ResNet50     0.95        0.93     0.93
MobileNet    0.94        0.93     0.93
Table 3. Detection results on GNEU dataset (Prec. = precision, Rec. = recall)

                   DSTEELNet           VGG16               VGG19               ResNet50            MobileNet
Class              Prec.  Rec.  F1     Prec.  Rec.  F1     Prec.  Rec.  F1     Prec.  Rec.  F1     Prec.  Rec.  F1
Crazing            1.00   1.00  1.00   1.00   1.00  1.00   0.95   1.00  0.97   0.99   1.00  0.99   0.98   1.00  0.99
Inclusion          0.97   0.87  0.91   1.00   0.51  0.68   0.94   0.54  0.69   1.00   0.66  0.79   1.00   0.82  0.82
Patches            1.00   1.00  1.00   0.89   1.00  0.94   1.00   0.98  0.99   1.00   0.99  0.99   1.00   0.99  0.99
Pitted_surface     0.86   1.00  0.92   0.66   0.97  0.79   0.67   0.89  0.76   0.74   0.98  0.84   0.73   0.98  0.84
Rolled-in_Scale    0.99   1.00  0.99   0.96   1.00  0.98   0.94   1.00  0.97   0.96   1.00  0.98   0.98   0.90  0.94
Scratches          1.00   0.97  1.00   1.00   0.87  0.93   1.00   0.99  0.99   1.00   0.98  0.99   0.96   1.00  0.98
mAP                0.972               0.912               0.90                0.93                0.94
In addition, Table 3 shows that DSTEELNet delivers consistent precision, recall and F1 results for the crazing, patches, pitted_surface, rolled-in_scale and scratches defects. DSTEELNet detects the inclusion defect with the highest F1 score (0.91), followed by MobileNet (0.82), ResNet50 (0.79), VGG19 (0.69) and VGG16 (0.68). Similarly, DSTEELNet detects the pitted_surface defect with the highest F1 score (0.92), followed by MobileNet (0.84), ResNet50 (0.84), VGG16 (0.79) and VGG19 (0.76). Examples of DSTEELNet detection results are shown in Fig. 6, where DSTEELNet detects defects with high confidence scores. Figure 7 shows the confusion matrices for DSTEELNet and the other evaluated models, where the test dataset includes 90 images of each surface defect class. As shown in Fig. 7-a, DSTEELNet detects all of the steel surface defects perfectly except the inclusion defects: it classifies about 13 of the 90 images with inclusion defects as pitted_surface. Furthermore, as shown in Fig. 7 (b-d), MobileNet, ResNet50, and VGG19 misclassify 24, 31 and 40 of the 90 images with inclusion defects as pitted_surface, respectively. In other words, DSTEELNet fails to detect defects in 2.9% of the 540 images, whereas ResNet50, MobileNet, VGG19, and VGG16 fail to detect defects in 6.6%, 7.4%, 10% and 11% of the 540 images, respectively.
Fig. 6. Examples of detection results using DSTEELNet on the GNEU dataset; the green box indicates the defect location with its confidence score
Figure 8 shows the training and validation accuracy for DSTEELNet. Both training and validation accuracy start to improve from epoch 25 and then converge to their highest values.

5.4 Computational Time

Table 4 shows the average inference time to detect defects in a single image for the proposed DSTEELNet and for other deep learning and traditional techniques. It reveals that the traditional methods generally cannot meet real-time requirements. In addition, Table 4 shows that the proposed DSTEELNet is the fastest at detecting defects and can meet real-time requirements. DSTEELNet speeds up the defect detection time of
Fig. 7. Confusion matrices for DSTEELNet, MobileNet, ResNet50 and VGG19 on test dataset
Fig. 8. Training and validation accuracy of DSTEELNet
the traditional techniques by about 20 times and outperforms the deep learning techniques. The accuracies of MobileNet and ResNet50 are higher than those of VGG16 and VGG19, but they take longer to detect defects. In summary, DSTEELNet achieves the highest accuracy and the shortest detection time owing to its reduced computational complexity. It also outperforms the recent end-to-end defect detection network (EDDN) [45], which adds extra architectures to VGG16, including multi-scale feature maps and predictors for detection. The authors reported that EDDN achieved 0.724 mAP and can detect defects in a single image in 27 ms. DSTEELNet outperforms EDDN, detecting defects in a single image in 22 ms with 0.972 mAP.
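Table 4. Comparison of computational time for traditional and deep learning techniques

Category                    Technique    Time per image
Traditional techniques      HOG-SVM      443.53 ms
Traditional techniques      LBP-SVM      382.35 ms
Traditional techniques      GLCM-SVM     454.57 ms
Deep learning techniques    Vgg16        29 ms
Deep learning techniques    Vgg19        28 ms
Deep learning techniques    Resnet50     32 ms
Deep learning techniques    MobileNet    34 ms
Deep learning techniques    DSTEELNet    22 ms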
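The speed-up figures quoted in this section follow from the inference times in Table 4 by simple division. The sketch below reproduces that arithmetic and also converts a per-image time into a throughput in images per second; the dictionary and function names are illustrative.

```python
# Average per-image inference times from Table 4, in milliseconds.
INFERENCE_MS = {
    "HOG-SVM": 443.53,
    "LBP-SVM": 382.35,
    "GLCM-SVM": 454.57,
    "VGG16": 29.0,
    "VGG19": 28.0,
    "ResNet50": 32.0,
    "MobileNet": 34.0,
    "DSTEELNet": 22.0,
}


def speedup(baseline, proposed="DSTEELNet"):
    """How many times faster `proposed` is than `baseline`."""
    return INFERENCE_MS[baseline] / INFERENCE_MS[proposed]


def images_per_second(model):
    """Throughput implied by the per-image time, ignoring any batching."""
    return 1000.0 / INFERENCE_MS[model]


if __name__ == "__main__":
    for name in INFERENCE_MS:
        if name != "DSTEELNet":
            print(f"{name}: {speedup(name):.1f}x slower than DSTEELNet")
    print(f"DSTEELNet: {images_per_second('DSTEELNet'):.1f} images/s")
```

Relative to the three traditional pipelines, the ratio ranges from roughly 17 to 21, which is consistent with the "about 20 times" statement, and 22 ms per image corresponds to about 45 images per second.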
6 Conclusion

The major aim of this paper is to design and develop a CNN architecture suitable for the surface steel strip defect detection task. DSTEELNet, which can form sparse receptive fields, is proposed to generate more robust and discriminative features for defect detection. The experimental results show that the proposed DSTEELNet achieves 97% mAP and outperforms state-of-the-art CNN architectures such as VGG16, VGG19, ResNet50 and MobileNet, with 6%, 7.2%, 4.2% and 3.2% higher mAP, respectively. In future research, we will explore methods to obtain more precise defect boundaries, such as defect segmentation based on deep learning techniques.
References 1. Quality & Yield Optimization for Flat Steel Production (2017). www.isra-parsytec.com 2. Sadeghi, M., Soltani, H., Zamanifar, K.: Application of parallel algorithm in image processing of steel surfaces for defect detection. Fen Bilimleri Dergisi (CFD) 36, 4 (2015) 3. Song, K., Yan, Y.: A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Appl. Surf. Sci. 285, 858–864 (2013) 4. Tian, S., Xu, K.: An algorithm for surface defect identification of steel plates based on genetic algorithm and extreme learning machine. Metals 7(8), 311 (2017) 5. Ragab, K., Alsharay, N.: An efficient defect classification algorithm for ceramic tiles. In: 2017 IEEE 13th International Symposium on Autonomous Decentralized System (ISADS), pp. 255–261 (2017) 6. Ragab, K.: Fast and parallel summed area table for fabric defect detection. Int. J. Pattern Recogn. Artif. Intell. 30(09), 1660004 (2016) 7. Neogi, N., Mohanta, D.K., Pranab, K.: Review of vision-based steel surface inspection systems. EURASIP J. Image Video Process. 1(2014), 50 (2014) 8. Jia, H., et al.: An intelligent real-time vision system for surface defect detection. In: Proceedings of the 17th International Conference on Pattern Recognition. ICPR 2004, vol. 3. IEEE (2004) 9. Sager, K.H., George, L.E.: Defect detection in fabric images using fractal dimension approach. In: International Workshop on Advanced Image Technology, vol. 2011 (2011) 10. Zhou, S., et al.: Classification of surface defects on steel sheet using convolutional neural networks. Materiali Tehnologije 51(1), 123–131 (2017) 11. Ghorai, S., Mukherjee, A., Gangadaran, M., Dutta, P.K.: Automatic defect detection on hotrolled flat steel products. IEEE Trans. Instrum. Meas. 62, 612–621 (2012) 12. Ke, X.U., Lei, W., Wang, J.: Surface defect recognition of hot-rolled steel plates based on tetrolet transform. J. Mech. Eng. 52, 13 (2016)
13. Ahmed, K.R., AlSaeed, M., AlJumah, M.: Parallel Algorithms to detect and classify defects in Surface Steel Strips. In: The World Congress in Computer Science, Computer Engineering, and Applied Computing (CSCE 2020). Transactions on Computational Science & Computational Intelligence. Springer, New York (2020) 14. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015) 15. Ren, R., Hung, T., Tan, K.C.: A generic deep-learning-based approach for automated surface inspection. IEEE Trans. Cybern. 48, 929–940 (2018) 16. Tastimur, C., Yetis, H., Karaköse, M., Akin, E.: Rail defect detection and classification with real time image processing technique. Int. J. Comput. Sci. Softw. Eng. 5, 283 (2016) 17. Jian, C., Gao, J., Ao, Y.: Automatic surface defect detection for mobile phone screen glass based on machine vision. Appl. Soft Comput. 52, 348–358 (2017) 18. Win, M., Bushroa, A.R., Hassan, M.A., Hilman, N.M., Ide-Ektessabi, A.: A contrast adjustment thresholding method for surface defect detection based on mesoscopy. IEEE Trans. Ind. Inform. 11, 642–649 (2015) 19. Kalaiselvi, T., Nagaraja, P.: A rapid automatic brain tumor detection method for MRI images using modified minimum error thresholding technique. Int. J. Imag. Syst. Technol. 1, 77–85 (2015) 20. Wang, L., Zhao, Y., Zhou, Y., Hao, J.: Calculation of flexible printed circuit boards (FPC) global and local defect detection based on computer vision. Circ. World 42, 49–54 (2016) 21. Bai, X., Fang, Y., Lin, W., Wang, L., Ju, B.F.: Saliency-based defect detection in industrial images by using phase spectrum. IEEE Trans. Ind. Inform. 10, 2135–2145 (2014) 22. Borwankar, R., Ludwig, R.: An optical surface inspection and automatic classification technique using the rotated wavelet transform. IEEE Trans. Instrum. Meas. 67, 690–697 (2018) 23. Hu, G.H.: Automated defect detection in textured surfaces using optimal elliptical Gabor filters. Optik 126, 1331–1340 (2015) 24. 
Susan, S., Sharma, M.: Automatic texture defect detection using Gaussian mixture entropy modeling. Neurocomputing 239, 232–237 (2017) 25. Cen, Y.G., Zhao, R.Z., Cen, L.H., Cui, L.H., Miao, Z.J., Wei, Z.: Defect inspection for TFTLCD images based on the low-rank matrix reconstruction. Neurocomputing 149, 1206–1215 (2015) 26. Gibert, X., Patel, V.M., Chellappa, R.: Deep multitask learning for railway track inspection. IEEE Trans. Intell. Transp. Syst. 18, 153–164 (2017) 27. Shumin, D., Zhoufeng, L., Chunlei, L.: Adaboost learning for fabric defect detection based on hog and SVM. In Proceedings of the International Conference on Multimedia Technology, Hangzhou, China, 26–28 July 2011 28. Chondronasios, A., Popov, I., Jordanov, I.: Feature selection for surface defect classification of extruded aluminum profiles. Int. J. Adv. Manuf. Technol. 83, 33–41 (2016) 29. Masci, J., Meier, U., Fricout, G., Schmidhuber, J.: Multi-scale pyramidal pooling network for generic steel defect classification. In: Proceedings of the Int. Joint Conf. on Neural Networks, Dallas, TX, USA, 4–9 August 2013 30. Natarajan, V., Hung, T.Y., Vaikundam, S., Chia, L.T.: Convolutional networks for voting-based anomaly classification in metal surface inspection. In: Proceedings of the IEEE International Conference on Industrial Technology, Toronto, ON, Canada, 22–25 March 2017 31. Wang, T., Chen, Y., Qiao, M., Snoussi, H.: A fast and robust convolutional neural networkbased defect detection model in product quality control. Int. J. Adv. Manuf. Technol. 94, 3465–3471 (2018) 32. Cha, Y.J., et al.: Autonomous structural visual inspection using region—Based deep learning for detecting multiple damage types. Comput. Aided Civ. Infrastruct. Eng. 33, 731–747 (2018) 33. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015 Proceedings (2015)
34. Tao, X., Zhang, D., Ma, W., Liu, X., Xu, D.: Automatic metallic surface defect detection and recognition with convolutional neural networks. Appl. Sci. 8, 1575 (2018) 35. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. In: International Conference on Learning Representations (ICLR) (2016) 36. Xu, H., Warde-Farley, D., Ozair, S., Courville A., Yoshua, K.: Generative Adversarial Networks. arXiv:1406.2661 (2014) 37. Goodfellow, I., Pouget-Abadie, J., Mirza, M.: Generative Adversarial Networks. arXiv:1406.2661 (2014) 38. Radford, A., Metz, L., Chintala, S.: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 (2016) 39. Luo, W., et al.: Understanding the effective receptive field in deep convolutional neural networks. arXiv preprint arXiv:1701.04128 (2017) 40. Kingma, D.P., Ba, J.L.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations, pp. 1–13 (2015) 41. Scherer, D., Müller, A., Behnke, S.: Evaluation of pooling operations in convolutional architectures for object recognition. In: Diamantaras, K., Duch, W., Iliadis, L.S. (eds.) ICANN 2010. LNCS, vol. 6354, pp. 92–101. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15825-4_10 42. He, Y., Song, K., Meng, Q., Yan, Y.: An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Trans. Instrum. Meas. 69(4), 1493–1504 (2020). https://doi.org/10.1109/TIM.2019.2915404 43. Mang Xiao, M., Jiamh, G., Li, L.X., Li, Y.: An evolutionary classifier for steel surface defects with small sample set. EURASIP J. Image Video Process. 48, 236 (2017). https://doi.org/10.1186/s13640-017-0197-y 44. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, pp. 770–778 (2016). https://doi.org/10.1109/CVPR.2016.90 45. Lv, X., Duan, F., Jiang, J.-J., Fu, X., Gan, L.: Deep metallic surface defect detection: the new benchmark and detection network. Sensors 20, 1562 (2020). https://doi.org/10.3390/s20061562
Selective Information Control and Network Compression in Multi-layered Neural Networks

Ryotaro Kamimura1,2

1 Kumamoto Drone Technology and Development Foundation, Techno Research Park, Techno Lab 203 1155-12 Tabaru Shimomashiki-Gun, Kumamoto 861-2202, Japan
2 IT Education Center, Tokai University, 4-1-1 Kitakaname, Hiratsuka, Kanagawa 259-1292, Japan
Abstract. The present paper proposes a new type of information-theoretic method called “selective information control” to produce a variety of internal representations from among which we can choose appropriate ones for interpretation. The new method aims to improve our network compression method so that it produces more interpretable representations by changing the selective information. The selective information proposed here represents to what extent a network component responds selectively to inputs. When the component responds more selectively to the inputs, the selective information in the component becomes higher. We applied the method to a simplified bank marketing data set. By gradually increasing or decreasing selective information, we could produce connection weights that improve generalization as well as weights close to the correlation coefficients of the original data set. The better interpretation was obtained by gradually decreasing selective information. This means that better interpretation can be obtained by increasing the selective information in the lower hidden layers and then filtering the information out in the higher hidden layers.

Keywords: Selective information · Network compression · Interpretation · Generalization

1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 184–204, 2022. https://doi.org/10.1007/978-3-030-82193-7_12

Since the beginning of the connectionism approach to the exploration of cognitive functions [42–44], there have been many attempts to interpret internal representations generated by neural networks and to relate them to actual cognitive processes. In addition to the clarification of cognitive functions, it has been well recognized that the interpretation of the inference mechanisms of neural networks can contribute to the trustworthiness of methods as well as to the improvement of general performance [24]. Recently, the trustworthiness of the employed models has become one of the serious problems in machine learning, because the introduction of complicated machine learning techniques has been endangering our daily life. In particular, the neural network, dealt with in this paper, has been considered one of the most typical black-box models among machine learning methods [7,9,19,38,41,47]. Thus, neural networks can be used to improve generalization performance for many actual problems, but if we cannot explain why such improved performance is obtained, the final results are difficult to accept. A number of interpretation methods have therefore been developed to respond to the trustworthiness and safety problems of machine learning [8,11,17,51].

In addition, the black-box property can be a serious problem in improving the general performance of neural networks itself. Though neural networks have progressed rapidly in prediction performance, they do not necessarily outperform their counterparts, in particular human beings, by a considerable margin. To improve the more general performance of neural networks, we need to deepen our knowledge of how neural networks respond to inputs and produce the corresponding outputs. Above all, a number of serious problems must be solved before neural networks can be applied to actual problems, for example, adversarial attacks [16,33] and catastrophic forgetting [12,20,25]. Those problems cannot be solved unless the inner inference mechanism of neural networks is well understood. Thus, in addition to ensuring the safety and trustworthiness of neural networks, we need to interpret exactly the inner mechanism hidden in complicated network configurations in order to improve general performance itself.

One of the major problems with those interpretation methods is that they have focused on local interpretation, where much effort has been devoted to how a neural network responds to a given input pattern. In particular, in the field of convolutional neural networks (CNN), this tendency toward local interpretation has been apparent.
The CNN, dealing with image data sets, has penetrated rapidly into many application fields, and the necessity for interpretation has been urgent. However, the network architecture for CNN has been more and more complicated, with many specialized layers, such as convolutional layers, which has prevented us from interpreting their inner inference mechanism. Thus, in spite of an urgent need for the inner inference mechanism to be known and the proposal of many different types of interpretation methods, the main focus has been restricted to the local and individual interpretation. This means that, due to easy and intuitive understanding of image data sets, the interpretation has been replaced by instance-based visualization methods such as activation maximization, selectivity detection, local perturbations, Grad-CAM and layer-wise relevance propagation [3,5,14,15,34,37,45,46]. As mentioned, these types of visual interpretation methods cannot necessarily consider the inner inference mechanism of neural networks as was done in the beginning of the connectionism approach [42–44]. We should clarify the inner inference mechanism, hidden behind complicated input patterns as well as interwoven components of neural networks, to uncover the main and fundamental learning processes of neural networks. To extract this core inner inference mechanism, we have introduced a method of network compression [22,23]. The method aims to compress complicated and multi-layered neural networks into the simplest ones without hidden layers, whose
186
R. Kamimura
interpretation is much easier than that of multi-layered neural networks. Recently, model compression has received due attention as a way to reduce the computational burden of complicated and multi-layered neural networks [2,4,10,13,18,21,32,36]. However, these conventional model compression methods have been developed to improve generalization performance. More concretely, complicated neural networks are replaced by much simpler ones whose generalization performance is approximately equivalent to that of the complicated ones. In practice, these compression methods treat all components of the original complicated neural networks as black boxes. Thus, even if we can interpret the inner mechanism of the simplified networks, the interpretation does not necessarily represent that of the original, complex neural networks. The interpretations of the smaller models and of the original larger models are completely different from each other, so it is impossible to apply conventional compression to the interpretation problem. Contrary to those conventional methods, our network compression method tries to keep as much as possible of the information contained in the connection weights of the original, complex neural networks by considering all possible routes from inputs to the corresponding outputs. This compression method has been successfully applied to several data sets [22,23], producing very simple networks with better interpretation performance. However, simply compressing the original complicated neural networks does not necessarily produce internal representations that are easy to interpret, because connection weights and neurons may be complicatedly interwoven when networks are compressed. Thus, for the compression results to be applicable to the interpretation problem, we need a method that produces networks with more interpretable representations.
In this context, we introduce here a new type of information-theoretic method to control information content in neural networks, expecting that appropriate information control can lead us to simpler, compressed networks that are easy to interpret. As is well known, one way to control final internal representations is to control the information content stored in neural networks. We should note that in information theory the efficient transmission of information is the most important consideration, whereas in neural networks the importance should be put not on transmission but on the storage, inside the networks, of information on input patterns and outer conditions. Since the pioneering works of Linsker, many different types of information-theoretic methods have been proposed, from the maximum to the minimum information preservation principles [6,27–31,39,40,48–50]. The maximum and minimum information principles can be differentiated depending on whether the focus is on generalization or interpretation. Thus, one of the easiest ways is to borrow those conventional information measures for the problem of interpretation as well. However, two major problems in borrowing those measures can be pointed out, namely, computational complexity and interpretability. This means that information-theoretic measures such as mutual information cannot necessarily be applied to controlling the production of internal representations for interpretation, due to the computational complexity and abstract property of information
measures. First, information-theoretic measures such as entropy and mutual information cannot be easily implemented, and even when they are successfully implemented, their computational complexity makes it hard to apply them to actual data sets. Second, even if the implementation and computation are possible, the information content tends to be hard to interpret due to the abstract property of information measures. Then, even if information can be extracted from neural networks, it may be impossible to extract concrete meanings with respect to the inner mechanism of neural networks. To solve those problems, the present paper proposes a new measure of information, called “selective information.” In this new information measure, we suppose that information content in neural networks should be represented in terms of the selectivity of components such as neurons and connection weights. When a component responds very selectively to a specific input, the component contains much information on that input. In our actual situation, we suppose that this selectivity can be represented in connection weights. When a neuron is firmly connected with a specific neuron under a given input pattern, the connection weight between the two neurons should have much information on the input pattern. If it is possible to describe this selectivity, we can have a new definition of information, which can be concretely described in terms of connection weights. Thus, we propose here a new information measure, selective information, and use it to change the information content so as to obtain multiple internal representations, from among which we can choose the one best suited for interpretation. The present paper is organized as follows. In Sect. 2, we first present how to compress multi-layered neural networks to obtain the simplest ones without hidden layers.
Then, we introduce the selective information described in terms of the number of connection weights between layers for simple interpretation. In Sect. 3, we apply the method to a simplified version of the well-known bank marketing data set in the machine learning database. In the experiments, we try to show that we can increase or decrease the selective information, producing different compressed weights. The gradual information decrease tends to produce compressed weights close to the correlation coefficients of the original data set. Thus, the compressed weights are easily interpretable. On the other hand, the gradual information increase can produce compressed weights close to the regression coefficients of the logistic regression analysis, with better generalization performance. The results show that the selective information control can be used to produce networks with much more interpretable weights, explaining why such a simple interpretation is possible.
2 Theory and Computational Methods

2.1 Network Compression
One of the main techniques for interpretation is to simplify network configurations as much as possible. We have proposed a new type of simplification in terms of network compression [23]. In this method, we try to compress all layers step
Fig. 1. Network architecture with six layers, including four hidden layers: (a) original network, (b) 1st compression, (c) 2nd compression, (d) 3rd compression, (e) final compression, and (f) final compressed network.
by step by considering all routes from inputs to outputs to obtain the simplest network without hidden layers. Let us show an example of network compression, and for simplicity’s sake, we suppose six layers, including the input and output layers, as in Fig. 1. We compress this six-layered neural network into a two-layered one without hidden layers step by step, considering all routes from inputs to outputs. Let us first take the connection weights from the second layer, represented by (2), to the third layer (3). As shown in Fig. 1(b), the weights from the first layer to the second layer, $w_{ij}^{(1,2)}$, and from the second layer to the third layer, $w_{jk}^{(2,3)}$, are combined into

$$ w_{ik}^{(1,3)} = \sum_{j=1}^{n_2} w_{ij}^{(1,2)} w_{jk}^{(2,3)} \tag{1} $$
where (1,3) represents a route from the first to the third layer, and $w_{ik}^{(1,3)}$ denotes a compressed weight from the first layer to the third layer. Then, this compressed weight is again compressed with the connection weights from the third to the fourth layer in Fig. 1(c):

$$ w_{il}^{(1,4)} = \sum_{k=1}^{n_3} w_{ik}^{(1,3)} w_{kl}^{(3,4)} \tag{2} $$
In the same way, we can have the compressed weight from the first layer to the fifth layer, $w_{im}^{(1,5)}$. By combining this compressed weight with the connection weights to the output layer in Fig. 1(e), we have

$$ w_{in}^{(1,6)} = \sum_{m=1}^{n_5} w_{im}^{(1,5)} w_{mn}^{(5,6)} \tag{3} $$
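Equations (1) to (3) are ordinary matrix products applied one layer at a time, so the whole compression amounts to a chain product of the layer-to-layer weight matrices. The following is a minimal sketch of that idea using plain Python lists. The function names `matmul` and `compress` and the toy weights are illustrative assumptions, and, as in the equations, biases and activation functions are ignored.

```python
def matmul(a, b):
    """Multiply two weight matrices stored as nested lists.

    a[i][j] is the weight from unit i to unit j, b[j][k] from unit j to unit k.
    """
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("layer sizes do not match")
    n_out = len(b[0])
    return [[sum(row[j] * b[j][k] for j in range(inner)) for k in range(n_out)]
            for row in a]


def compress(weights):
    """Collapse consecutive weight matrices into one input-to-output matrix."""
    compressed = weights[0]
    for w in weights[1:]:
        # Each pass sums over all routes through one more layer.
        compressed = matmul(compressed, w)
    return compressed


# Toy network (hypothetical values): 2 inputs -> 3 units -> 2 units -> 1 output
w12 = [[1.0, 0.0, -1.0], [0.5, 2.0, 0.0]]
w23 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w34 = [[2.0], [-1.0]]

w14 = compress([w12, w23, w34])
print(w14)  # one compressed weight per input unit
```

With an array library the same chain could be written as a sequence of matrix multiplications; the list version is used here only to keep the sketch dependency-free.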
Following these steps, we can compress any neural network, though the steps are limited to fully connected ones.

2.2 Controlling Selective Information
As mentioned in the introduction, network compression does not necessarily produce interpretable networks, because connection weights, neurons, and layers can be complicatedly interwoven. Thus, we must transform the original neural networks so that they are easily compressible and interpretable. One way to control network configurations is to control the information content contained in neural networks. As mentioned above, we do not use conventional information measures such as mutual information, because they are difficult to interpret when applied to the interpretation problem. For easy interpretation, we need a measure that can be interpreted more concretely in terms of components such as neurons and connection weights. For this purpose, we introduce selective information. We adopt this concept because the information content that is not transmitted but stored in neural networks can be translated into how selectively the networks respond to specific input patterns. When neural networks respond more selectively to input patterns, they naturally have more information content on those patterns. Let us begin with the definition of selectivity, and for simplicity’s sake, we compute the selectivity between the second and third layers (2,3). As a first approximation, we suppose that the importance of weights can be obtained by their absolute values

$$ u_{jk}^{(2,3)} = \left| w_{jk}^{(2,3)} \right| \tag{4} $$

Then, we normalize this importance by its maximum value:

$$ g_{jk}^{(2,3)} = \frac{u_{jk}^{(2,3)}}{\max_{j'k'} u_{j'k'}^{(2,3)}} \tag{5} $$

We call this normalized importance “relative importance”; it is used to increase or decrease the selectivity, as described below. By using this relative importance, selective information can be computed by

$$ G^{(2,3)} = n_2 n_3 - \sum_{j=1}^{n_2} \sum_{k=1}^{n_3} g_{jk}^{(2,3)} \tag{6} $$
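As an illustration, Eqs. (4) to (6) can be sketched in a few lines of plain Python. The function names are illustrative assumptions. The all-zero case, where the normalization in Eq. (5) is undefined, is set to zero here because the paper defines the selective information as zero in that case.

```python
def relative_importance(w):
    """Eqs. (4)-(5): absolute weights divided by the largest absolute weight."""
    u = [[abs(x) for x in row] for row in w]
    u_max = max(max(row) for row in u)
    if u_max == 0.0:
        raise ValueError("relative importance is undefined when all weights are zero")
    return [[x / u_max for x in row] for row in u]


def selective_information(w):
    """Eq. (6): number of weights minus the sum of relative importances."""
    if all(x == 0 for row in w for x in row):
        return 0.0  # defined as zero in the paper: no weights, no stored information
    g = relative_importance(w)
    n_weights = sum(len(row) for row in g)
    return n_weights - sum(sum(row) for row in g)


one_strong = [[0.0, 0.0], [3.0, 0.0]]   # a single nonzero weight
all_equal = [[2.0, -2.0], [2.0, 2.0]]   # equal strengths, mixed signs

print(selective_information(one_strong))  # 3.0, i.e. n2 * n3 - 1
print(selective_information(all_equal))   # 0.0
```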
This selective information is maximized when only one connection weight has a nonzero value, while all the others are zero. In this case, all the information from the preceding layer is contained in the one connection weight with the highest selectivity, and the selective information takes its highest value, $n_2 n_3 - 1$. On the other hand, the selective information is minimized when all connection weights become equal, and the minimum value is zero. In the extreme case where all connection weights are zero, the selective information is zero by definition: because no connection weights exist, the information content stored in connection weights should naturally be zero. This information measure is closely related to the entropy measure of information theory [1], but it has a more concrete meaning. When the selective information increases gradually, the number of connection weights decreases. Thus, this measure is directly related to regularization methods such as weight decay, which have played very important roles in improving the general performance of neural networks [26]. However, selective information aims to condense information on important features of input patterns. We try to increase or decrease this selective information. To control it, we must control the normalized, or relative, importance. For this, we introduce an inverse of the original relative importance

$$ \bar{g}_{jk}^{(2,3)} = 1 - g_{jk}^{(2,3)} \tag{7} $$
This means that, when the importance increases, the inverse importance decreases. The inverse importance can be used to decrease the strength of weights with larger importance, so that selective information is reduced. Then, by combining those two types of importance, we have a unified importance

$$ h_{jk}^{(2,3)} = \left( \alpha\, g_{jk}^{(2,3)} + \bar{\alpha}\, \bar{g}_{jk}^{(2,3)} \right)^{\beta} \tag{8} $$
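The unified importance can be sketched as below. The sketch reads Eq. (8) as a convex combination of the relative and inverse importance raised to the power β, which is an interpretation of the printed formula; the function name `unified_importance` is an illustrative assumption.

```python
def unified_importance(g, alpha, beta):
    """Eqs. (7)-(8) applied element-wise to a matrix of relative importances.

    g     : nested list of relative importances in [0, 1]
    alpha : mixing parameter in [0, 1]; 1 keeps g, 0 gives the inverse 1 - g
    beta  : positive exponent; small values pull every factor toward one
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between zero and one")
    if beta <= 0.0:
        raise ValueError("beta must be larger than zero")
    return [[(alpha * x + (1.0 - alpha) * (1.0 - x)) ** beta for x in row]
            for row in g]


g = [[1.0, 0.5, 0.0]]
print(unified_importance(g, alpha=1.0, beta=1.0))   # the relative importance itself
print(unified_importance(g, alpha=0.0, beta=1.0))   # the inverse importance
print(unified_importance(g, alpha=0.9, beta=0.05))  # factors pulled toward one by a small beta
```

Under this reading, a small β keeps the factors for all nonzero importances close to one, which fits the description of β as a parameter that stabilizes learning.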
where the parameter α controls the magnitude of importance and ranges between zero and one, with $\bar{\alpha} = 1 - \alpha$, and the other parameter β, which should be larger than zero, controls the stability of learning processes. When the parameter α is one, this unified form is equivalent to the initial relative importance. On the other hand, when the parameter α becomes zero, the unified form is equivalent to the inverse form. Thus, by changing the parameter α from one to zero, we can adjust the importance in fine detail. The next step is to include this unified importance in the learning process. Though it might be better to introduce it directly into the actual learning processes, this direct inclusion causes much contradiction between error minimization and selective information control, because minimizing the error between outputs and the corresponding targets works against the selective information control.

2.3 Selective Information-Driven Learning
We introduce here a new type of learning method in which the selective information guides learning processes, and this method is called “selective information-driven learning” to stress the importance of selective information. The learning steps have some sub-steps called “assimilation steps.” In the beginning of a
learning step, the unified importance is applied to the connection weights, and in the following assimilation sub-steps, this importance is assimilated. Because the strength of importance can be weakened in the process of sub-steps, in the next learning step, the importance is again applied, followed by the subsequent assimilation sub-steps, and so on. The actual weights for the $t$th learning step can be computed by

$$ w_{jk}^{(2,3)}(t) = h_{jk}^{(2,3)}(t-1)\, w_{jk}^{(2,3)}(t-1) \tag{9} $$
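The alternation between applying the importance and assimilating it can be sketched as follows. This is a simplified illustration, not the author's implementation: the exponent form of the unified importance is an assumed reading of Eq. (8), and `assimilate` is a hypothetical callback standing in for the ordinary error-minimizing training sub-steps.

```python
def apply_importance(w, alpha, beta):
    """One application of Eq. (9) to a weight matrix given as nested lists."""
    u_max = max(abs(x) for row in w for x in row)
    if u_max == 0.0:
        return [list(row) for row in w]  # nothing to rescale
    scaled = []
    for row in w:
        new_row = []
        for x in row:
            g = abs(x) / u_max                                   # Eqs. (4)-(5)
            h = (alpha * g + (1.0 - alpha) * (1.0 - g)) ** beta  # Eqs. (7)-(8)
            new_row.append(h * x)                                # Eq. (9)
        scaled.append(new_row)
    return scaled


def selective_information_driven_learning(w, alpha, beta, n_steps, assimilate):
    """Alternate importance application and assimilation for n_steps steps."""
    for _ in range(n_steps):
        w = apply_importance(w, alpha, beta)
        w = assimilate(w)  # hypothetical: several epochs of ordinary training
    return w


weights = [[4.0, 1.0]]
print(apply_importance(weights, alpha=0.9, beta=1.0))  # weak weight shrinks much more
print(apply_importance(weights, alpha=0.1, beta=1.0))  # strong weight shrinks much more
```

In this toy case with β set to one, a large α reduces the weaker weight far more than the stronger one, while a small α does the opposite and pushes the stronger weight down.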
When the parameter α is larger, only one weight tends to be stronger, meaning that the corresponding selective information becomes larger. On the contrary, when the parameter α is smaller, stronger weights are pushed toward smaller ones, and the selective information becomes smaller. Compared with the abstract information measures of information theory, the selective information is directly connected with the actual meaning in terms of the number of connection weights. When the selective information increases, the number of strong connection weights becomes smaller, meaning that information content is stored in a smaller number of weights. On the other hand, when the selective information decreases, the number of strong weights becomes larger. Contrary to the information content of information theory, which represents the possible information to be transmitted, the information content here is the information stored in connection weights. In this sense of information storage, the selective information represents how firmly a connection weight is connected with the corresponding layers. Let us see some examples of controlling the selective information in Fig. 2. Note that the selective information is actually controlled in this paper only for hidden layers. This is because it is easier to control the selective information for hidden layers than for the input and output layers, which are constrained to receive input and output information. Figure 2(a) shows a situation where the selective information increases from the second layer to the fifth layer. Actually, the parameter α increases from 0.1 to 0.9. We use parameter values less than one here because we try to make all connection weights smaller, preventing the explosion of weights by this selective information assimilation. On the other hand, when the strength of connection weights becomes smaller, connection weights are pushed toward smaller values.
When the parameter α is larger, the relative importance becomes effective, in which larger connection weights remain the same because of the large importance. Gradually, several weights with large strength remain the same, while all the other weak connection weights become weaker and weaker. Actually, as shown in Fig. 2(a), the number of connection weights between layers becomes smaller when the layers become higher. In terms of information, information content in the connection weights remaining in the higher layer becomes larger. We should note again that the information content is not to be transmitted but to be stored, and the information content to be stored is the inverse to the information to be transmitted. On the contrary, Fig. 2(c) shows a case where the selective information decreases, meaning that the number of strong connection weights becomes larger from the
Fig. 2. Three types of network configuration: by gradual information increase (a), no information control (b), and gradual information decrease (c), with the corresponding compressed weights (d)-(f).
second to the fifth layer. In this case, the parameter α is gradually decreased from 0.9 to 0.1, and the selective information becomes smaller, and finally in the last and final hidden layer, all connection weights tend to have the same strength or importance. Because all connection weights have the same importance, we should say that information content to be stored should be small, and information is distributed into many connection weights. In the following section on the experimental results, we will show how information, generalization, and compressed weights can be changed by controlling the selective information.
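The layer-wise change of α described above can be written as a simple schedule. The paper states only the end points 0.1 and 0.9, so the linear spacing between them, like the function name `alpha_schedule`, is an assumption made for illustration.

```python
def alpha_schedule(n_layers, start, end):
    """Return one alpha value per controlled layer, linearly spaced (assumption)."""
    if n_layers < 1:
        raise ValueError("at least one layer is required")
    if n_layers == 1:
        return [start]
    step = (end - start) / (n_layers - 1)
    return [start + i * step for i in range(n_layers)]


# Four controlled layers, as in the second to fifth layers of Fig. 2
increase = alpha_schedule(4, 0.1, 0.9)  # gradual information increase
decrease = alpha_schedule(4, 0.9, 0.1)  # gradual information decrease
print(increase)
print(decrease)
```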
3 Results and Discussion

3.1 Experimental Outline
We applied the method to the well-known machine learning data set of direct marketing campaigns [35]. We tried to predict whether a client would subscribe to a term deposit. Figure 3 shows a network architecture for the bank marketing data set. The number of inputs was reduced to only six, which were chosen to have some correlations with the targets in the original data set for easy interpretation. The number of hidden layers increased as much as possible, and the actual number of hidden layers was set to 25, because it was impossible to increase the number of layers beyond this point with reasonably good generalization performance. The number of input patterns was set to 24,415. The parameter α for the unified function increased from 0.1 to 0.9 by the gradual information increase method. The learning parameter β was forced to be very small and actually set to 0.05, and this value was used to stabilize the learning processes by slowing the learning rate as much as possible. On the other hand, for the gradual information decrease method, the parameter was decreased from 0.9 to 0.1, where
Fig. 3. A network architecture with 25 hidden layers (27 layers with the input and output layers) and six input units (a) and the corresponding simplified network with compressed weights (b) for the bank marketing data set.
the selective information was forced to decrease when the hidden layer number increased from one to 25. For the experiment, we used the neural network package of scikit-learn, leaving its settings unchanged except for the number of epochs (steps) and the hyperbolic tangent activation function. Naturally, we added our selective information control inside the package. The number of learning steps was set to 150, each of which used ten sub-steps (epochs) to assimilate the initial importance of connection weights. We compared the present methods with the conventional method, which used the same settings except for the selective information control.

3.2 Selective Information Control
First, we increased the selective information gradually by increasing the parameter α from 0.1 to 0.9. Figure 4 shows the selective information from the initial hidden layer to the last (25th) hidden layer when the number of learning steps increased from one (top left) to 150 (bottom right). The selective information was plotted every three steps, from one to 150. As can be seen in the figure,
Fig. 4. Selective information by the gradual information increase method from the first hidden layer to the last hidden layer when the learning steps increased from one (top left) to 150 (bottom right).
the selective information remained small even if the layer number increased in the first several steps. When the learning steps increased further, the selective information tended to increase gradually from the first to the last hidden layer. However, one of the interesting things to see is that, when the learning steps increased, the lower hidden layers tended to have some higher selective information. This may be because the initial several hidden layers tended to be influenced by the input layer in which no selective information control was implemented in this experiment. The results show that the gradual information increase by increasing the parameter α from 0.1 to 0.9 had a natural effect of increasing the selective information content. Figure 5 shows the weights from the first hidden layer (top left) to the 25th hidden layer (bottom right). As can be seen in the figure, many connection weights were strong in the initial several hidden layers. Then, the number of strong weights became smaller and smaller. In the end, in the last hidden layer, only one weight had the stronger value, while all the others had much smaller values. The results show that the number of stronger connection weights became smaller when the selective information increased, as can be expected. On the other hand, we employed the gradual selective information decrease method by decreasing the parameter α from 0.9 to 0.1. Figure 6 shows the selective information for the hidden layers No.1 to No.25 and when the number of learning steps increased from one (top left) to 150 (bottom right), where
Fig. 5. Weights for all hidden layers by the gradual selective information increase method for the bank data set. Weights were arranged from top left (weights from the first hidden layer) to bottom right (weights to the last hidden layer).
Fig. 6. Selective information by the gradual information decrease method from the first hidden layer to the last hidden layer when the learning steps increased from one (top left) to 150 (bottom right).
the selective information was plotted every three learning steps. The selective information was small for all hidden layers in the first place (top left). Then, gradually, the selective information for the initial several hidden layers became larger. Then, the selective information gradually decreased when the hidden layer number increased, though in the last several hidden layers, the strength
Fig. 7. Weights for all hidden layers by the selective information decrease method for the bank data set. Weights were arranged from top left (weights from the first hidden layer) to bottom right (weights to the final, 25th layer).
Fig. 8. Selective information without information control from the first hidden layer to the last hidden layer when the learning steps increased from one (top left) to 150 (bottom right).
of connection weights became slightly larger. The results show that the gradual information decrease method by decreasing the parameter α was effective in actually decreasing the selective information. Figure 7 shows the connection weights for 25 hidden layers from the first (top left) to 25th hidden layer (bottom right). As can be seen in the figure, the number
Fig. 9. Weights for all hidden layers without information control for the bank data set. Weights were arranged from top left (the first step) to bottom right (150th step).
of strong weights was smaller in the first several hidden layers. Then, when the hidden layer number increased further, the number of strong weights became larger and larger. This means that the number of strong weights became larger when the selective information was forced to decrease by the gradual selective information decrease method. Figure 8 shows the selective information for networks without selective information control. Even when the hidden layer number increased from one to 25, the selective information decreased only very slightly, and little change could be seen. These results show that the selective information control was effective in controlling the selective information content. Then, Fig. 9 shows connection weights from the first hidden layer (top left) to the 25th hidden layer (bottom right); the strengths of the weights in all hidden layers were almost random, and no regularity could be seen. This shows that connection weights could not be explicitly arranged without the selective information control.

3.3 Generalization Performance
Then, we compared the generalization performance in terms of accuracy, precision, recall, and F-score for the methods with and without selective information control. Figure 10 shows generalization performance in terms of accuracy (top left), precision, recall, and F-score (bottom right) by the gradual selective information increase method. The accuracy increased and became close to 0.8, which was the best performance of all three methods. The precision increased up to the level of 0.7, but the recall value became close to 0.5. Thus, the information increase method, with the highest accuracy score, tended to increase the precision, while the recall remained small. Figure 11 shows the results by the gradual information decrease method. As can be seen in the figure (top left), the accuracy was lower than that by the information increase method in Fig. 10(a), and it could not increase beyond 0.7.
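For reference, the four scores discussed in this section can be computed from binary predictions as sketched below. The sketch assumes the standard definitions and takes the F-score to be the F1 score, which the paper does not state explicitly; the labels are hypothetical toy data, not results from the experiments.

```python
def binary_scores(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = positive class)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs),
            "precision": precision,
            "recall": recall,
            "f1": f1}


# Hypothetical labels: a cautious classifier that misses half of the positives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
scores = binary_scores(y_true, y_pred)
print(scores)
```

In this toy case the precision (2/3) exceeds the recall (1/2), the same kind of trade-off that the text describes for the gradual information increase method.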
Fig. 10. Generalization performance in terms of accuracy (top left), precision, recall, and F-score (bottom right) by selective information increase for the bank data set.
Fig. 11. Generalization performance in terms of accuracy (top left), precision, recall, and F-score (bottom right) by selective information decrease for the bank data set.
Fig. 12. Generalization performance in terms of accuracy (top left), precision, recall, and F-score (bottom right) without selective information control for the bank data set.
Compared with the results by the information increase method in Fig. 10, the precision was lower, but the recall was higher and larger than 0.5. Finally, the F-score was slightly lower than that by the gradual information increase method. These results show that the precision by the information increase method was higher than that of the information decrease method. On the contrary, the recall showed the inverse tendency, meaning that the information decrease method produced higher recall values. From these results, we can see that, by changing the information content in hidden layers, different generalization performances could be obtained, meaning that different internal representations were obtained. Finally, Fig. 12 shows the results by the method without information control. All measures were close to those by the gradual information increase or decrease method. However, one of the main differences is that the method without information control produced less stable learning processes, in which suddenly lower values were seen when the number of learning steps increased beyond about 100. This means that the conventional method without information control may be unstable when the number of hidden layers increases considerably. Thus, these results imply that the information control methods have the effect of stabilizing the learning processes when the number of hidden layers increases considerably.

3.4 Interpreting Compressed Weights
Finally, we tried to interpret compressed weights by the present methods and to compare them with the regression coefficient of the logistic regression analysis.
Figure 13(a) shows correlation coefficients between inputs and targets of the original data set. As can be seen in the figure, input No.2 (the duration of last contact) had the largest strength and importance. This means that, to subscribe to the term deposit, we need to make sustained contact with customers, which is intuitively reasonably valid. Figure 13(b) shows compressed weights by the gradual selective information increase. Among strong correlation coefficients in Fig. 13, only input No.2 had the largest value, followed by input No.3 (campaign) with moderately strong importance, while all the other inputs had almost no importance. As explained above, the prediction performance in terms of accuracy showed the best performance by the gradual information increase method; those inputs with moderately strong correlations in the original correlation coefficients in Fig. 13(a) were of no use in improving the prediction performance. Note that the correlation between the original correlation in Fig. 13(a) and those weights was 0.92. For the gradual information decrease method in Fig. 13(c), in addition to the strongest input No.2, inputs No.4 (housing) and No.6 (sending documents) had moderate importance, and the correlation coefficient between those compressed weights and the original correlation coefficient became the largest value of 0.97. This means that, though the information decrease method showed lower generalization performance than the information increase method, the gradual information decrease method produced compressed weights quite similar to the correlation coefficients in Fig. 13(a). Thus, the gradual information decrease method, though the prediction performance became lower, produced weights whose interpretation was easier due to the similarity to the original correlation coefficients. Figure 13(d) shows compressed weights by the method without information control. As can be seen in the figure, inputs No.2, No.4, and No.6 had larger importance. 
These three inputs had also some importance by the gradual information decrease in Fig. 13(c) and the original correlation coefficients in Fig. 13(a). However, the correlation between the original correlation and those compressed weights was 0.7, the lowest score. Finally, Fig. 13(e) shows the regression coefficients by the logistic regression analysis. As can be seen in the figure, the coefficients were similar to compressed weights by the gradual information increase method in Fig. 13(b), and the correlation between those weights and the original correlations was 0.92, slightly better than the 0.91 by the gradual information increase method. These results show that, though the logistic regression analysis could be expected to extract linear correlations between inputs and targets, the linear correlations quite close to the original correlation coefficients were obtained by the gradual information decrease method. To extract the real linear relations between inputs and outputs, we need to use multi-layered neural networks with information control, more exactly, gradual information decrease. Though with some speculation from these results, multi-layered neural networks have the property of losing information content naturally when we see neural networks from the point of view of the information channel in which information should decrease [1]. However, if we can appropriately control the information content
Fig. 13. The original correlation coefficients between inputs and targets (a), compressed weights by information increase (b), by information decrease (c), and without information control (d), and regression coefficients by the logistic regression analysis (e) for the bank data set.
in multi-layered neural networks, complicated components such as connection weights may be disentangled as much as possible to produce the very simple and linear relations between inputs and outputs.
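The similarity scores quoted in this discussion (0.92, 0.97, 0.7) compare a vector of compressed weights with the vector of original input-target correlation coefficients. The following is a minimal sketch of such a comparison, assuming the score is Pearson's correlation coefficient (the text does not name the estimator); the function name `pearson` and both example vectors are hypothetical and are not values read from Fig. 13.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variances (the 1/n factors cancel).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical vectors for illustration only (one entry per input variable).
original_correlations = [0.05, 0.40, -0.07, -0.14, 0.02, 0.15]
compressed_weights = [0.01, 0.90, -0.10, -0.30, 0.00, 0.35]
similarity = pearson(original_correlations, compressed_weights)
print(f"similarity = {similarity:.2f}")
```

A value close to 1 would indicate that the compressed weights rank and scale the inputs much as the raw correlation coefficients do, which is the sense in which the text calls such weights easier to interpret.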
4 Conclusion
The present paper aimed to propose a new type of information-theoretic method for interpretation. We suppose that information content on input patterns can be represented in terms of selectivity of components in neural networks. When a neuron responds to a specific input very selectively, the neuron should have some information content on the input. Thus, contrary to the abstract and incomprehensible property of conventional information-theoretic measures such as mutual information, the present measure of selective information can be interpreted very
concretely. Then, we proposed a method to flexibly control selective information content to obtain different types of internal representations. We applied the method to the bank marketing data set, examining how the gradual selective information increase or decrease could produce different internal representations. The results showed that the selective information control could produce different types of connection weights. The gradual information increase produced networks with better generalization, while the gradual information decrease was related to the production of simple relations between inputs and outputs. Thus, for simple and easy interpretation, we should adopt the gradual information decrease, and we should obtain much information in the hidden layers close to the input layer. We focused here on two types of information control, namely, gradual information increase and decrease. However, we should examine more exactly to what extent the information change in hidden layers affects the interpretability. Though some problems should be solved for actual applications, the results in this paper can certainly contribute to understanding the inner inference mechanism of neural networks.
References
1. Abramson, N.: Information theory and coding (1963)
2. Adriana, R., Nicolas, B., Ebrahimi, K.S., Antoine, C., Carlo, G., Yoshua, B.: FitNets: hints for thin deep nets. In: Proceedings of ICLR (2015)
3. Arbabzadah, F., Montavon, G., Müller, K.R., Samek, W.: Identifying individual facial expressions by deconstructing a neural network. In: German Conference on Pattern Recognition, pp. 344–354. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-45886-1_28
4. Ba, J., Caruana, R.: Do deep nets really need to be deep? In: Advances in Neural Information Processing Systems, pp. 2654–2662 (2014)
5. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10(7), e0130140 (2015)
6. Becker, S.: Mutual information maximization: models of cortical self-organization. Netw. Comput. Neural Syst. 7, 7–31 (1996)
7. Benítez, J.M., Castro, J.L., Requena, I.: Are artificial neural networks black boxes? IEEE Trans. Neural Networks 8(5), 1156–1164 (1997)
8. Bojarski, M., et al.: Explaining how a deep neural network trained with end-to-end learning steers a car. arXiv preprint arXiv:1704.07911 (2017)
9. Bologna, G.: Is it worth generating rules from neural network ensembles? J. Appl. Logic 2(3), 325–348 (2004)
10. Buciluǎ, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541. ACM (2006)
11. Caruana, R., Lou, Y., Gehrke, J., Koch, P., Sturm, M., Elhadad, N.: Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1721–1730. ACM (2015)
12. Chen, X., Wang, S., Fu, B., Long, M., Wang, J.: Catastrophic forgetting meets negative transfer: batch spectral shrinkage for safe transfer learning. In: Advances in Neural Information Processing Systems, pp. 1908–1918 (2019)
13. Cheng, Y., Wang, D., Zhou, P., Zhang, T.: A survey of model compression and acceleration for deep neural networks (2020)
14. Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. University of Montreal 1341 (2009)
15. Fu, R., Hu, Q., Dong, X., Guo, Y., Gao, Y., Li, B.: Axiom-based GradCAM: towards accurate visualization and explanation of CNNs. arXiv preprint arXiv:2008.02312 (2020)
16. Goodfellow, I.J., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014)
17. Goodman, B., Flaxman, S.: European union regulations on algorithmic decision-making and a right to explanation. arXiv preprint arXiv:1606.08813 (2016)
18. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: a survey (2020)
19. Hart, A., Wyatt, J.: Evaluating black-boxes as medical decision aids: issues arising from a study of neural networks. Med. Inform. 15(3), 229–236 (1990)
20. Hayes, T.L., Kafle, K., Shrestha, R., Acharya, M., Kanan, C.: Remind your neural network to prevent catastrophic forgetting. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) European Conference on Computer Vision, pp. 466–483. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58598-3_28
21. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
22. Kamimura, R.: Collective mutual information maximization to unify passive and positive approaches for improving interpretation and generalization. Neural Netw. 90, 56–71 (2017)
23. Kamimura, R.: Neural self-compressor: collective interpretation by compressing multi-layered neural networks into non-layered networks. Neurocomputing 323, 12–36 (2019)
24. Kindermans, P.J., et al.: The (un)reliability of saliency methods. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L., Müller, K.R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 267–280. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6_14
25. Kirkpatrick, J., et al.: Overcoming catastrophic forgetting in neural networks. Proc. Nat. Acad. Sci. 114(13), 3521–3526 (2017)
26. Kukačka, J., Golkov, V., Cremers, D.: Regularization for deep learning: a taxonomy. arXiv preprint arXiv:1710.10686 (2017)
27. Leiva-Murillo, J.M., Artés-Rodríguez, A.: Maximization of mutual information for supervised linear feature extraction. IEEE Trans. Neural Networks 18(5), 1433–1441 (2007)
28. Linsker, R.: Self-organization in a perceptual network. Computer 21(3), 105–117 (1988)
29. Linsker, R.: How to generate ordered maps by maximizing the mutual information between input and output signals. Neural Comput. 1(3), 402–411 (1989)
30. Linsker, R.: Local synaptic learning rules suffice to maximize mutual information in a linear network. Neural Comput. 4(5), 691–702 (1992)
31. Linsker, R.: Improved local learning rule for information maximization and related applications. Neural Netw. 18(3), 261–265 (2005)
32. Luo, P., Zhu, Z., Liu, Z., Wang, X., Tang, X.: Face model compression by distilling knowledge from neurons. In: Thirtieth AAAI Conference on Artificial Intelligence (2016)
33. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083 (2017)
34. Montavon, G., Samek, W., Müller, K.R.: Methods for interpreting and understanding deep neural networks. Digital Signal Process. 73, 1–15 (2018)
35. Moro, S., Cortez, P., Rita, P.: A data-driven approach to predict the success of bank telemarketing. Decis. Support Syst. 62, 22–31 (2014)
36. Neill, J.O.: An overview of neural network compression. arXiv preprint arXiv:2006.03669 (2020)
37. Nguyen, A., Yosinski, J., Clune, J.: Understanding neural networks via feature visualization: a survey. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L., Müller, K.R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp. 55–76. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6_4
38. Olden, J.D., Jackson, D.A.: Illuminating the “black box”: a randomization approach for understanding variable contributions in artificial neural networks. Ecol. Model. 154(1–2), 135–150 (2002)
39. Principe, J.C.: Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives. Springer, New York (2010). https://doi.org/10.1007/978-1-4419-1570-2
40. Principe, J.C., Xu, D., Fisher, J.: Information theoretic learning. In: Unsupervised Adaptive Filtering, vol. 1, pp. 265–319 (2000)
41. Qiu, F., Jensen, J.: Opening the black box of neural networks for remote sensing image classification. Int. J. Remote Sens. 25(9), 1749–1768 (2004)
42. Rumelhart, D.E., Hinton, G.E., Williams, R.: Learning internal representations by error propagation. In: Rumelhart, D.E., Hinton, G.E., et al. (eds.) Parallel Distributed Processing, vol. 1, pp. 318–362. MIT Press, Cambridge (1986)
43. Rumelhart, D.E., McClelland, J.L.: On learning the past tenses of English verbs. In: Rumelhart, D.E., Hinton, G.E., Williams, R.J. (eds.) Parallel Distributed Processing, vol. 2, pp. 216–271. MIT Press, Cambridge (1986)
44. Rumelhart, D.E., Zipser, D.: Feature discovery by competitive learning. In: Rumelhart, D.E., Hinton, G.E., et al. (eds.) Parallel Distributed Processing, vol. 1, pp. 151–193. MIT Press, Cambridge (1986)
45. Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: GradCAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626 (2017)
46. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013)
47. Spining, M., Darsey, J., Sumpter, B., Nold, D.: Opening up the black box of artificial neural networks. J. Chem. Educ. 71(5), 406 (1994)
48. Torkkola, K.: Nonlinear feature transform using maximum mutual information. In: Proceedings of International Joint Conference on Neural Networks, pp. 2756–2761 (2001)
49. Torkkola, K.: Feature extraction by non-parametric mutual information maximization. J. Mach. Learn. Res. 3, 1415–1438 (2003)
50. Van Hulle, M.M.: The formation of topographic maps that maximize the average mutual information of the output responses to noiseless input signals. Neural Comput. 9(3), 595–606 (1997)
51. Varshney, K.R., Alemzadeh, H.: On the safety of machine learning: cyber-physical systems, decision sciences, and data products. Big Data 5(3), 246–255 (2017)
DAC–Deep Autoencoder-Based Clustering: A General Deep Learning Framework of Representation Learning
Si Lu and Ruisi Li
Portland State University, Portland, USA
Abstract. Clustering plays an essential role in many real-world applications, such as market research, pattern recognition, data analysis, and image processing. However, due to the high dimensionality of the input features, the data fed to clustering algorithms usually contains noise and can thus lead to inaccurate clustering results. While traditional dimension reduction and feature selection algorithms can be used to address this problem, the simple heuristic rules used in those algorithms rest on particular assumptions. When those assumptions do not hold, these algorithms might not work. In this paper, we propose DAC, Deep Autoencoder-based Clustering, a generalized data-driven framework to learn clustering representations using deep neural networks. Experimental results show that our approach can effectively boost the performance of the K-Means clustering algorithm on a variety of dataset types.
Keywords: Clustering · K-Means · Representation learning · Deep neural networks · Deep autoencoder
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 205–216, 2022. https://doi.org/10.1007/978-3-030-82193-7_13
1 Introduction
Clustering is the task of grouping samples such that the ones in the same group are more similar to each other than to the ones in other groups. Nowadays, clustering serves as a basic and essential pre-processing step in many real-world applications. For example, it can be used to help with fake news identification [6], document analysis [16], marketing and sales, etc. Specifically, clustering algorithms can extract useful information for these applications by grouping data according to a variety of similarity metrics and grouping schemes. For example, similar patches can be used for image denoising [1–3] or depth enhancement [9], and clustering can be used to find good similar patches [8]. For the samples to be properly assigned to different groups (called clusters), meaningful feature values of the samples need to be obtained first. However, in real-world applications, the data we get is often high-dimensional [5] and usually contains noise, which makes it difficult for clustering to succeed. For example,
in the MNIST dataset [7], each input hand-written digit image has 784 pixels. While we know some pixels (e.g. the ones at image corners) might not be as useful as others (e.g. the ones around image centers), it is difficult to manually distinguish them in clustering. Traditional dimensionality reduction algorithms, namely, Principal Component Analysis (PCA) [10], Linear Discriminant Analysis (LDA) [4], and Canonical Correlation Analysis (CCA) [13], could be used to reduce the number of features. In addition, feature selection algorithms can be used to select a set of useful and noiseless features from the original ones. These algorithms aim to extract the core information from redundant and correlated high-dimensional input features. However, these algorithms often fail, mainly for two reasons. Firstly, most of them require complex mathematical analysis, which is difficult and time consuming. Secondly, there is no single approach that works for all types of datasets. Different datasets can have different dimensions and data sizes, and might even be used in totally different applications. Some datasets are linear and some are non-linear. As a result, it is difficult to find a method that works on all types of datasets. Recently, with the emergence of powerful deep neural networks, deep learning-based approaches have been introduced to learn better data representations and achieve appealing performance improvements for clustering algorithms. One simple approach is to learn representations using deep autoencoders. Specifically, the original high-dimensional input features are fed into an encoder that generates a low-dimensional output. This output is further fed to a decoder that tries to recover the raw input data as closely as possible. However, most of the existing approaches [11,15] take images as input and thus use convolutional neural networks.
In this paper, we propose DAC (Deep Autoencoder-based Clustering), a simple but more general framework for representation learning that takes feature vectors as input. Thus, our approach can be applied to more general datasets. In addition, we propose a scheme that uses the group labels to adaptively weight all input features. We incorporate this estimated weight into the loss function computation during training. Experimental results show that our approach effectively improves the performance of the K-Means clustering algorithm on different types of datasets, namely, MNIST, Fashion-MNIST [14], as well as the Human Activities and Postural Transitions Data Set (HAPT) [12]. The rest of the paper is organized as follows: in Sect. 2 we give an overview of our deep autoencoder-based clustering. We then describe the deep autoencoder for representation learning in more detail in Sect. 3. We show experimental results in Sect. 4, discuss limitations in Sect. 5, and conclude in Sect. 6.
2 Overview of Deep Autoencoder-Based Clustering
Figure 1 shows an overview of our deep autoencoder-based clustering framework. There are two main steps: training and clustering testing. In the training step, a deep autoencoder with an encoder and a decoder is trained using the training set.
Fig. 1. Overview of our deep autoencoder-based clustering on the MNIST dataset. The autoencoder (consisting of an encoder and a decoder) tries to encode and decode the input features such that the decoded output is as close to the input as possible. The input size is 28 × 28 = 784, and the size of the learned low-dimensional representation is 10. In the testing stage, the learned encoder output is fed into the classic K-Means algorithm to do clustering.
Here a flattened input vector is fed into the multi-layer deep encoder, which outputs a low-dimensional learned representation. This learned representation is further fed into a decoder that tries to recover an output of the same size as the input. The training process of this autoencoder tries to reconstruct the input as closely as possible. In the following clustering step, we apply the autoencoder to the testing set. The output of the encoder (the learned representations) is then fed to a classic K-Means algorithm to do clustering. The learned low-dimensional representation vector contains the key information of the given input, and thus yields better clustering results.
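The clustering step described above can be sketched as a minimal Lloyd-style K-Means applied to encoder outputs. This is an illustration under stated assumptions, not the authors' implementation: the names `kmeans` and `squared_distance` are hypothetical, the `encoded` list is a made-up stand-in for learned representations (2-D here rather than the 10-D representation of Fig. 1), and the initial centroids are fixed by hand so that the result is deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iters=100):
    """Minimal Lloyd's algorithm; returns (assignments, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                dim = len(members[0])
                centroids[j] = [
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)
                ]
    return assignments, centroids


# Hypothetical encoder outputs (stand-ins for learned representations).
encoded = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
labels, centers = kmeans(encoded, initial_centroids=[encoded[0], encoded[2]])
print(labels)
```

In a real pipeline the `encoded` vectors would come from the trained encoder applied to the testing set, and a library implementation of K-Means with several random restarts would normally replace this sketch.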
3
Deep Autoencoder for Representation Learning
The architecture of our deep autoencoder for representation learning is shown in Fig. 1. As can be seen, the model is not as complex as some advanced neural networks. The reason is that we want to avoid over-fitting in two respects. First, we do not want our model to over-fit the training dataset at the expense of the testing dataset. Second, we do not want our model to over-fit the reconstruction problem itself at the expense of the clustering problem. Thus, we select a model of moderate complexity.
3.1 Encoder
The encoder aims to encode or compress the input data into a smaller size representation, and at the same time preserve as much key information as possible.
As shown in Fig. 1, the encoder consists of 8 layers, including the input layer and the learned representation output layer. The input layer is normalized such that all its values are in the range (0, 1). Starting from the input, each larger layer is fully connected to the next smaller layer, followed by activation layers. There are two main types of activation layers, Relu and Tanh, as shown in Eqs. 1 and 2. Adding the Relu layers introduces nonlinearity to our model, making it more robust against non-linear input data. The Tanh layer, on the other hand, transforms the data into a normalized range of (−1, 1), to alleviate the gradient vanishing/exploding problem.
\[ \mathrm{Relu}(x) = \max(0, x) \tag{1} \]
\[ \cosh(x) = \frac{e^{x} + e^{-x}}{2}, \qquad \sinh(x) = \frac{e^{x} - e^{-x}}{2}, \qquad \tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{2} \]
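The two activation functions of Eqs. 1 and 2 can be written directly from their definitions. The sketch below is only meant to make the output ranges concrete; the function names are illustrative, and in practice `math.tanh` (or the framework's built-in layer) is numerically safer than the explicit exponential form for large inputs.

```python
import math


def relu(x):
    """Eq. 1: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def tanh(x):
    """Eq. 2: sinh(x) / cosh(x), written with explicit exponentials."""
    e_pos = math.exp(x)
    e_neg = math.exp(-x)
    return (e_pos - e_neg) / (e_pos + e_neg)


for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
    # Relu output is non-negative; Tanh output lies strictly in (-1, 1).
    print(f"x={value:+.1f}  relu={relu(value):.3f}  tanh={tanh(value):+.3f}")
```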
3.2 Decoder
The decoder aims to decode or decompress the encoded output to reconstruct the original input data as closely as possible. It contains nine layers, including its input layer, which is the output of the encoder, and the final output layer. Specifically, each smaller layer is fully connected to the next larger layer, followed by a Tanh activation layer. In addition, the decoder has a Sigmoid activation layer (shown in Eq. 3) at the final stage to force the output values to lie in the range (0, 1).
\[ \mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{3} \]
3.3 Objective Function
Clustering-Weighted MSE Loss. While the goal of the classic autoencoder is to reconstruct the original input as closely as possible, it counts each input feature value equally. However, individual input features may contribute differently to the final clustering results. For example, in the MNIST dataset, the pixels at the four corners of almost all images are black (with zero intensity input values), and thus have no impact on the final clustering at all. On the other hand, some pixels around the center of the images are likely to play more important roles. We thus propose a scheme to compute a clustering-weighted MSE loss that lets the autoencoder focus more on the reconstruction of the more important input feature values, as shown in Eq. 4, where y_i is the ith input feature value, \hat{y}_i is its reconstruction, w_i is the weight of the ith feature, and n is the number of features.
\[ L_{cmse} = \frac{\sum_{i=1}^{n} w_i \, (y_i - \hat{y}_i)^2}{n} \tag{4} \]
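As a worked instance of Eq. 4, the following sketch computes the clustering-weighted MSE for one sample. It is an illustration only: the function name and the three-feature example are hypothetical, and a real training loop would evaluate the same expression on tensors inside the deep learning framework.

```python
def clustering_weighted_mse(y_true, y_pred, weights):
    """Eq. 4: mean over features of w_i * (y_i - y_hat_i)^2."""
    n = len(y_true)
    if n == 0 or len(y_pred) != n or len(weights) != n:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(
        w * (y - y_hat) ** 2
        for w, y, y_hat in zip(weights, y_true, y_pred)
    ) / n


# Hypothetical 3-feature sample: the first feature has weight 0, so its
# reconstruction error is ignored, as for an always-black corner pixel.
y_true = [0.0, 1.0, 0.5]
y_pred = [0.2, 0.8, 0.5]
weighted = clustering_weighted_mse(y_true, y_pred, weights=[0.0, 1.0, 0.5])
uniform = clustering_weighted_mse(y_true, y_pred, weights=[1.0, 1.0, 1.0])
print(f"weighted = {weighted:.4f}, plain MSE = {uniform:.4f}")
```

With all weights equal to 1 the expression reduces to the ordinary MSE, which makes it easy to compare the weighted and unweighted objectives on the same reconstruction.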
Fig. 2. A map of the clustering weight computed for MNIST dataset using 1000 samples from the training set. It could be seen that pixels at boundaries and corners are less important than the ones around image centers.
Here w_i is the weight of each feature. It is computed using all ith feature values sampled from a subset of the training dataset with m samples. Denote all ith feature values as {x_{ik} | k = 1, 2, .., m} and the corresponding ground truth group/cluster labels of the m samples as {l_k | k = 1, 2, .., m}. The corresponding feature weight will be large if both of the two following conditions are met. First, all sampled values in the same groups/clusters have small differences. Second, all sampled values in different groups/clusters have large differences. Thus, the weight is computed as:
\[ w_i = \frac{\sum_{l_p = l_q} e^{-(x_{ip} - x_{iq})^2}}{\sum_{l_p = l_q} 1} \cdot \frac{\sum_{l_p \neq l_q} \left(1 - e^{-(x_{ip} - x_{iq})^2}\right)}{\sum_{l_p \neq l_q} 1} \tag{5} \]
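The weight of a single feature can be sketched as below, following the two conditions stated in the text: the first factor averages a closeness term over same-label pairs, and the second averages its complement over different-label pairs. Two details are assumptions made for this sketch and are not stated in the paper: pairs are taken as unordered pairs of distinct samples, and a weight of 0 is returned when either kind of pair is missing. The function name is hypothetical.

```python
import math
from itertools import combinations


def clustering_weight(values, labels):
    """Weight of one feature from its sampled values and cluster labels."""
    same_sum, same_count = 0.0, 0
    diff_sum, diff_count = 0.0, 0
    for p, q in combinations(range(len(values)), 2):
        closeness = math.exp(-(values[p] - values[q]) ** 2)
        if labels[p] == labels[q]:
            same_sum += closeness          # near 1 when same-label values agree
            same_count += 1
        else:
            diff_sum += 1.0 - closeness    # near 1 when different-label values differ
            diff_count += 1
    if same_count == 0 or diff_count == 0:
        return 0.0  # assumption of this sketch: undefined case maps to 0
    return (same_sum / same_count) * (diff_sum / diff_count)


labels = [0, 0, 1, 1]
constant_feature = [0.0, 0.0, 0.0, 0.0]        # e.g. an always-black corner pixel
discriminative_feature = [0.0, 0.0, 3.0, 3.0]  # separates the two groups
print(clustering_weight(constant_feature, labels))
print(clustering_weight(discriminative_feature, labels))
```

The constant feature receives weight 0 because it never differs across groups, while the feature that is identical within groups and far apart across groups receives a weight close to 1.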
Figure 2 shows a map of the clustering weight computed for MNIST dataset using 1000 samples from the training set. Pixels at boundaries and corners are less important than the ones around image centers, thus have smaller weights (white means larger weights).
Final Objective Function. The final objective function then combines the Clustering-weighted MSE Loss and a standard L2 norm regularization, as shown in Eq. 6. Here the L2 norm regularization L_r is computed using all parameters from the autoencoder. β is a balancing factor with a default value of 0.00001.
\[ L = L_{cmse} + \beta \cdot L_r \tag{6} \]
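Eq. 6 can be illustrated with a few lines of arithmetic. The sketch assumes that L_r is the sum of squared parameter values (a common form of L2 regularization; the paper does not spell out the exact form), and the function names, the loss value, and the three-parameter list are hypothetical.

```python
def l2_regularization(parameters):
    """Assumed form of L_r: sum of squared parameter values."""
    return sum(p ** 2 for p in parameters)


def final_objective(cmse_loss, parameters, beta=1e-5):
    """Eq. 6: clustering-weighted MSE plus beta times the L2 term."""
    return cmse_loss + beta * l2_regularization(parameters)


# Hypothetical numbers: a reconstruction loss and a tiny parameter list.
loss = final_objective(cmse_loss=0.02, parameters=[1.0, -2.0, 0.5])
print(f"{loss:.7f}")
```

With the default β = 0.00001 the regularization term stays small relative to the reconstruction term unless the parameters grow large, which is the balancing role the text assigns to β.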
Fig. 3. Samples of the MNIST dataset.
4 Experimental Results
4.1 Data Set
We evaluate our approach on the classic MNIST hand-written digits dataset. This dataset has 50,000 images as the training set and 10,000 images as the testing set. There are 10 groups in total. We show some samples of MNIST dataset in Fig. 3.
4.2 Measurement Metrics
To evaluate our framework, we apply our trained encoder to the testing dataset. We then compare the representations generated by our trained encoder with the raw input features by feeding each of them to the K-Means algorithm. To measure the performance of clustering algorithms, we use the Adjusted Rand Index (ARI). Specifically, this metric computes a similarity between two clustering results by considering all pairs of samples and counting pairs that are assigned to the same or different clusters in the predicted and ground truth clustering results. The proposed approach is denoted as DAC.
4.3 Experiment Setup
We implement our framework in Python with PyTorch and test it on a desktop with an RTX 2080 Ti GPU. We train the autoencoder for 200 epochs using the Adam optimizer. The initial learning rate is set to 0.003 and decreases with the number of epochs during training. Model Complexity. Our model for MNIST has 944.86 k parameters and a computational complexity of 0.001 GMACs (multiply-accumulate operations) during inference. The average processing time per frame is 0.42 ms, corresponding to about 2381 frames per second (FPS).
Table 1. Clustering results on MNIST testing dataset.
        K-Means   PCA      DAC
ARI     0.3477    0.4026   0.6624
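The ARI scores reported in Table 1 compare a predicted clustering with the ground truth labels by counting agreeing pairs and correcting for chance. A minimal sketch of the standard contingency-table formulation is given below; it is not taken from the paper's code, the function name is illustrative, and `math.comb` requires Python 3.8 or later.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same samples."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two labelings of equal length >= 2")
    # Pairs of samples that share a cell of the contingency table.
    index = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster within each labeling separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)   # chance-level agreement
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (index - expected) / (max_index - expected)


truth = [0, 0, 1, 1]
print(adjusted_rand_index(truth, [1, 1, 0, 0]))  # same grouping, labels renamed
print(adjusted_rand_index(truth, [0, 1, 0, 1]))  # grouping cuts across the truth
```

Because the index depends only on which samples are grouped together, renaming the cluster labels does not change the score, which is why it suits K-Means output whose cluster numbering is arbitrary.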
Fig. 4. Sample results of our trained autoencoder on MNIST dataset. Top: Raw input images. Bottom: Reconstructed images
4.4 Results on MNIST
Table 1 shows the quantitative performance of the proposed approach in terms of ARI. Compared with K-Means on the raw input features, our approach (DAC) boosts the K-Means algorithm's ARI from 0.3477 to 0.6624, which is a 90.50% boost. We also compare our approach with PCA feature dimension reduction, which reduces the feature dimension from 784 to 10. From Table 1, it can be seen that while PCA improves the K-Means clustering performance from 0.3477 to 0.4026, our approach (DAC) still has the best performance. We also show some of the results reconstructed by our trained autoencoder in Fig. 4. It shows that our trained autoencoder can properly reconstruct the raw input hand-written digits.
4.5 Results on Other Datasets
To test the robustness of our approach against different data types, we apply our method to two other datasets: Fashion-MNIST [14], and Human Activities and Postural Transitions Data Set (HAPT) [12]. Fashion-MNIST is a similar dataset to MNIST, with the same image format and image size. It has 60,000 images as training set and 10,000 images as testing set. The only difference is the content: it contains images of 10 types of clothes. The ten categories are shown in Table 2. We show some samples of this dataset in Fig. 5.
Table 2. Fashion-MNIST category labels.
T-shirt/top   Trouser   Pullover   Dress   Coat
Sandal        Shirt     Sneaker    Bag     Ankle boot
Fig. 5. Samples of the Fashion-MNIST dataset.
The Human Activities and Postural Transitions Data Set was captured with smartphone sensors [12]. The authors captured 3-axial linear acceleration and 3-axial angular velocity at a constant rate of 50 Hz using the embedded accelerometer and gyroscope of a smartphone (Samsung Galaxy S II). There were 30 volunteers, aged 19–48 years. In the data capturing experiment, each volunteer performed one of twelve activities. There are six basic activities: three static postures (standing, sitting, lying) and three dynamic activities (walking, walking downstairs and walking upstairs). Another six postural transitions that occurred between the static postures have also been added to the dataset. These are: stand-to-sit, sit-to-stand, sit-to-lie, lie-to-sit, stand-to-lie, and lie-to-stand. All twelve types of activities are shown in Table 3.
Table 3. HAPT category labels.
walking      walking upstairs   walking downstairs   sitting
standing     laying             stand to sit         sit to stand
sit to lie   lie to sit         stand to lie         lie to stand
The sensor signals (accelerometer and gyroscope) were then denoised with noise filters. The authors then sampled the signals in fixed-width sliding windows of 2.56 s with 50% overlap (128 readings/window), yielding 561 features per sample. Each sample is captured while the volunteer is performing a single type of activity. During the capture process, 70% of the volunteers were randomly selected to generate the training set and 30% were selected to generate the testing set. In total, this dataset has 7767 samples for training and 3162 samples for testing.
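The windowing arithmetic described above (2.56 s at 50 Hz gives 128 readings per window, and 50% overlap gives a step of 64 readings) can be sketched as follows. The function name and the synthetic 10-second signal are hypothetical; the denoising filters and the extraction of the 561 features from each window are part of the dataset's own preprocessing and are not reproduced here.

```python
def sliding_windows(signal, window_size=128, overlap=0.5):
    """Split a 1-D signal into fixed-width windows with fractional overlap."""
    step = int(window_size * (1 - overlap))
    if step < 1:
        raise ValueError("overlap too large for this window size")
    return [
        signal[start:start + window_size]
        for start in range(0, len(signal) - window_size + 1, step)
    ]


sampling_rate_hz = 50
window_seconds = 2.56
window_size = round(sampling_rate_hz * window_seconds)  # 128 readings

# Synthetic stand-in for one sensor axis: 10 s of readings at 50 Hz.
signal = list(range(10 * sampling_rate_hz))
windows = sliding_windows(signal, window_size=window_size, overlap=0.5)
print(window_size, len(windows), windows[1][0])
```

Consecutive windows share half of their readings, so each reading away from the signal boundaries contributes to two samples.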
Fig. 6. Sample results of our trained autoencoder on Fashion-MNIST dataset. Top: Raw input images. Bottom: Reconstructed images
We apply our method to the Fashion-MNIST dataset and report the results in Table 4. As Fashion-MNIST is a more complex dataset, we modified the autoencoder and show the modified architecture in Fig. 7. It can be seen that, compared with using raw input features in K-Means clustering, our method boosts ARI from 0.3039 to 0.4702, an improvement of 54.7%. We then apply our method to the HAPT dataset and report the results in Table 5. As this data set's inputs are of lower dimension than MNIST, we modified the autoencoder accordingly and show the modified architecture in Fig. 8. It can be seen that even with this temporal sequence dataset, our method effectively improves the K-Means algorithm's performance by 30%. These results also show that our method can be generally applied to other data types. We also show some of the results reconstructed by our trained autoencoder in Fig. 6. It shows that our trained autoencoder can properly reconstruct the raw input fashion images.
Table 4. Clustering results on Fashion-MNIST testing dataset.
        K-Means   DAC
ARI     0.3039    0.4702
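The percentage boosts quoted in the text are relative improvements of the DAC ARI over the raw K-Means ARI. The short sketch below recomputes them from the ARI values reported in Tables 1, 4, and 5; the function name is illustrative.

```python
def relative_improvement(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (K-Means ARI, DAC ARI) as reported in Tables 1, 4, and 5.
reported_ari = {
    "MNIST": (0.3477, 0.6624),
    "Fashion-MNIST": (0.3039, 0.4702),
    "HAPT": (0.4290, 0.5594),
}
for name, (kmeans_ari, dac_ari) in reported_ari.items():
    print(f"{name}: {relative_improvement(kmeans_ari, dac_ari):.1f}%")
# MNIST: 90.5%
# Fashion-MNIST: 54.7%
# HAPT: 30.4%
```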
Table 5. Clustering results on HAPT testing dataset.
        K-Means   DAC
ARI     0.4290    0.5594
5 Limitation
While the proposed approach efficiently improves the performance of K-Means clustering, it has some limitations. First, the models used are not adaptive to
Fig. 7. Overview of our deep autoencoder-based clustering on Fashion-MNIST dataset.
Fig. 8. Overview of our deep autoencoder-based clustering on HAPT dataset.
different input sizes. This means that we need to train different models for different data sets with various sample input sizes. Second, we use only limited information from the ground truth labels during training. In the future, we plan to exploit more details from the ground truth labels by using more advanced network architectures. For example, when feeding samples from different digit groups to the model, the output encoded features should be as different as possible. On the other hand, when feeding samples from the same digit group to the model, the output encoded features should be similar to each other.
6 Conclusion
In this paper, we propose DAC, Deep Autoencoder-based Clustering, a generalized data-driven framework to learn low-dimensional clustering representations using trained deep neural networks. Specifically, we train a multi-layer deep autoencoder to encode and decode the raw input samples. The output of the encoder is then fed to a classic K-Means algorithm to do clustering. We design a scheme to compute a clustering-based weight in the training objective function to train the autoencoder and let it focus more on the reconstruction of more important features. Experimental results show that our approach effectively boosts the performance of a classic clustering algorithm, K-Means, by about 90% on the MNIST dataset. In addition, our method can also be applied to other types of clustering datasets, such as Fashion-MNIST and the Human Activities and Postural Transitions Data Set (HAPT), where our framework still improves the K-Means algorithm's performance by 30% to 55%.
References
1. Buades, A., Coll, B., Morel, J.M.: A non-local algorithm for image denoising. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 60–65 (2005)
2. Chen, F., Zhang, L., Yu, H.: External patch prior guided internal clustering for image denoising. In: IEEE International Conference on Computer Vision (ICCV), pp. 603–611 (2015)
3. Dabov, K., Foi, A., Katkovnik, V., Egiazarian, K.: Image denoising by sparse 3D transform-domain collaborative filtering. IEEE Trans. Image Process. 16(8), 2080–2095 (2007)
4. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, vol. 2. John Wiley & Sons Inc., New York (2001)
5. Han, J., Pei, J., Kamber, M.: Data Mining: Concepts and Techniques. Elsevier, New York (2011)
6. Hosseinimotlagh, S., Papalexakis, E.E.: Unsupervised content-based identification of fake news articles with tensor decomposition ensembles. In: Proceedings of the Workshop on Misinformation and Misbehavior Mining on the Web (MIS2) (2018)
7. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
8. Lu, S.: Good similar patches for image denoising. In: 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1886–1895. IEEE (2019)
9. Lu, S., Ren, X., Liu, F.: Depth enhancement via low-rank matrix completion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3390–3397 (2014)
10. Pearson, K.: LIII. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 2(11), 559–572 (1901)
11. Yunchen, P., et al.: Variational autoencoder for deep learning of images, labels and captions. Adv. Neural Inf. Process. Syst. 29, 2352–2360 (2016)
12. Reyes-Ortiz, J.-L., Oneto, L., Samà, A., Parra, X., Anguita, D.: Transition-aware human activity recognition using smartphones. Neurocomputing 171, 754–767 (2016)
216
S. Lu and R. Li
13. Sun, Q.-S., Zeng, S.-G., Liu, Y., Heng, P.-A., Xia, D.-S.: A new method of feature fusion and its application in image recognition. Pattern Recogn. 38(12), 2437–2448 (2005) 14. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017) 15. Yang, X., Deng, C., Zheng, F., Yan, J., Liu, W.: Deep spectral clustering using dual autoencoder network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019 16. Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 515–524 (2002)
Enhancing LSTM Models with Self-attention and Stateful Training
Alexander Katrompas and Vangelis Metsis
Texas State University, San Marcos, TX 78666, USA
{amk181,vmetsis}@txstate.edu
Abstract. When using LSTM networks to model time-series data, the standard approach is to segment the continuous data stream into fixed-size sequences and then independently feed each sequence to the LSTM network for training in a stateless fashion (i.e. in a fashion that resets the LSTM cell state per fixed-size sequence). As a result, long-term dependencies between patterns appearing in the data stream may be lost. In this work, we introduce a hybrid deep learning architecture that enables long-term inter-sequence modeling while maintaining focus on each sequence’s local characteristics. We use stateful LSTM training to model long-term dependencies that span the fixed-size sequences. We also utilize the attention mechanism to optimally learn each training sequence by focusing on the parts of each sequence that affect the classification outcome the most. Our experimental results show the advantages of each of these two mechanisms independently and in conjunction, compared to the standard stateless LSTM training approach.
Keywords: Recurrent neural networks · LSTM · Deep learning · Attention mechanisms · Time series data · Self-attention
1 Introduction
Recurrent neural networks (RNNs) are well known for their ability to model temporal dynamic data, especially in their ability to predict temporally correlated events [24]. RNNs form a family of neural networks in which a key feature is the additional input of the previous time-step’s network “state,” also known as “memory”. This memory allows RNNs to retain temporal relationships by creating an association between the current time-step and the previous time-step, thereby representing a chain of causation [8,24]. A vanilla RNN’s memory length is relatively short and typically newer information is weighted more heavily than older information. However, ideally, an RNN should not only retain longer past information, but should also weigh information based on its importance to the model and not simply on its proximity in time. Well-established developments in these areas are the RNN architectural variant known as the Long Short-Term Memory (LSTM) network [9,15], as well as the Back-propagation Through Time (BPTT) learning algorithm [4,28].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 217–235, 2022. https://doi.org/10.1007/978-3-030-82193-7_14
In this study, LSTM networks and variants of BPTT are studied with two further enhancements: attention mechanisms and stateful training. The goal of these enhancements is to improve RNN memory in terms of memory length (stateful training), feature importance, and inter-sequence weighting (self-attention). We have built a hybrid deep neural network architecture that enhances the ability of LSTM networks to both “focus” on importance within a sequence and “remember” long-term patterns, thereby not only increasing accuracy but also reducing or eliminating the need for extensive data preparation. The first enhancement, a mechanism known as attention [1,11], allows the network to focus on more salient sequences within the LSTM memory space. Specifically, in this discussion, we examine self-attention, also known as intra-attention, which focuses on the important relationships between features within sequences [10,25]. This enhancement is architectural at the network level, altering the structure of the network while leaving the LSTM layer untouched. The second enhancement, a training model enhancement, overrides the typical LSTM back-propagation through time algorithm. This enhancement, which we term “stateful training,” allows the LSTM layer to retain its state between error correction updates while also retaining its “batch update” behavior, thereby capturing long sequences of information in an efficient manner. These enhancements are studied individually and in conjunction so that four models are compared and contrasted for temporal classification performance: 1) baseline LSTM, 2) LSTM w/Attention, 3) stateful LSTM, 4) stateful LSTM w/Attention. The remainder of this paper is organized as follows. We first introduce some background work on recurrent neural networks, LSTM networks, and the attention mechanism. Subsequently, we describe the details of our methodology and our proposed solution.
We then evaluate our method and compare its performance against a baseline LSTM model as well as against results of other studies on the same publicly available datasets. We discuss our observations on the training behavior of the proposed architecture. Finally, we summarize and conclude this work.
2 Background
RNNs and their associated learning algorithms are typically some variation or enhancement to the standard feed-forward neural network architecture and the back-propagation learning algorithm. A brief introduction to the feed-forward back-propagation (FFBP) algorithm is presented here to frame the challenge and solutions presented in this study. More details about these algorithms can be found in [5,22,25].
2.1 Feed-Forward Networks, Recurrent Neural Networks, Back Propagation Through Time
An FFBP network trains very simply by feeding information forward through the network and then back-propagating errors in the reverse direction, typically with some form of gradient descent. In the simplest case, each neuron’s activation in the network is “fed forward” through a simple sigmoid activation per neuron, errors are calculated in the output layer, and back-propagated through the network for correction of weights between neurons. This feed-forward and back-propagation process is executed with each iteration through the data. A complete pass through all data is known as an epoch [6]. An RNN and its training are derived from the basic FFBP network. The most basic RNN simply captures its current “state” as the output of the RNN layer, and “feeds” this output back to itself as an extra input in the next time step. Typically the RNN structure forms the first layer of a deeper network where a feed-forward network (FFN) is fed from the output of the RNN. Layered RNNs are also common. In the case of a layered network, the output of the RNN is “hidden” within the network and is therefore called a hidden layer, its neurons termed hidden nodes, and its output termed hidden output [21]. In addition to the aforementioned architectural change, the BP learning algorithm is typically modified into what is known as Back-propagation Through Time (BPTT). Rather than updating the weights with each iteration of input data, input is “batched,” where each batch is some uniform fraction of the total data. Data are fed forward in batches without error correction, collecting all neural output, and the network is then updated over the entire batch at once [4,23,28].
2.2 Long Short-Term Memory and Truncated BPTT
Long Short-Term Memory (LSTM) networks are a variant of RNN which not only feed the previous hidden output back into the input of the LSTM but also maintain a separate “cell state,” which updates with each iteration, independent of batch error correction. This cell state is not directly affected by the back-propagation of errors, thereby giving the network the ability to avoid the well-known vanishing/exploding gradient problem [21]. Unlike vanilla RNNs, LSTMs can learn tasks which require memories of events that happened hundreds of discrete time-steps earlier [9,21]. LSTMs also use the batched BPTT algorithm (aka Truncated BPTT, or TBPTT). In typical TBPTT some batch size, k, is chosen between 2 and n/2 where n is the number of instances in the training set. When training an LSTM the internal cell state of the LSTM is typically reset between batches. This reset effectively removes the ability of the network to retain state (i.e. memory) across batches. A form of TBPTT that allows for information flow across batch boundaries is known as accelerated TBPTT (A-TBPTT). In this case, k1 is chosen to be the batch size and k2 the error size, where k1 < k2. In other words, k1 determines when to correct the network and k2 the amount by which to correct. In this fashion, some portion of a previous batch’s state is incorporated into the current batch [4,23,28]. Extending the idea of A-TBPTT, one can imagine training the network using TBPTT but with maximal information on everything seen to that point (i.e. k1 < k2 where k2 is all information seen to that point). The advantage of this would be to retain the benefits of TBPTT while also maintaining maximal state information from one batch to the next.
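The (k1, k2) scheduling described above can be made concrete with a few lines of Python. This is only an illustrative sketch under one reading of the text: the function name `tbptt_windows` and the half-open index convention are assumptions made for the example and do not come from any library.

```python
def tbptt_windows(n_steps, k1, k2):
    """List the back-propagation windows used by TBPTT(k1, k2).

    An update is triggered every k1 time-steps, and each update reaches
    back over at most the k2 most recent steps. Windows are returned as
    half-open index intervals (start, end).
    """
    if k1 < 1 or k2 < 1:
        raise ValueError("k1 and k2 must be positive")
    windows = []
    for end in range(k1, n_steps + 1, k1):
        start = max(0, end - k2)      # never reach back before the first step
        windows.append((start, end))
    return windows


# k1 == k2: plain batched TBPTT, windows do not overlap
plain = tbptt_windows(6, 3, 3)
# k1 < k2: each update also covers part of the previous batch
overlapping = tbptt_windows(8, 2, 4)
# k2 == n: every update reaches back to the very first step
full_history = tbptt_windows(6, 2, 6)
```

With k1 = k2 nothing crosses a batch boundary. With k1 < k2 each correction also covers steps from the previous batch. With k2 equal to the full length, every window starts at step 0 and grows with the amount of data seen.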
However, if it were simply a matter of choosing k1 normally, and k2 to be n (i.e. all instances), to retain all state information, this would simply devolve into a very inefficient form of classical BPTT (i.e. k = n). This method of training also tends to “overload” the network with long past information unlikely to be relevant to the current time-step, thereby creating noise.
2.3 Self-attention
Attention mechanisms are a well-known technique in natural language processing using Seq2seq encoder-decoder models. Standard encoder-decoders generally operate with the encoder processing the input sequence and then “compressing” or “summarizing” the information into a context vector of a fixed length for passing to the decoder. A disadvantage of this fixed-length context vector is the inability of the system to remember longer sequences as well as weighing recent information as more important regardless of its true relevance. Attention mechanisms are designed to resolve these problems [1,11]. Self-attention, also known as intra-attention, is an attention mechanism relating different positions of a sequence in order to model dependencies between different parts of the sequence. This differs from general attention in that instead of seeking to discover the “important” parts of the sequence relating to the network output, self-attention seeks to find the “important” portions of the sequence that relate to each other. This is done in order to leverage those intra-sequence relationships to improve network predictions [3,10,16,17,25]. Originally designed for text processing, the benefit of self-attention can be seen in the following example. In order to understand the sentence, “the dog did not run home because it was too tired,” the word “it” must be related to the word “dog” or the sentence makes no sense. However, if we change the word “tired ” to “far,” then the word “it” must be related to the word “home.” Obviously, the relationship between “it” and the subject of the sentence is extremely important to the general understanding of the sentence as a whole. In general attention, the mechanism would seek to process the entire sentence and then emphasize the portions that are most important based on the correctness of network output. 
Conversely, self-attention seeks to relate portions of the sentence that are most important to each other prior to prediction, thereby enhancing understanding and prediction in a more context-specific manner [3,10,16,17]. This technique has proven so successful that in the case of text processing it has been shown to stand alone without the need for RNN or CNN layers and perform as well or better on its own [25].
2.4 Experimental Rationale
While it could be argued that text processing is temporal in nature since words have order and are related through time, strictly speaking, text processing is not time-series data. In fact, it could be argued that a text sentence, or even an entire paragraph, is more related to an image in that it represents a single “picture” conceptually in the mind of the reader. Indeed, attention mechanisms designed for text processing found almost immediate further success being adapted to image processing. This further emphasizes this “single concept” idea between image and text processing [29]. The analogy goes further in that attention in a sentence or paragraph is generally focusing on subject/verbs/adjectives for understanding, just as in an image attention is focusing on objects/actions/attributes. This study seeks to conduct a preliminary analysis of attention’s efficacy on true time-series data, specifically in temporal classification tasks. As detailed in the landmark paper, Attention is all you need [25], the temporal layer can be removed from text and image processing. However, we seek to understand attention’s role when the temporal aspect of the data is its primary feature. To test this, we re-introduce the LSTM layer and study the interplay between LSTM and attention layers where the LSTM layer is responsible for temporal relationships and attention is responsible for relationships between features. We propose that there is a benefit to understanding the data both “vertically” (i.e. through time) and “horizontally” (i.e. feature to feature) when learning true time-series data. This study seeks to investigate this empirically prior to the next logical step: theoretical study (should it prove worthy empirically).
3 Methodology
This study seeks to investigate the following challenge: How do we maintain maximum relevant temporal state information, without picking up noise and irrelevant information, without over-training, while also leveraging relevant feature importance? In other words, how do we make the LSTM maximally “stateful” but have it pay “attention” to only relevant information? The solutions proposed here study both the concept of statefulness to preserve information through batches, and the concept of “attention” to focus training on specific, short-term, feature-to-feature, high-value information.
3.1 Statefulness
In the context presented here, “statefulness” refers to the LSTM’s ability to preserve its cell state through batches [14,19]. Typically LSTMs are trained without any preservation of state between batches (i.e. k1 = k2 < n and n % k1 = 0). This can be partially solved through A-TBPTT. It should be noted that carrying state forward is not always desirable and this is highly data-dependent. Stateful training on data which has many short-term dependencies, and/or in which causation is a near-term event, and/or which has clear and uniform temporal “sections,” may actually be harmful to the model’s performance. However, what is of concern in this study is data that is continuous with longer-term relevant knowledge throughout the data. To achieve this we begin with setting the batch size to 1. This is a matter of the TensorFlow/Keras API used to model the data, and not part of the general algorithm. Setting batch size to 1 has the effect of making the training sequence equal to 1. This normally would cause the loss of all LSTM cell state information since the LSTM cell state will be reset with every iteration. However, we will alter the LSTM behavior to maintain state between batches (i.e. do not reset the cell state) by setting “stateful” to true (again, this is a matter of the TensorFlow/Keras API used as a method to achieve our algorithmic goals). In this programmatic form (batch size = 1, stateful = true), training is analogous to classical BP; however, we will also structure the data into time slices from 10s to 100s of steps (i.e. a “sequence”), thereby allowing TBPTT to be performed. LSTM cell state will be reset only at the end of an epoch, as opposed to at the end of a sequence, and multiple epochs will be presented. This can be seen in Algorithm 1, where the difference between common LSTM batched training and “stateful” training is the placement of the step, “reset LSTM cell state.” In typical LSTM training, this is performed automatically and immediately following the step, “execute TBPTT.” The end result of this altered training algorithm is an LSTM network that will maintain cell state throughout an entire data set (i.e. epoch) while still training and correcting in batches according to TBPTT [9,12,14,19,23,28].

Algorithm 1. Stateful Training Algorithm
Data: 3D matrix of r × c × s, where r is the number of training instances per sequence, c is the number of features, s is the number of sequences, and where N % s = 0, where N = total number of training instances.
Initialize network;
while epochs remaining do
    foreach s do
        for i ← 1 to r do
            propagate s_i forward;
            E += e_i;
        end
        execute TBPTT;
    end
    reset LSTM cell state;
end
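The placement of the reset step in Algorithm 1 can be illustrated with a toy loop. The sketch below is not the Keras code used in the study: the running-sum "state" is a hypothetical stand-in for the LSTM cell state, the TBPTT update itself is omitted, and the names `segment` and `train` are chosen only for this example.

```python
def segment(stream, r):
    """Cut a continuous stream into consecutive sequences of r time-steps."""
    if len(stream) % r != 0:
        raise ValueError("stream length must be a multiple of r")
    return [stream[i:i + r] for i in range(0, len(stream), r)]


def train(sequences, epochs, stateful):
    """Toy version of the training loop in Algorithm 1.

    Returns the state the "cell" holds when each sequence starts, which
    shows whether information is carried across sequence boundaries.
    """
    starting_states = []
    state = 0                          # stand-in for the LSTM cell state
    for _ in range(epochs):
        for seq in sequences:
            starting_states.append(state)
            for x in seq:              # "propagate s_i forward"
                state += x             # toy recurrence: a running sum
            # a real model would execute TBPTT here
            if not stateful:
                state = 0              # stateless: reset after every sequence
        state = 0                      # stateful: reset only at the end of an epoch
    return starting_states


sequences = segment([1, 2, 3, 4, 5, 6], r=2)
stateless_starts = train(sequences, epochs=2, stateful=False)
stateful_starts = train(sequences, epochs=2, stateful=True)
```

In the stateless run every sequence starts from a cleared state. In the stateful run each sequence starts from whatever the previous sequence left behind, and the state is cleared only when a new epoch begins.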
3.2 LSTM and Attention
When combined with LSTM architectures, attention operates by capturing all LSTM output within a sequence and training a separate layer to “pay attention” to (i.e. to weigh) some parts of the output more than others. Note that the LSTM is set to return sequences, i.e. for an input sequence $x = (x_1, x_2, \ldots, x_T)$ the LSTM layer produces the hidden vector sequence $h = (h_1, h_2, \ldots, h_T)$ and output $y = (y_1, y_2, \ldots, y_T)$ of the same length, by iterating the following equations from $t = 1$ to $T$:

$$h_t = \mathcal{H}(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \tag{1}$$

$$y_t = W_{hy} h_t + b_y \tag{2}$$

where the $W$ terms denote weight matrices, the $b$ terms denote bias vectors, and $\mathcal{H}$ is the hidden layer function. Details about LSTM networks can be found in [7]. Attention is essentially a neural network within a neural network, which is learning to weigh portions of a sequence for relative feature importance [27,30]. The general concept of attention can be modified to work with temporal classification problems where the sequences are a collection of instances of time-series data and the “decoding” is classification. In the models presented here, rather than a sequence of words, the sequences are fixed-length vectors generated by segmenting the continuous data stream. Each value of the sequence vector is a time-step (data point) represented as a numeric value. This value can be a sensor measurement, a stock market price, etc. [18]. The attention used in this study is multiplicative self-attention (see footnote 1) and uses the following attention mechanism:

$$h_t = \tanh(W_x x_t + W_h h_{t-1} + b_h) \tag{3}$$

$$e_t = \sigma(x_t^{\top} W_a x_{t-1} + b_t) \tag{4}$$

$$a_t = \mathrm{softmax}(e_t) \tag{5}$$

where $h_t$ is the hidden node output from the LSTM layer in a two-dimensional matrix (i.e. the entire hidden output, achieved in Keras through setting return_sequences to true). $e_t$ is the sigmoid activation output of the attention two-layer network, where $W_a$ is the attention network weights, producing a corresponding matrix of the attention network activations. $a_t$ is the softmax activation of $e_t$, producing a vector “alignment score” weighting the importance of the individual parts of the batched input sequences.
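Equations (4) and (5) can be traced with a small pure-Python sketch. It follows the equations as written, with two assumptions made for the example: the bias is a single scalar `b`, and the softmax is taken across time-steps. The inputs `xs` stand for whatever sequence is fed to the attention layer. This is not the implementation of the third-party attention library used in the study, and the function names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear(u, W, v):
    """Compute u^T W v for plain Python lists."""
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def attention_weights(xs, W_a, b):
    """Eq. (4): e_t = sigmoid(x_t^T W_a x_{t-1} + b); Eq. (5): a = softmax(e)."""
    scores = [sigmoid(bilinear(xs[t], W_a, xs[t - 1]) + b)
              for t in range(1, len(xs))]    # t starts at 1 so that x_{t-1} exists
    return softmax(scores)


# three time-steps with two features each (made-up numbers)
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = attention_weights(xs, identity, b=0.0)
```

The returned weights are positive and sum to one, so they can be read as the relative importance assigned to each time-step. With an all-zero weight matrix every step receives the same weight, which is a quick sanity check that the weighting comes from the learned matrix alone.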
4 Data
In this section, we discuss the characteristics of the data for which the proposed architecture is advantageous as well as the datasets used in our experiments.
4.1 Data Characteristics
Sequential Nature: The data to be modeled must be time-series data, continuous and in-order, sampled at reasonably regular rates, with dependencies through time. For example, environmental data such as barometric pressure, air moisture, and current temperature can be used in the prediction of future temperature. Gathering data such as environmental data, process control data, physio-metric data, biometric data, etc. can be done continuously, in order, and at regular intervals, and is of high value to many classification problems.
Natural Order: The data to be modeled must be reasonably natural and not artificially staged into discrete, disparate groups. For example, the data cannot be EEG data in ordered experimental events such as hearing a noise on the left/right, or a vision event on the left/right [20]. Since the events (auditory or visual stimulus) in such a dataset follow a predetermined pattern scripted by the researchers, the model very quickly learns the experimental design pattern and not the EEG signal characteristics that are associated with the stimulus type. This leads to dramatic over-fitting and no generalizability. It should be noted this does not apply to data collected experimentally in which purposeful natural randomness is simulated with uneven events.
Temporal Event Classification: The classifications to be modeled must be temporal events through time, and not single-point, discrete classifications. In other words, the events being predicted are things that happen over time continuously. For example, predicting a human fall based on smart-device accelerometer readings. The movements leading up to a fall can be running, walking, standing, etc. followed by a fall which happens over time with a series of time-steps including the initial falling period, striking the ground, remaining in the fall position, recovery, and then back to some non-falling activity.
Footnote 1: pypi.org/project/keras-self-attention/.
4.2 Data Sets
Three different datasets were used in our experiments.
SmartFall: The data set consists of raw (x, y, z) accelerometer readings representing activities of daily living (ADLs) such as walking and running with falls interspersed [13].
MobiAct: The data set consists of raw (x, y, z) accelerometer readings with various ADLs (jogging, walking upstairs, falls, etc.) recorded and labeled [26].
Occupancy Data: The data set consists of recorded ambient features of an enclosed space (temperature, humidity, light, CO2, and humidity ratio) and the associated event label that space is occupied or not occupied for some period of time [2].
In our experiments, the SmartFall and MobiAct data are not pre-processed other than to concatenate various subjects together into a single training, test, and validation set. Conversely, the original SmartFall study, and especially the MobiAct study, both do extensive pre-processing and feature extraction. The occupancy data is not pre-processed and is taken as-is, in temporal order, in both our study and the cited work. However, the cited study does extensive statistical analysis to achieve the optimal model and feature set whereas our technique simply uses the data as-is with the complete feature set.
5 Models
Four models were used to demonstrate the effectiveness of the enhancements discussed here. All models are built using TensorFlow 2.0 with Keras and the third-party library mentioned above for achieving attention models.
5.1 Architectures
Model 1: Vanilla LSTM: This model is a typical LSTM deep-learning model and consists of an LSTM input layer, a dense layer wrapped in a time-distributed layer, another dense layer, and an output classifier. The LSTM return_sequences parameter is set to true, which enables the complete LSTM hidden layer sequences to be sent forward to the time-distributed layer as shown in Fig. 1a. The time-distributed wrapper allows each set of hidden layer sequences to be applied to individual identical copies of the first dense layer. This conforms to the idea that we want to capture and train on all hidden states equally, and not on just the resulting context vector of the hidden states. This is also analogous to the next model (Fig. 1b), wherein return_sequences is required to implement the attention layer, and it allows for a consistent comparison between models. The output of the time-distributed dense layer is forwarded to the subsequent dense layer, and finally to the output layer. It is assumed the reader is familiar with such models [9].
Fig. 1. The figure shows the architectures of two networks designed for sequence classification.
Model 2: LSTM with Attention: This model replaces the time-distributed dense layer with an attention layer. The return_sequences parameter is set to true, enabling the complete hidden layer sequences to be sent forward to the attention layer where they are processed similarly to the previously explained encoder/decoder model and the vanilla LSTM model (see Fig. 1b).
Model 3: Stateful LSTM: This model is architecturally identical to the vanilla LSTM (Fig. 1a); however, the learning algorithm is altered to maintain state as described in the section on stateful training. Both the return_sequences and stateful parameters are set to true. The state is reset at the end of each epoch as described in Algorithm 1.
Model 4: Stateful LSTM with Attention: This model utilizes the TensorFlow functional API and uses both stateful training and attention in parallel layers, which are then merged and fed forward to a common dense layer. In this model, each “side” of the network is trained according to its architecture as described in the previous two models, respectively (Fig. 2).
Fig. 2. Stateful LSTM with attention
5.2 Hyperparameters
In each case, the models were tuned with the number of nodes, time-steps, and epochs that performed the best for the dataset at hand, so that the best performance of each was measured both against each other and against the existing published work. These parameters were selected in a grid search pattern varying hidden layer nodes, time-steps, and the number of epochs in all combinations until the optimal parameters were discovered for each model. Figure 3 shows the typical Stateful LSTM w/Attention summary. This summary, together with the following general parameter ranges, should be sufficient to reproduce all models.
Hidden Layer Nodes were selected between 100 and 300, with an increasing number needed from models 1 to 4, in order, as described in the architectures sub-section.
Time-Steps were chosen to be 40 in the case of an attention model (models 2 and 4, as described in the architectures sub-section) and 200 in the case of a non-attention model.
Epochs were chosen between 120 and 35 in generally decreasing numbers from models 1 to 4, as described in the architectures sub-section. This is especially notable in that as the number of nodes increased from model to model, epochs decreased dramatically.
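The grid search over nodes, time-steps, and epochs described above can be sketched as follows. The grid points below are assumptions picked from within the stated ranges, and `toy_score` is a hypothetical placeholder for "train the model with these settings and return its validation accuracy"; neither reflects the exact values or procedure of the study.

```python
from itertools import product


def grid_search(evaluate, nodes_grid, steps_grid, epochs_grid):
    """Try every (nodes, time-steps, epochs) combination and keep the best."""
    best_params, best_score = None, float("-inf")
    for nodes, steps, epochs in product(nodes_grid, steps_grid, epochs_grid):
        score = evaluate(nodes, steps, epochs)
        if score > best_score:             # ties keep the first combination found
            best_params, best_score = (nodes, steps, epochs), score
    return best_params, best_score


def toy_score(nodes, steps, epochs):
    """Placeholder objective that peaks at 200 nodes, 40 steps, 35 epochs."""
    return -abs(nodes - 200) - abs(steps - 40) - abs(epochs - 35)


best_params, best_score = grid_search(
    toy_score,
    nodes_grid=[100, 200, 300],
    steps_grid=[40, 200],
    epochs_grid=[35, 120],
)
```

Exhaustive search is affordable here because only three parameters are varied over short lists; the number of training runs is the product of the list lengths.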
Fig. 3. Typical stateful LSTM with attention model used in the study.
6 Experiments and Results
We first present the experimental results of comparing the four different architectures studied in this work (i.e. Vanilla LSTM, LSTM w/Attention, Stateful LSTM, and Stateful LSTM w/Attention) against each other. We show these results per data set, including accuracy, precision, recall, and F1 scores. Figure 4 shows the bar graph of accuracy per data set. Finally, we compare the results of our best model (Stateful LSTM w/Attention) with the results obtained by previously published work on the same datasets.
6.1 Model-to-Model and Model-to-Study Comparisons
Tables 1 through 7 show the results of optimally training each model on each dataset. Tables 8, 9, 10, 11, and 12 compare the results of the best models studied here (measured by accuracy) with the results from the cited studies from which each data set was acquired.

Table 1. SmartFall fall detection results
SmartFall    LSTM   Attn   State   Attn+State
Accuracy     .939   .946   .958    .960
Precision    .687   .777   .828    .857
Recall       .824   .809   .844    .847
F1           .750   .793   .836    .852
ROC AUC      .912   .941   .963    .974
PR AUC       .819   .827   .859    .893

Table 2. MobiAct: Fall detection results
MobiAct - Fall   LSTM   Attn   State   Attn+State
Accuracy         .929   .936   .945    .952
Precision        .814   .799   .929    .941
Recall           .871   .912   .847    .864
F1               .841   .852   .886    .901
ROC AUC          .960   .966   .990    .990
PR AUC           .970   .933   .960    .966

Table 3. MobiAct: Jogging detection results
MobiAct - Jogging   LSTM   Attn   State   Attn+State
Accuracy            .963   .970   .970    .972
Precision           .991   .990   .990    .988
Recall              .969   .977   .978    .981
F1                  .980   .984   .984    .985
ROC AUC             .973   .980   .982    .965
PR AUC              .986   .996   .997    .991

Table 4. MobiAct: Detecting walking down stairs
MobiAct - Stairs down   LSTM   Attn   State   Attn+State
Accuracy                .919   .943   .941    .948
Precision               .949   .960   .967    .969
Recall                  .929   .953   .944    .953
F1                      .939   .957   .955    .961
ROC AUC                 .950   .965   .968    .968
PR AUC                  .968   .955   .965    .956

Table 5. MobiAct: Detecting walking up stairs
MobiAct - Stairs up   LSTM   Attn   State   Attn+State
Accuracy              .900   .919   .926    .933
Precision             .944   .953   .973    .975
Recall                .919   .935   .928    .935
F1                    .931   .944   .950    .955
ROC AUC               .946   .980   .964    .973
PR AUC                .976   .996   .979    .989

Table 6. Detecting occupancy of an enclosed space - Door closed
Occupancy 1   LSTM   Attn   State   Attn+State
Accuracy      .978   .961   .978    .980
Precision     .998   .996   .998    .999
Recall        .944   .903   .942    .948
F1            .907   .947   .969    .973
ROC AUC       .990   .991   .994    .990
PR AUC        .990   .980   .986    .977

Table 7. Detecting occupancy of an enclosed space - Door open
Occupancy 2   LSTM   Attn   State   Attn+State
Accuracy      .925   .955   .948    .970
Precision     .778   .957   .922    .993
Recall        .860   .850   .840    .890
F1            .817   .901   .879    .939
ROC AUC       .984   .993   .984    .993
PR AUC        .929   .965   .959    .972

Fig. 4. Model to model accuracy comparison

The first model-to-study comparison presented is the SmartFall study, which used a combination of (x, y, z) accelerometer readings, derived features, and post-processing labels into categories of events as fall or not fall. Table 8 shows our results using only raw (x, y, z) accelerometer readings as input and no post-processing. Table 9 shows our results when post-processing is applied similar to the SmartFall study. Both results are compared to the deep learning model presented in the SmartFall study. Also presented are the MobiAct fall detection results from our experiments; however, the MobiAct study did not include fall detection. The results of our MobiAct fall detection experiments are presented in Table 10 as a comparison to both our SmartFall results and the original SmartFall study results. This is presented simply as a reinforcement of the overall results of our models in a similar activity, with a different but comparable data set. Again, no pre-processing of our data was done, and we use post-processing similar to SmartFall for comparison. Table 11 shows the MobiAct results for jogging detection, walking downstairs, and walking upstairs. In each case, the input data for our models was raw
Enhancing LSTM Models
231
accelerometer readings. Conversely, the input to the “multilayer perceptron” in the MobiAct study was a series of complex derived features that was termed in the study the Optimal Feature Set (OFS). This is notable in that we achieve results that in two out of three cases are superior. In the third case, our results have a lower accuracy score but are comparably close and notable given the difference in pre-processing effort. Table 8. SmartFall: Stateful LSTM w/Attn versus study without post processing SmartFall w/o post processing Attn State Study Accuracy .960
.850
Precision .857
.770
Recall
1.0
.847
Table 9. SmartFall: Stateful LSTM w/Att versus study with post processing SmartFall w/Post processing Attn State Study Accuracy .995
.850
Precision 1.00
.770
Recall
1.0
.989
Table 10. Stateful LSTM w/Att SmartFall, Stateful LSTM w/Att MobiAct, SmartFall comparison MobiAct fall data versus SmartFall Attn State SmartFall Attn State MobiAct Smart-Fall Study Accuracy
.995
.984
Precision 1.0 Recall
.850
.968
.989
.770
1.0
1.0
Table 11. MobiAct detecting ADLs versus study

MobiAct data    Jogging                 Stairs-D                Stairs-U
                Attn+State    Study     Attn+State    Study     Attn+State    Study
Accy.           .972          .996      .948          .915      .933          .925
A. Katrompas and V. Metsis

Table 12. Occupancy detection versus study

Occupancy detection data    Test 1                  Test 2
                            Attn+State    Study     Attn+State    Study
Accy.                       .980          .979      .970          .993
Table 12 shows occupancy detection compared to the cited study. This comparison is notable in that the cited study did not use a neural network model, but rather several statistical models. Our results are compared against the best of those statistical model results (Linear Discriminant Analysis). While our best results were similar to the cited study’s best results, several points make our approach novel and valuable. Again, our models used no pre-processing or pre-selection of inputs, whereas the cited study was in fact a study of the statistically “best” input selection. In other words, we achieved slightly better results on test set 1 (.980 versus .979) and slightly worse but comparable results on test set 2 (.970 versus .993) by simply using the entire feature set, without the need for extensive comparative statistics. This comparison is of value as a demonstration that our enhanced deep learning models achieve similar or better results than most of the statistical methods in the cited paper.
7 Discussion: Training Behavior
A notable and surprising effect on training became evident as the attention models were trained and studied. In each case, models that included attention resisted over-fitting, sometimes dramatically so. Even when trained well past the minimum achievable test error, the models did not exhibit over-fitting. This occurred in both the attention-only model and the stateful model with attention. Figures 5 through 7 show this effect. Figure 5 shows that as the standard LSTM model continued to train, the over-training effect became more and more pronounced. This was also observed in the stateful-only model (Fig. 6b). However, in the attention models (Fig. 6a and Fig. 7) the test error closely parallels the training error as the training error continues to decline. Even when both errors “flat-lined,” training and test error remained close and parallel. This was an unexpected result; upon closer inspection, however, it is seemingly intuitive. The purpose of attention mechanisms is to reduce noise (i.e. irrelevant information) and focus on the relatively “important” parts of the sequences. It seems intuitive that this should reduce over-fitting, in that the model has a more difficult time memorizing the training set and is instead constantly steered toward the important and predictive input sequences and feature relationships. However, this is only a preliminary hypothesis and requires further study and validation.
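To make the idea of "focusing on the important parts of the sequence" concrete, the following is a minimal sketch of dot-product attention pooling over a sequence of hidden states. It is a generic illustration, not the architecture used in this study: the array `H` merely stands in for per-time-step LSTM outputs, the query vector `q` stands in for a learned parameter, and all names and shapes are assumptions.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def attention_pool(hidden_states, query):
    """Weight each time step by its similarity to the query.

    hidden_states: (T, d) array, one row per time step.
    query: (d,) array.
    Returns the (d,) context vector and the (T,) attention weights.
    """
    scores = hidden_states @ query      # one score per time step
    weights = softmax(scores)           # non-negative, sums to 1
    context = weights @ hidden_states   # weighted sum of time steps
    return context, weights


rng = np.random.default_rng(0)
H = rng.normal(size=(6, 4))   # stand-in for LSTM outputs at 6 time steps
q = rng.normal(size=4)        # stand-in for a learned query vector
context, weights = attention_pool(H, q)
print(weights.round(3), context.shape)
```

Time steps with low scores receive weights near zero and contribute little to the context vector, which is one way to picture how attention can suppress irrelevant portions of a sequence.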
Fig. 5. MobiAct walking down stairs - Standard LSTM
Fig. 6. MobiAct walking down stairs - Attention versus statefulness
Fig. 7. MobiAct walking down stairs - Stateful LSTM with attention
8 Conclusions
Our study of LSTM model enhancements demonstrates that LSTM models enhanced with statefulness and attention are capable of equal or better classification results than many state-of-the-art models, most notably with far less pre-processing. This is an important finding in that pre-processing is not only cumbersome, it very often introduces human bias. With the enhancements presented here, it is possible to effectively process raw data into accurate temporal classification models. This is an important consideration when attempting to train models in real time and online while the models are in service, an area that warrants further study. In addition, this study demonstrates the benefits of attention mechanisms applied outside their typical domains (e.g. seq2seq text processing models) and re-examines the usefulness of RNN layers used in conjunction with attention for temporal classification. Furthermore, stateful training is an area gaining ground in the study of long-term pattern recognition, and these results support those efforts. These results suggest that attention mechanisms, specifically self-attention, can benefit from RNNs and vice versa, and that this is an area worthy of further investigation.
References
1. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
2. Candanedo, L.M., Feldheim, V.: Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Energy Build. 112, 12 (2015)
3. Cheng, J., Dong, L., Lapata, M.: Long short-term memory-networks for machine reading (2016)
4. De Jesús, O., Hagan, M.T.: Backpropagation through time for a general class of recurrent network. In: IJCNN 2001. International Joint Conference on Neural Networks. Proceedings (Cat. No. 01CH37222), vol. 4, pp. 2638–2643 (2001)
5. Dematos, G., Boyd, M.S., Kermanshahi, B., Kohzadi, N., Kaastra, I.: Feedforward versus recurrent neural networks for forecasting monthly Japanese yen exchange rates. Finan. Eng. Jpn. Markets 3, 59–75 (1996)
6. Gershenson, C.: Artificial neural networks for beginners (2003)
7. Graves, A., Jaitly, N., Mohamed, A.-R.: Hybrid speech recognition with deep bidirectional LSTM. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 273–278. IEEE (2013)
8. Hewamalage, H., Bergmeir, C., Bandara, K.: Recurrent neural networks for time series forecasting: current status and future directions. Int. J. Forecast. 37(1), 388–427 (2020)
9. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
10. Lin, Z., et al.: A structured self-attentive sentence embedding (2017)
11. Luong, M.-T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation (2015)
12. Masters, D., Luschi, C.: Revisiting small batch training for deep neural networks (2018)
13. Mauldin, T., Canby, M., Metsis, V., Ngu, A., Rivera, C.: Smartfall: a smartwatch-based fall detection system using deep learning. Sensors 18(10), 3363 (2018)
14. Mohajerin, N., Waslander, S.L.: State initialization for recurrent neural network modeling of time-series data. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2330–2337 (2017)
15. Moldovan, D., Anghel, I., Cioara, T., Salomie, I.: Time series features extraction versus LSTM for manufacturing processes performance prediction. In: 2019 International Conference on Speech Technology and Human-Computer Dialogue (SpeD), pp. 1–10 (2019)
16. Parikh, A.P., Täckström, O., Das, D., Uszkoreit, J.: A decomposable attention model for natural language inference (2016)
17. Paulus, R., Xiong, C., Socher, R.: A deep reinforced model for abstractive summarization (2017)
18. Qin, Y., Song, D., Chen, H., Cheng, W., Jiang, G., Cottrell, G.: A dual-stage attention-based recurrent neural network for time series prediction (2017)
19. Rahman, L., Mohammed, N., Al Azad, A.K.: A new LSTM model by introducing biological cell state. In: 2016 3rd International Conference on Electrical Engineering and Information Communication Technology (ICEEICT), pp. 1–6 (2016)
20. Rivet, B., Souloumiac, A., Attina, V., Gibert, G.: xdawn algorithm to enhance evoked potentials: Application to brain-computer interface. IEEE Trans. Biomed. Eng. 56(8), 2035–2043 (2009)
21. Squartini, S., Paolinelli, S., Piazza, F.: Comparing different recurrent neural architectures on a specific task from vanishing gradient effect perspective. In: 2006 IEEE International Conference on Networking, Sensing and Control, pp. 380–385 (2006)
22. Struye, J., Latré, S.: Hierarchical temporal memory and recurrent neural networks for time series prediction: an empirical validation and reduction to multilayer perceptrons. Neurocomputing 04, 396 (2019)
23. Tang, H., Glass, J.: On training recurrent networks with truncated backpropagation through time in speech recognition (2018)
24. Tomiyama, S., Kitada, S., Tamura, H.: On a new recurrent neural network and learning algorithm using time series and steady-state characteristic. In: IEEE SMC 1999 Conference Proceedings. 1999 IEEE International Conference on Systems, Man, and Cybernetics (Cat. No.99CH37028), vol. 1, pp. 478–483 (1999)
25. Vaswani, A., et al.: Attention is all you need (2017)
26. Vavoulas, G., Chatzaki, C., Malliotakis, T., Pediaditis, M., Tsiknakis, M.: The mobiact dataset: recognition of activities of daily living using smartphones. In: Proceedings of the International Conference on Information and Communication Technologies for Ageing Well and e-Health - Volume 1: ICT4AWE, (ICT4AGEINGWELL 2016), pp. 143–151. INSTICC. SciTePress (2016)
27. Wang, Y., Huang, M., Zhu, X., Zhao, L.: Attention-based LSTM for aspect-level sentiment classification. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 606–615. Association for Computational Linguistics, November 2016
28. Werbos, P.J.: Backpropagation through time: what it does and how to do it. Proc. IEEE 78(10), 1550–1560 (1990)
29. Xu, K., et al.: Show, attend and tell: Neural image caption generation with visual attention (2016)
30. Zeng, J., Ma, X., Zhou, K.: Enhancing attention-based LSTM with position context for aspect-level sentiment classification. IEEE Access 7, 20462–20471 (2019)
Domain Generalization Using Ensemble Learning
Yusuf Mesbah, Youssef Youssry Ibrahim, and Adil Mehood Khan
Machine Learning and Knowledge Representation Lab, Innopolis University, Republic of Tatarstan, Russian Federation
Abstract. Domain generalization is a sub-field of transfer learning that aims at bridging the gap between two different domains in the absence of any knowledge about the target domain. Our approach tackles the problem of a model’s weak generalization when it is trained on a single source domain. From this perspective, we build an ensemble model on top of base deep learning models trained on a single source to enhance the generalization of their collective prediction. The results achieved thus far have demonstrated promising improvements of the ensemble over any of its base learners.
Keywords: Neural networks · Ensemble learning · Domain generalization
1 Introduction
Ensemble learning is a method in supervised learning that combines multiple predictive models to get better and more robust predictions, which makes ensemble methods a natural choice when performance is of high importance. Regarding the number of classifiers in an ensemble, Bonab and Can (2016) demonstrated the law of diminishing returns in ensemble construction. Their theoretical framework shows that the highest accuracy is achieved by using the same number of independent component classifiers as class labels [6]. The theoretical basis of neural networks was proposed independently by Alexander Bain (1873) and William James (1890). Later on, McCulloch and Pitts (1943) built a mathematical model based on neural networks and called it “threshold logic”. After that, back-propagation was introduced by Rumelhart, Hinton, and Williams (1986). Over the following years, with scientific and technological advancements, neural network algorithms became more sophisticated and able to solve bigger and more challenging problems, including object recognition [15], anomaly detection [26,30], accident detection [4,7], action recognition [11,16,27], scene classification [25], hyperspectral image classification [1,2], machine translation [17,28], medical image analysis [10,12], etc.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 236–247, 2022. https://doi.org/10.1007/978-3-030-82193-7_15
Fig. 1. Domain generalization is the problem of transferring the knowledge from a source domain (such as SVHN cropped on the left) to a different target domain (such as MNIST on the right) to solve the same task, with the absence of any knowledge regarding the distributional shift in the feature space of the inputs.
Nowadays, deep learning (DL) and convolutional neural networks (CNNs) are widely used in everyday life. For example, modern smartphones offer authentication by facial recognition, and self-driving cars rely mainly on a combination of many deep CNNs to process road images. This increase in use raises the bar for computer vision systems to be more robust and stable. As useful as DL techniques are, deploying them in the real world exposes problems that we do not commonly encounter while working on toy datasets or training data in general. As powerful as deep CNNs are, they have a considerable shortcoming in that they are heavily dependent on the dataset used for training; this problem is also known as over-fitting. The problem at hand (called domain shift) arises mainly because the training data set (source domain) comes from a different distribution than the deployment data (target domain), resulting in a decrease in the model’s performance. Such a discrepancy can occur in real life from slight changes in variables such as image resolution or picture brightness. Domain Generalization (DG) is a sub-field of Transfer Learning (TL) that aims to solve this problem by combining multiple data sources to train a more resilient model in the hope of generalizing to unseen domains. DG assumes the existence of multiple sources of data that are used for the same task, and a target domain dataset that is harder to work with (i.e., harder to label and/or to collect). All domains share the same task but have different marginal distributions. DG is very closely related to Domain Adaptation (DA), which also aims at solving the domain shift problem, using one source domain and one target domain. DA can be approached in different ways depending on the existence of labels in the target domain: Supervised, Unsupervised, or Semi-Supervised. DG differs from DA in that we have access neither to the target data nor to its labels.
So, Domain Generalization aims at building a model that can generalize well to unseen domains rather than to a single known domain. Researchers have approached this problem in many ways. One traditional, yet very commonly used, technique is to treat it as an over-fitting problem and use regularisation techniques to help the (parametric) model generalize well [33]. Many techniques have proven useful in the case of deep neural networks, such as weight decay, dropout, batch normalization, and L1 and L2 regularization. Although these techniques have proven effective in helping the model generalize within the same data set and achieve higher test accuracy, they are not the most effective methods for Domain Generalization. In this paper, we deal with Domain Generalization in its broadest definition, where we generalize from a source domain to an unknown target domain. More specifically, we compare the performance of ensemble models with that of an individual Deep Neural Network on single-source domain generalization. Since ensemble models have shown an increase in accuracy in difficult learning scenarios, we investigate how much benefit ensemble models can give us when dealing with Domain Generalization problems. Accordingly, in this paper, we have implemented various ensemble models that consist of CNNs and of different traditional machine learning models. We have tested them on five different datasets and report our findings below.
2 Related Work
2.1 Ensemble Learning
Ensemble methods have been extensively researched. The main idea is to train multiple predictors for the same problem and merge their outputs to get better results. Ensemble methods have been commonly used in machine learning competitions such as ILSVRC, where many CNNs are trained and merged to improve performance [19,22,34]. One main difference between traditional model ensembling and our approach (when it comes to hyperparameter tuning) is the size of the models. Even though bigger models, and subsequently an ensemble of them, would perform better on the training data set, we preferred weaker models that still perform well on the training dataset (0.9+ accuracy) while providing much better generalization (more on that in Sect. 3.2).
2.2 Transfer Learning
Transfer learning (TL) in machine learning is the topic that explores how to store the knowledge gained while solving one problem and apply it to a different but related problem. For example, the knowledge gained about recognizing cars could apply when trying to recognize trucks [29]. This is useful for decreasing the training time of the models and helps if the target dataset is small. Similarly, Semi-Supervised classification [3,20,32] tackles the problem of the labelled data not being large enough to build a strong classifier, by utilizing a large amount of data together with a small number of labels. For example, Zhu and Wu [35] discussed how to deal with noisy labels, and Yang et al. considered cost-sensitive learning [31]. Semi-Supervised classification assumes that the labelled and unlabelled data come from the same domain, while Transfer Learning allows the domains and tasks used in training and testing to be different [24].
2.3 Domain Generalization
Domain Generalization is less explored as a topic than Domain Adaptation [5], even though access to multiple source domains allows for more innovative and creative techniques. These techniques mainly fall into two streams:
i. Combining the source domains in a way that helps the model learn domain-invariant features that can generalize well to unseen domains. For example, one state-of-the-art method tries to learn domain-agnostic representations by re-arranging the input images and asking the network to solve them as a jigsaw puzzle. Although it has proven very effective, it faces a risk when different classes share the same sub-components but link them together differently.
ii. Measuring the similarity between each target image and the potential source domains and then using this information, later on, to either combine classifiers or choose a certain classifier to use for this sample, as in BSF [22].
3 Methods
Fig. 2. Base CNN used to learn CIFAR10 for the ensembles
We will be comparing three different ensembles with a single Neural Network to evaluate which one performs better on various generalization problems. In supervised machine learning, there is some dataset D that consists of input data points, where every data point x has a class label y, with the assumption that there exists a function f that maps the data point to the class label as y = f(x). The purpose of a learner is to search through a space of possible functions, called hypotheses, to find the function h that best approximates f and is used to assign the label y to x. Such a function is called a classifier. Learners that use a single hypothesis approximation for predictions can suffer from three main problems [9]:
i. The statistical problem arises when the learner is searching a hypothesis space that is too big for the amount of training data available. In this case, there might be two or more hypotheses that get the same accuracy on the training data but perform differently when predicting future data. An ensemble can reduce the risk of this problem by taking the vote of different learners with different hypotheses, as this reduces the overall variance. In [13], the authors illustrated the variance reduction property of an ensemble system.
ii. The computational problem arises when the learner is not guaranteed to find the best hypothesis and can get stuck in a local minimum, as is the case with neural networks and decision tree algorithms. As with the statistical problem, an ensemble can help mitigate the computational problem, because the weighted combination of several different local minima can help avoid choosing the wrong local minimum.
iii. The representational problem arises when the hypothesis space does not contain a good approximation of the true function f. An ensemble can help in some cases, as a weighted vote of the hypotheses can expand the hypothesis space and result in a better approximation of f.
The aforementioned problems can become even more severe when there is a domain gap between the training (source) data and the test (target) data. Usually, this problem is alleviated by training a model on multiple, different source domains. However, if there is a single domain to learn from, generalization can become extremely difficult. Therefore, it is interesting to see whether an ensemble model that uses a single source domain but benefits from having different base learners could help improve generalization performance. If so, what kind of ensemble model would perform better? Accordingly, our experiments are tailored to answer these questions. For every experiment conducted in our paper, we have a single source dataset $D_s$ and a single target dataset $D_t$ from a different domain, and N CNN models $(m_1, m_2, \ldots, m_N)$ (similar to Fig. 2) that are trained independently on $D_s$.
Then, they will be tested on the target domain $D_t$, giving us their predictions $(y_1, y_2, \ldots, y_N)$; by taking the average of their outputs, $\bar{y}$, we get our first ensemble (average ensemble, denoted by EnA). For the second ensemble, with the meta learner (EnM), we take the models’ outputs and train a layer of perceptrons as a meta learner to give us a weighted average of the models’ outputs (see Fig. 3). The third ensemble, with meta learner v2 (EnM2), is similar to the previous one, the only difference being that its meta learner is a multi-layer perceptron. For the last ensemble, we combine different traditional ML algorithms (Random Forest (RF), Support Vector Machines (SVM), and Logistic Regression (LR)) into an average ensemble (EnT); see Fig. 3. Lastly, we add to the comparison a single huge CNN (HCNN) that has as many trainable parameters as all the CNNs in the ensemble combined, to see how a different use of trainable parameters might affect the results.
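The difference between the average ensemble and a weighted combination of base-model outputs can be sketched as follows. This is only an illustration under stated assumptions: the probability values are made up, and the weights are fixed by hand, whereas in the EnM setting described above they would be learned by the meta learner. The function names are hypothetical.

```python
import numpy as np


def average_ensemble(probs):
    """Average class probabilities over N base models.

    probs: (N, B, C) array for N models, B samples, C classes.
    Returns a (B, C) array.
    """
    return probs.mean(axis=0)


def weighted_ensemble(probs, weights):
    """Weighted average with one non-negative weight per base model."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalise so the weights sum to 1
    return np.tensordot(w, probs, axes=1)  # contracts the model axis -> (B, C)


# Made-up outputs of 3 base models on 2 samples with 2 classes.
probs = np.array([
    [[0.6, 0.4], [0.2, 0.8]],
    [[0.7, 0.3], [0.6, 0.4]],
    [[0.5, 0.5], [0.3, 0.7]],
])

ena_labels = average_ensemble(probs).argmax(axis=1)
# Hand-picked weights standing in for what a meta learner might learn.
enm_labels = weighted_ensemble(probs, [0.1, 0.8, 0.1]).argmax(axis=1)
print(ena_labels, enm_labels)
```

On the second sample the plain average follows the majority of the base models, while the weighted combination follows the heavily weighted second model, which shows how a meta learner can change the ensemble decision.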
Fig. 3. General ensemble model
3.1 Data Preparation
There will be two datasets from different domains; one will be used for training and hyperparameter tuning, and the other for testing how the ensemble performs on a different domain. As for data preparation, a different data augmentation technique is applied to the training dataset for every neural network in the ensemble, to increase the variance in the training data across networks.
3.2 Experiments
For the first experiment, we use three digit datasets: MNIST [21], USPS [14] and SVHN cropped [23] (henceforth referred to as SVHN). MNIST and USPS are composed of white handwritten digits on a black background, but USPS is small and zoomed to fill the frame, while MNIST is large and padded. SVHN, on the other hand, is composed of colored images on a colored background (see Fig. 1). Moreover, the digits in SVHN are not perfectly isolated; there can be more than one digit in an image, in which case the label is the middle digit. We train on one dataset and test on another, for every possible pairing of the three datasets. For the second experiment, we use the natural object datasets CIFAR10 [18] and STL10 [8]. CIFAR10 is a colored dataset that consists of 10 natural object classes: 5 animals and 5 vehicles. STL10 has the same setup, except that CIFAR10 has images of frogs and STL10 does not, while STL10 has images of monkeys and CIFAR10 does not. We therefore removed the labels that are not shared, leaving 9 labels in common between the two datasets. In experiments involving USPS, the other datasets were resized to 16 × 16 to match USPS. In all other experiments, all the datasets were re-scaled to 32 × 32 pixels. SVHN was converted to gray-scale to match MNIST [9].
3.3 Hyperparameter Tuning
For every experiment, there were two datasets: a source dataset S and a target dataset T. The source dataset is further divided into a training part and a validation part, which we call S_train and S_val, respectively.
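The source split described above, together with the idea from the data preparation step of giving each base network its own augmentations, can be sketched as follows. The validation fraction, the augmentation names, and the subset size are all assumptions made for illustration; the paper does not state them.

```python
import itertools
import random

# Hypothetical augmentation names; the actual set is not listed in the text.
AUGMENTATIONS = ["rotate", "shift", "zoom", "shear"]


def split_source(indices, val_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into (train, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def unique_augmentation_subsets(augmentations, n_models, subset_size=2):
    """Give every base model a distinct proper subset of the augmentations."""
    subsets = list(itertools.combinations(augmentations, subset_size))
    if n_models > len(subsets):
        raise ValueError("not enough distinct subsets for the requested models")
    return subsets[:n_models]


s_train, s_val = split_source(range(100))
per_model_augs = unique_augmentation_subsets(AUGMENTATIONS, n_models=5)
for i, augs in enumerate(per_model_augs, start=1):
    print(f"model {i}: {augs}")
```

Using combinations of a fixed size is just one simple way to guarantee that no two base models see the same augmentation subset; any scheme that yields distinct subsets would serve the same purpose.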
CNNs and Ensemble Meta Classifier. To train each CNN (Fig. 2), we used S_train for training and S_val for validation. To achieve independence between the base models, we define a set of augmentation types A = {a_1, a_2, ..., a_n}, and every model i in the ensemble is trained using a unique subset of augmentations A_i ⊂ A. In contrast, the ensemble meta classifier and the single CNN with the same number of parameters as the ensemble were trained using the full set of augmentations A.
Traditional ML Algorithms. Similarly, we used S_train for training and S_val to tune parameters such as the number of trees in a random forest.

Table 1. Results for object recognition experiments: (1) from CIFAR10 as the source domain to STL10 as the target domain, (2) from STL10 as the source domain to CIFAR10 as the target domain.

Model      CIFAR10 to STL10             STL10 to CIFAR10
           S_train   S_val    T         S_train   S_val    T
model 1    0.987     0.886    0.706     0.721     0.597    0.460
model 2    0.978     0.879    0.675     0.944     0.664    0.557
model 3    0.978     0.877    0.686     0.903     0.636    0.504
model 4    0.976     0.868    0.684     0.984     0.641    0.509
model 5    0.969     0.888    0.696     0.818     0.633    0.515
EnA        0.99      0.903    0.724     0.964     0.681    0.558
EnM        0.99      0.904    0.727     0.964     0.684    0.559
EnM2       0.99      0.903    0.725     0.973     0.68     0.563
HCNN       0.971     0.878    0.683     0.466     0.423    0.358
EnT        0.958     0.487    0.366     0.709     0.371    0.285
RF         1.0       0.498    0.373     1.0       0.459    0.305
SVM        0.081     0.077    0.091     0.193     0.184    0.177
LR         0.464     0.429    0.305     0.654     0.359    0.281

4 Results
Tables 1, 2, 3, and 4 show the accuracy scores for every model on every problem. The traditional ML models perform poorly, as expected for models trained and tested directly on image datasets. The Random Forest model achieves high accuracy on the training set, because it is composed of many decision trees and can easily over-fit the training data, but its accuracy on the target domain is very low. Moreover, while tuning the hyperparameters for the random forest, we noticed that increasing the number of trees raises the training and validation accuracy until the training accuracy reaches 1.0, at which point the validation accuracy starts to plateau. Another observation is that the CNN-based ensembles (EnA, EnM, EnM2) generally give better accuracy in both domains, such as in CIFAR-to-STL (Table 1), where they reached 99% accuracy on the training set and improved over the best individual base model in the target domain by about 2 percentage points (from 70.6% to 72.7%). A similar outcome was observed in the SVHN-to-USPS experiment (Table 4). On the other hand, in some cases an ensemble is slightly less accurate than its best base model, such as in the USPS-to-MNIST experiment (Table 3), where the best-performing base model reached 85.9% on the target domain while EnA and EnM reached only 85.2%. This is because the other models in the ensemble are significantly less accurate than the best model. However, the ensembles generally still have higher accuracy than the mean accuracy of their base models. For some of the experiments, we do not see good generalization, such as in the MNIST-to-SVHN experiment (Table 2), which is due to the huge domain gap between the two datasets. Even though the models achieve 95%+ accuracy on training, they get very poor results on the target domain, and in such cases ensemble methods do not help much.

Table 2. Results for digit recognition experiments: (1) from MNIST as the source domain to SVHN as the target domain, (2) from SVHN as the source domain to MNIST as the target domain.

Model      MNIST to SVHN                SVHN to MNIST
           S_train   S_val    T         S_train   S_val    T
model 1    0.971     0.973    0.069     0.934     0.936    0.647
model 2    0.971     0.973    0.069     0.933     0.935    0.648
model 3    0.978     0.978    0.069     0.934     0.936    0.649
model 4    0.974     0.976    0.069     0.933     0.935    0.649
model 5    0.967     0.968    0.07      0.934     0.936    0.649
EnA        0.98      0.98     0.069     0.934     0.936    0.649
EnM        0.979     0.979    0.069     0.85      0.842    0.527
EnM2       0.979     0.978    0.069     0.933     0.935    0.649
HCNN       0.992     0.991    0.072     0.933     0.935    0.649
EnT        0.99      0.97     0.104     0.332     0.27     0.093
RF         1.0       0.971    0.068     1.0       0.718    0.366
SVM        0.182     0.188    0.068     0.069     0.064    0.183
LR         0.935     0.927    0.108     0.265     0.242    0.053
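The comparison of an ensemble against the best and the mean of its base models can be checked directly from the reported target accuracies. The short sketch below does this arithmetic for the USPS-to-MNIST target column of Table 3; the variable names are illustrative only.

```python
# Target-domain accuracies for USPS to MNIST, as reported in Table 3.
base_target = [0.776, 0.85, 0.794, 0.801, 0.859]   # model 1 .. model 5
ensembles = {"EnA": 0.852, "EnM": 0.852, "EnM2": 0.864}

best_base = max(base_target)
mean_base = sum(base_target) / len(base_target)

# For each ensemble: (gain over the best base model, gain over the mean).
gains = {
    name: (acc - best_base, acc - mean_base)
    for name, acc in ensembles.items()
}

print(f"best base = {best_base:.3f}, mean base = {mean_base:.3f}")
for name, (vs_best, vs_mean) in gains.items():
    print(f"{name}: {vs_best:+.3f} vs best, {vs_mean:+.3f} vs mean")
```

The mean base accuracy works out to 0.816, so every ensemble in this column sits well above the average base model even where it does not exceed the single best one.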
Table 3. Results for digit recognition experiments: (1) from USPS as the source domain to MNIST as the target domain, (2) from MNIST as the source domain to USPS as the target domain.

Model      USPS to MNIST                MNIST to USPS
           S_train   S_val    T         S_train   S_val    T
model 1    0.996     0.976    0.776     0.998     0.994    0.968
model 2    0.999     0.981    0.85      0.997     0.994    0.888
model 3    1.0       0.98     0.794     0.996     0.993    0.958
model 4    1.0       0.975    0.801     0.998     0.994    0.919
model 5    0.999     0.981    0.859     0.996     0.993    0.973
EnA        0.999     0.982    0.852     0.998     0.995    0.962
EnM        1.0       0.982    0.852     0.998     0.995    0.962
EnM2       0.999     0.982    0.864     0.998     0.995    0.957
HCNN       1.0       0.977    0.885     0.995     0.993    0.904
EnT        0.999     0.942    0.112     0.962     0.945    0.113
RF         1.0       0.941    0.098     1.0       0.968    0.118
SVM        0.999     0.915    0.152     0.921     0.918    0.194
LR         0.301     0.308    0.372     0.803     0.798    0.084
Table 4. Results for digit recognition experiments: (1) from USPS as the source domain to SVHN as the target domain, (2) from SVHN as the source domain to USPS as the target domain.

Model      USPS to SVHN                 SVHN to USPS
           S_train   S_val    T         S_train   S_val    T
model 1    0.998     0.974    0.115     0.957     0.952    0.707
model 2    0.998     0.976    0.138     0.954     0.948    0.675
model 3    0.999     0.974    0.11      0.957     0.947    0.714
model 4    0.999     0.973    0.144     0.955     0.947    0.719
model 5    0.998     0.972    0.123     0.951     0.952    0.737
EnA        0.999     0.977    0.125     0.961     0.957    0.756
EnM        0.998     0.981    0.159     0.961     0.957    0.755
EnM2       0.999     0.977    0.125     0.96      0.957    0.748
HCNN       1.0       0.979    0.08      0.985     0.962    0.604
EnT        0.995     0.933    0.115     0.334     0.284    0.11
RF         1.0       0.94     0.068     1.0       0.694    0.465
SVM        0.994     0.947    0.148     0.124     0.125    0.167
LR         0.301     0.308    0.068     0.261     0.239    0.06
5 Conclusion
By providing a different data augmentation for each base learner, we improved generalization from a single source domain to an unseen target domain in most of our experiments. This supports the usefulness of our ensemble approach, which is also among the simplest methods for domain generalization. Moreover, it can utilize weak models to obtain a more robust model. Note, however, that training time grows with the number of base models. For future research, we can explore the effectiveness of ensemble methods when using multiple source domains, how to use ensemble methods in domain adaptation, and how best to exploit access to the target domain when it is available. We will also explore how ensemble methods can be incorporated into current approaches to the domain adaptation and generalization problems.
References 1. Ahmad, M., Khan, A.M., Mazzara, M., Distefano, S.: Multi-layer extreme learning machine-based autoencoder for hyperspectral image classification. In: VISIGRAPP (4: VISAPP), pp. 75–82 (2019) 2. Ahmad, M., Khan, A.M., Mazzara, M., Distefano, S., Ali, M., Sarfraz, M.S.: A fast and compact 3-D CNN for hyperspectral image classification. IEEE Geosci. Remote Sens. Lett. (2020) 3. Baralis, E., Chiusano, S., Garza, P.: A lazy approach to associative classification. IEEE Trans. Knowl. Data Eng. 20(2), 156–171 (2007) 4. Batanina, E., Bekkouch, I.E.I., Youssry, Y., Khan, A., Khattak, A.M., Bortnikov, A.: Domain adaptation for car accident detection in videos. In: 2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA), pp. 1–6. IEEE (2019) 5. Bekkouch, I.E.I., Youssry, Y., Gafarov, R., Khan, A., Khattak, A.M.: Triplet loss network for unsupervised domain adaptation. Algorithms 12(5), 96 (2019) 6. Bonab, H., Can, F.: A theoretical framework on the ideal number of classifiers for online ensembles in data streams, pp. 2053–2056 (2016) 7. Bortnikov, M., Khan, A., Khattak, A.M., Ahmad, M.: Accident recognition via 3D CNNs for automated traffic monitoring in smart cities. In: Arai, K., Kapoor, S. (eds.) CVC 2019. AISC, vol. 944, pp. 256–264. Springer, Cham (2020). https:// doi.org/10.1007/978-3-030-17798-0 22 8. Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223. JMLR Workshop and Conference Proceedings (2011) 9. Dietterich, T.G., et al.: Ensemble learning (2002) 10. Dobrenkii, A., Kuleev, R., Khan, A., Rivera, A.R., Khattak, A.M.: Large residual multiple view 3D CNN for false positive reduction in pulmonary nodule detection. In: 2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pp. 1–6. IEEE (2017) 11. 
Gavrilin, Y., Khan, A.: Across-sensor feature learning for energy-efficient activity recognition on mobile devices. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE (2019)
246
Y. Mesbah et al.
12. Gusarev, M., Kuleev, R., Khan, A., Rivera, A.R., Khattak, A.M.: Deep learning models for bone suppression in chest radiographs. In: 2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pp. 1–7. IEEE (2017) 13. Hansen, L.K., Salamon, P.: Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell. 12(10), 993–1001 (1990) 14. Hull, J.J.: A database for handwritten text recognition research. IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 550–554 (1994) 15. Khan, A., Fraz, K.: Post-training iterative hierarchical data augmentation for deep networks. In: Advances in Neural Information Processing Systems, vol. 33 (2020) 16. Khan, A.M., Lee, Y.-K., Lee, S., Kim, T.-S.: Accelerometer’s position independent physical activity recognition system for long-term activity monitoring in the elderly. Med. Biol. Eng. Comput. 48(12), 1271–1279 (2010) 17. Khusainova, A., Khan, A., Rivera, A.R.: Sart-similarity, analogies, and relatedness for tatar language: New benchmark datasets for word embeddings evaluation. arXiv preprint arXiv:1904.00365 (2019) 18. Krizhevsky, A., Nair, V., Hinton, G.: Cifar-10 (Canadian institute for advanced research) 19. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012) 20. Kuncheva, L.I., Rodriguez, J.J.: Classifier ensembles with a random linear oracle. IEEE Trans. Knowl. Data Eng. 19(4), 500–508 (2007) 21. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010) 22. Mancini, M., Bul` o, S.R., Caputo, B., Ricci, E.: Best sources forward: domain generalization through source-specific nets (2018) 23. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning (2011) 24. Pan, S.J., Yang, Q.: A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 22(10), 1345–1359 (2010) 25. 
Protasov, S., Khan, A.M., Sozykin, K., Ahmad, M.: Using deep features for video scene detection and annotation. Sig. Image Video Process. 12(5), 991–999 (2018) 26. Rivera, A.R., Khan, A., Bekkouch, I.E.I., Sheikh, T.S.: Anomaly detection based on zero-shot outlier synthesis and hierarchical feature distillation. IEEE Trans. Neural Networks Learn. Syst. (2020) 27. Sozykin, K., Protasov, S., Khan, A., Hussain, R., Lee, J.: Multi-label classimbalanced action recognition in hockey videos via 3D convolutional neural networks. In: 2018 19th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), pp. 146–151. IEEE (2018) 28. Valeev, A., Gibadullin, I., Khusainova, A., Khan, A.: Application of low-resource machine translation techniques to Russian-tatar language pair. arXiv preprint arXiv:1910.00368 (2019) 29. West, J., Ventura, D., Warnick, S.: Spring research presentation: a theoretical foundation for inductive transfer. Brigham Young Univ. Coll. Phys. Math. Sci. 1(08) (2007) 30. Yakovlev, K., Bekkouch, I.E.I., Khan, A.M., Khattak, A.M.: Abstraction-based outlier detection for image data. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1250, pp. 540–552. Springer, Cham (2021). https://doi.org/ 10.1007/978-3-030-55180-3 40 31. Yang, Q., Ling, C., Chai, X., Pan, R.: Test-cost sensitive classification on data with missing values. IEEE Trans. Knowl. Data Eng. 18(5), 626–638 (2006)
Domain Generalization Using Ensemble Learning
247
32. Yin, X., Han, J., Yang, J., Yu, P.S.: Efficient classification across multiple database relations: a crossmine approach. IEEE Trans. Knowl. Data Eng. 18(6), 770–783 (2006) 33. Zhang, C., Bengio, S., Recht, B., Vinyals, O., Hardt M.: Understanding deep learning requires rethinking generalization (2017) 34. Zhou, Z.-H.: Ensemble Methods: Foundations and Algorithms. CRC Press, Boca Raton (2012) 35. Zhu, X., Xindong, W.: Class noise handling for effective cost-sensitive learning by cost-guided iterative classification filtering. IEEE Trans. Knowl. Data Eng. 18(10), 1435–1440 (2006)
Research on Text Classification Modeling Strategy Based on Pre-trained Language Model
Yiou Lin(B), Hang Lei, Xiaoyu Li, and Yu Deng
School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
[email protected]
Abstract. Fine-tuning a pre-trained language model is the current mainstream method of text classification. Taking the fine-tuned BERT model as an example, this kind of approach has three main problems. First, training a massive number of parameters causes high training costs. Second, the model easily over-fits the training samples, resulting in low transferability. Third, the fine-tuned model is not good at long text classification. In this paper, we take the sentiment classification task as an example and use classification accuracy as the metric. For the first problem, we compare two methods: using a compressed language model (accuracy decreased by 0.1%) and using entirely frozen weights (reduced by 4%). For the second problem, with the weights fixed, Convolutional Networks (CNN) and Capsule Networks (CAP) are used to extract n-gram features and clustering features; relative to the fine-tuned baseline, accuracy increased by 0.5% with CAP and decreased by 0.2% with CNN. The corpus transfer test shows that CAP's accuracy is greater than that of the fine-tuned model (increased by 0.6%). Meanwhile, this article proposes a method for long text classification that expands the position embedding to support long text input (increased by 1.1%). Finally, this paper compares the training speed and parameter scale of different models under combinations of the different strategies, and uses the F1 value, precision, recall and accuracy to measure each model.
Keywords: Deep learning · BERT model · Chinese sentiment analysis · Capsule Networks · Position embedding
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 248–260, 2022. https://doi.org/10.1007/978-3-030-82193-7_16
1 Introduction
Text classification is a fundamental topic in Natural Language Processing (NLP). Before a computer program can process a text classification task, it needs a text representation that converts character features into numerical features. In the past, text was represented discretely by sparse vectors, such as the one-hot representation. But using discrete features means that it is impossible to measure the semantic relationship between two language units through Euclidean distance. Therefore, cluster-based language representations (such as Brown clustering word vectors [1]) were proposed. Although similar words are mapped to the same cluster to represent identical word meanings, this still leads to the problem of "multiple words with one meaning". To solve this problem, distributed representations of words were proposed. A distributed representation uses a point in a continuous semantic space to represent a word, so that relationships between words are reflected by relationships between points. With the rise of deep learning, language models (such as Word2vec [2]), which store latent grammatical and semantic features in neural networks, have become the mainstream method of text representation. After that, pre-trained deep neural network language models, represented by BERT [3], received extensive attention from the academic community. Since a fine-tuned model does not need to be learned from scratch, it achieves higher performance with less data and lower computational cost than a model that does not use pre-training.
Taking fine-tuning BERT as an example, there are three main problems:
1) The number of trainable parameters is exceptionally large. The basic version of BERT has 100 million trainable parameters.
2) The fine-tuned model is prone to over-fitting the domain of its corpus. For example, a model that predicts the sentiment of hotel reviews will perform extremely badly when predicting the sentiment of product reviews. Retraining a new fine-tuned model takes time and effort.
3) Because the length of the input text is limited during unsupervised pre-training, text that exceeds this length is truncated. For example, the maximum token sequence length accepted by BERT is 512. The fine-tuned model therefore loses information on long texts that exceed the acceptable input length, and its predictions on them are often less accurate than on short texts.
In response to each of the above problems, this article makes the following contributions:
1) Because of the large number of trainable parameters, this article compares two methods: using a compressed language model and completely freezing the weights.
2) With the weights completely frozen, this article finds that using convolutional networks (CNN) and capsule networks (CAP) to extract n-gram features and clustering features and feed them into the feed-forward classification network recovers and slightly improves the classification accuracy. Since this strategy leaves the pre-trained model unchanged, the resulting model has better domain transfer capability.
3) This paper proposes a method to extend the position embedding. Although the extended position embedding slightly changes the model structure, it can be adapted to text input of any length.
4) Finally, this article compares the training time and parameter scale of the different models constructed under the strategies mentioned above and calculates the F1 value, accuracy and recall of each model.
Fig. 1. Three commonly used pre-trained language model network structure diagrams
2 Related Work
Traditional text classification methods mainly focus on feature representation, feature selection and the choice of classification algorithm. The emergence of fine-tuned pre-trained language models blurs the boundaries between these three issues. The pre-trained language model represented by BERT is a dynamic text representation model based on transfer learning. Dynamic text representation means that the representation of the current text depends not only on the text itself but also on its context. As shown in Fig. 1, Peters et al. [9] proposed the ELMo model based on a two-layer BiLSTM, in which each embedding $T_i$ is related not only to the input $E_i$ but also to the hidden state representing context information. Dynamic text representation carries a wealth of text information. The experiment in [9] shows that the first layer of the ELMo model extracts a large amount of grammatical information, and the second layer extracts semantic information. After that, the Transformer-based GPT model [10] overcame ELMo's shortcoming that it cannot be computed in parallel. Furthermore, the BERT model uses a bidirectional Transformer structure to take the context on both sides into account and better extract effective text information.
The attention mechanism is a recent achievement in deep learning that is used to mine the most representative features of a text. It reduces the need to design and screen hand-crafted features such as TF-IDF (term frequency–inverse document frequency). In 2017, the Transformer first appeared in machine translation [11]. The Transformer addresses the shortcoming of the Recurrent Neural Network (RNN), which cannot be computed in parallel, and the dilemma of CNN's lack of context dependence. In 2018, the Transformer-based dynamic text representation model BERT came into wide use in natural language processing and achieved the best results on 11 natural language processing tasks [3].
Although a pre-trained language model with a simple feed-forward neural network can achieve good results, the latest research is more inclined to regard it as a text representation encoder [7,8]. Researchers uniformly encode character sequences into fixed-size vectors or matrices and use other networks to extract advanced features. CNN and CAP are the most commonly used feature extraction networks. The traditional view is that CNN is good at extracting local features but cannot capture long-distance dependencies and does not consider word order information [4]. Using a pre-trained language model as the underlying text representation makes up for these shortcomings of CNN.
Although the combination of CNN and BERT has been widely used, the convolution operation has no translation invariance, and the pooling operation completely discards position information [5]. Therefore, Hinton et al. proposed capsule networks, and the dynamic routing algorithm makes up for the shortcomings of CNN [12]. CAP borrows ideas from neuroscience and assumes that the brain is composed of a series of modules called capsules. These capsules are good at handling features such as pose (position, size, direction), deformation, velocity, albedo and hue [6]. In 2017, Sabour et al. introduced CAP into handwritten digit classification, replaced each scalar neuron in the original neural network with a vector, and used a dynamic routing algorithm between capsule layers in place of pooling [13]. In 2018, Zhao et al. first introduced CAP into text classification, using CNN to extract the n-gram information of the text and then a three-layer CAP to extract more advanced text features [14]. In addition to directly using pre-trained language models to generate text representations, a considerable number of researchers focus on reducing the model size with as little loss of accuracy as possible [15,16]. Among them, ALBERT is a well-known example of model simplification, which shares all the parameters of the 12-layer attention structure. Experiments show that ALBERT trains significantly faster than the corresponding BERT, and the very large ALBERT-xxlarge surpasses BERT in all aspects [17].
The above works focus on how to better use fine-tuned language models, but the construction strategies they use are not comprehensive, and systematic research on construction strategies is still largely missing. Therefore, this article explores text classification modeling strategies based on pre-trained language models.
3 Model Architecture
3.1 Model Input
Fig. 2. Schematic diagram of BERT input embedding
As shown in Fig. 2, during BERT pre-training, each input sample is composed of a pair of sentences. The basic unit of a sentence is called a "word piece". A certain percentage of word pieces are randomly replaced with masks. The ID of each word piece, its position in the sentence, and the label of the sentence it belongs to are passed through embedding layers to form the word piece embedding, position embedding, and segment embedding, which are added together as the input of BERT. After 12 Transformer layers, the model completes the cloze task (predicting the true value at the mask positions in the figure) and the sentence-pair matching task (judging whether the sentence pair in the figure is consecutive context), so that the model is trained jointly on both. In the fine-tuning stage, supervised training is performed to adapt the model to a specific task. BERT is highly general: almost all NLP tasks can apply this two-stage approach, and the results improve significantly.
At the same time, we can see that it is the position embedding that limits the length of the input text. BERT uses absolute position embeddings trained from random initialization, and the maximum number of positions is generally set to 512. With limited resources, an ideal solution is to find a way to extend the position embedding of the trained BERT. Specifically, assuming that the trained absolute position embeddings are $p_1, p_2, \ldots, p_n$, we hope to construct a new set of absolute position embeddings $q_1, q_2, \ldots, q_m$, where $m > n$. To this end, we set
$$q_{(i-1)\times n+j} = p_j + \alpha\,(p_i - p_1) \tag{1}$$
where $i \in [1, n]$, $j \in [1, n]$, and $\alpha$ is a scaling coefficient that we empirically set to 0.9. This gives $n^2$ position embeddings, of which the first $n$ are compatible with the original BERT model.
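The construction in Eq. (1) can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the function name `extend_position_embeddings`, the list-of-lists representation, and the toy vectors are assumptions made here, and a real model would apply the same rule to its trained embedding matrix.

```python
def extend_position_embeddings(p, alpha=0.9):
    """Build n * n position vectors from n trained ones following Eq. (1).

    p is a list of n vectors (lists of floats); p[0] plays the role of p_1.
    With 1-based i and j, the result satisfies
    q[(i - 1) * n + j] = p[j] + alpha * (p[i] - p[1]).
    """
    n = len(p)
    q = []
    for i in range(n):  # block index, i - 1 in the paper's notation
        # Shared shift alpha * (p_i - p_1) for every position in block i.
        shift = [alpha * (a - b) for a, b in zip(p[i], p[0])]
        for j in range(n):  # offset inside the block
            q.append([x + s for x, s in zip(p[j], shift)])
    return q


# Toy example (hypothetical values): n = 3 trained positions of dimension 2.
p = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]]
q = extend_position_embeddings(p, alpha=0.9)
print(len(q))       # n * n positions
print(q[:3] == p)   # the first n vectors are the original ones
```

For the first block the shift is zero, which is why the first n extended vectors coincide with the trained ones and remain compatible with the original model.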
3.2 Transformer
Fig. 3. Schematic diagram of transformer layer connected to the embedding layer
As shown in Fig. 3, each Transformer layer of the BERT model contains two sub-layers, namely the self-attention layer and the feed-forward layer. Attention is computed with the following similarity formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2}$$
where $Q$, $K$, and $V$ are the query, key, and value matrices for calculating self-attention, respectively; $QK^{T}$ is the attention matrix, which weights the $V$ matrix; and $d_k$ is the dimension of the key. The self-attention model is a special case of the attention model in which $Q$, $K$, and $V$ are all derived from the same matrix. Similar to the concurrent operation of multiple sets of convolution kernels, BERT's Transformer linearly maps the input tensor into multiple sets of tensors and performs self-attention on each set concurrently, which is called multi-head attention. In terms of parallelism, the multi-head attention model, like CNN, does not rely on previous calculations, so it parallelizes well and is better than RNN in this respect. In terms of long-distance dependence, since the self-attention model computes attention between each word and all words, the maximum path length between any two words is only 1 regardless of the distance between them, so long-distance dependencies can still be captured. Each sub-layer (self-attention, feed-forward network) in each encoder has a residual connection around it and is followed by a layer-normalization step. The output of one encoder is used as the input of the next, and 12 Transformer layers form the complete BERT encoding network. Unlike the original application scenario of the Transformer, BERT is only used to extract dynamic representations of text and does not require a decoder.
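The attention formula in Eq. (2) can be illustrated with a minimal, dependency-free sketch. It is a simplification under stated assumptions: matrices are lists of rows, there is a single head, and the toy example feeds the same matrix as Q, K and V without the learned linear projections that BERT applies. The names `softmax` and `attention` are chosen here for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention of Eq. (2) on row-major matrices."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One row of Q K^T / sqrt(d_k): similarity of this query to every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value rows.
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out


# Toy self-attention: Q, K and V are the same matrix X (no learned projections).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = attention(X, X, X)
print(len(Y), len(Y[0]))  # one output row per input token
```

Because every query is compared with every key in a single step, the path between any two tokens has length 1, which is the long-distance property described above.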
3.3 Capsule Networks
Since CNN is widely known, this article only introduces the construction of CAP. The essence of neural network computation is tensor transformation. Unlike the scalar-based computation of ordinary feed-forward neurons, each neuron (capsule) receives a vector as input (this can also be extended to a higher-dimensional tensor). The capsule network uses a dynamic routing algorithm to iteratively compute the cluster centers of the lower-level capsules as the outputs of the higher-level capsules. The calculation follows Algorithm 1 [14]: assuming that capsule $i$ of layer $l$ is connected to capsule $j$ of layer $l + 1$, the output-input relationship between the two capsules is expressed as
$$\hat{u}_{j|i} = W_{j|i}\, v_i \tag{3}$$
where $W_{j|i}$ is the trainable weight matrix. The dynamic routing algorithm executed $r$ times at layer $l + 1$ is:
Algorithm 1: The Dynamic Routing Algorithm
Input: input parameters $\hat{u}_{j|i}$, $r$, $l$
Output: output result $v_j$
1 For each capsule $i$ in layer $l$ and each capsule $j$ in layer $(l + 1)$: $b_{ij} \leftarrow 0$;
2 for $r > 0$ do
3   For each capsule $i$ in layer $l$: $c_i \leftarrow \mathrm{softmax}(b_i)$;
4   For each capsule $j$ in layer $l + 1$: $s_j \leftarrow \sum_i c_{ij}\, \hat{u}_{j|i}$;
5   For each capsule $j$ in layer $l + 1$: $v_j \leftarrow \mathrm{squash}(s_j)$;
6   For each capsule $i$ in layer $l$ and each capsule $j$ in layer $(l + 1)$: $b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j$;
7   $r = r - 1$
The compression function of step 5 is defined as follows:
$$\mathrm{squash}(s_j) = \frac{\lVert s_j \rVert^2}{1 + \lVert s_j \rVert^2} \cdot \frac{s_j}{\lVert s_j \rVert} \tag{4}$$
The compression formula is an innovation of CAP: it is a new type of nonlinear activation function computed on vectors. Its main function is to keep the length of the output vector $v_j$ from exceeding 1 while preserving the direction of $s_j$. After experimental screening, we set $r$ to 3 and $l$ to 1. The output of the first layer of CAP is taken to be the output of BERT, and the number of neurons in the second capsule layer is set to 32 with a vector length of 18.
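Algorithm 1 and the squash function of Eq. (4) can be sketched as follows. This is a minimal illustration in plain Python, assuming the prediction vectors of Eq. (3) have already been computed and are stored as `u_hat[i][j]`; the helper names and the toy input are hypothetical, and the zero-vector guard in `squash` is an addition made here to avoid dividing by zero.

```python
import math


def squash(s):
    """Eq. (4): shrink the length of s into [0, 1) and keep its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)  # guard added for the sketch
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, r=3):
    """Algorithm 1. u_hat[i][j] is the prediction vector from capsule i to capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]          # step 1: routing logits
    v = []
    for _ in range(r):                                # step 2: r iterations
        c = [softmax(b_i) for b_i in b]               # step 3: coupling coefficients
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]               # step 4: weighted sum
            v.append(squash(s_j))                     # step 5: squash
        for i in range(n_in):                         # step 6: agreement update
            for j in range(n_out):
                b[i][j] += sum(x * y for x, y in zip(u_hat[i][j], v[j]))
    return v


# Hypothetical toy input: 2 lower capsules, 2 upper capsules, vectors of length 2.
u_hat = [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [0.0, -1.0]]]
v = dynamic_routing(u_hat, r=3)
print([round(math.hypot(*v_j), 3) for v_j in v])  # every output length is below 1
```

In the toy input both lower capsules agree on the first upper capsule and disagree on the second, so routing shifts the coupling coefficients toward the first one over the iterations.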
3.4 Model Framework
Fig. 4. Schematic diagram of BERT input coding
The experimental verification framework proposed in this paper is shown in Fig. 4. The bottom layer (module one) is the BERT-base model, which is the core module for dealing with problems one and three. Module two is the output matrix of module one, whose first column vector is generally regarded as the sentence vector; in the baseline fine-tuning model, it is the only input of the classification network. Module three performs advanced feature extraction; the figure uses CAP as an example. To avoid trial and error in fine-tuning, this article uses only one layer of CAP with a dimension of 32 and 18 capsules. Module three can also use CNN in place of CAP. The experiment sets up three 1-D convolution kernels with window sizes of 1, 2 and 3, the number of convolution kernels is 128, and max pooling is used to obtain three one-dimensional vectors. Finally, the three output vectors computed by these three convolution kernels are concatenated and fed to the fully connected network. In module four, the softmax function is used as the activation function, and cross-entropy is used as the loss function to train the model parameters.
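The CNN variant of module three (1-D convolutions over the token vectors with window sizes 1, 2 and 3, max pooling, then concatenation) can be sketched as below. This is a simplified illustration rather than the experimental code: it assumes filters without bias or activation, uses one hypothetical filter per window size instead of 128, and assumes the text has at least as many tokens as the largest window.

```python
def conv_max_pool(tokens, kernel):
    """Apply one 1-D filter of width len(kernel) and max-pool over positions."""
    width = len(kernel)
    activations = []
    for t in range(len(tokens) - width + 1):
        window = tokens[t:t + width]
        # Dot product between the filter and the window of token vectors.
        activations.append(sum(w * x
                               for k_row, x_row in zip(kernel, window)
                               for w, x in zip(k_row, x_row)))
    return max(activations)


def cnn_features(tokens, filters_by_width):
    """Concatenate the max-pooled outputs of every filter, grouped by window size."""
    features = []
    for width in sorted(filters_by_width):
        features.extend(conv_max_pool(tokens, k) for k in filters_by_width[width])
    return features


# Hypothetical toy input: 3 token vectors of dimension 2, one filter per window size.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filters_by_width = {
    1: [[[1.0, 1.0]]],
    2: [[[1.0, 0.0], [0.0, 1.0]]],
    3: [[[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]],
}
print(cnn_features(tokens, filters_by_width))
```

The concatenated feature vector is what would be passed to the fully connected network and the softmax classifier of module four.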
4 Experiment Design and Analysis
4.1 Experiment Corpus
Chinese sentiment classification corpora are not only scarce but also of uneven quality. This paper selects and organizes three public Chinese sentiment corpora¹. The first, MioChnCorp-2, is obtained by de-duplicating the corpus of [18] and contains 120,000 balanced two-category samples. Corpus 2 is the Chinese data set of the 2014 NLPCC Conference Sentiment Analysis and Evaluation Task (NLPCC-SCDL), with a total of 12,500 balanced samples. Su-CD is a public commercial evaluation corpus with a total of 21,000 balanced samples. Table 1 describes these three corpora in detail, where L is the average sample length in characters, V is the size of the character dictionary, Train is the number of training samples, and Test is the number of test samples.
Table 1. Detailed description of the three corpora
Corpus | L | V | Train | Test
MioChnCorp-2 | 84.1 | 6090 | 100000 | 20000
NLPCC-SCDL | 100.4 | 4778 | 10000 | 2500
Su-CD | 63.8 | 4304 | 10522 | 10523
¹ https://pan.baidu.com/s/1GrgqQXk5vg6aiaJaZhUAew password: lel4.
4.2 Evaluation Metrics
This paper uses four evaluation metrics, namely accuracy, precision, recall and the F1 measure, to evaluate the classification performance of the sentiment classification models. For a binary classification problem, let N be the total sample size. Let TP be the number of samples predicted to be positive that are actually positive, TN the number predicted to be negative that are actually negative, FP the number predicted to be positive that are actually negative, and FN the number predicted to be negative that are actually positive. Then
$$P = \frac{TP}{TP + FP} \tag{5}$$
$$R = \frac{TP}{TP + FN} \tag{6}$$
$$F_1 = \frac{2PR}{P + R} \tag{7}$$
$$A = \frac{TP + TN}{TP + FP + TN + FN} \tag{8}$$
In the above formulas, $F_1$ denotes the F1 measure, $P$ the precision, $R$ the recall, and $A$ the accuracy.
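Equations (5) to (8) translate directly into code. The short sketch below computes the four metrics from confusion counts; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper's experiments.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return precision, recall, F1 and accuracy as defined in Eqs. (5)-(8)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    a = (tp + tn) / (tp + fp + tn + fn)
    return p, r, f1, a


# Hypothetical confusion counts, not taken from the paper's experiments.
p, r, f1, a = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f} A={a:.3f}")
```

To obtain the metrics of the negative class, as reported alongside the positive class in the tables, the same function can be called with the roles of the two classes swapped (TP with TN and FP with FN).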
4.3 Experimental Setup
The experimental setup in this paper is as follows: 20% of the training set is held out as the validation set. The number of training epochs is epoch = 10, the maximum length of each text is text_size = 256, and dropout = 0.2. To reduce the risk of model over-fitting, the early-stopping parameter is set to early_stop = 100, that is, if the model does not significantly improve the validation set index after training for 1000 batches, the training is ended early.
4.4 Comparative Experiment
This paper tested the following five models on MioChnCorp-2. The benchmark model is the fine-tuning BERT model with all parameters trainable.
1) Fine-tuning BERT: use the pre-trained model to directly output the sentence vector and feed it into the classifier, with all parameters trainable.
2) Fixed BERT + CAP: use the capsule network to cluster the vectors of each word piece output by the pre-trained model, and then feed the result into the classifier.
3) Fixed BERT + CNN: use CNN to extract the local features of the dynamic text representation, pool and concatenate them, and then feed them into the classification network.
4) Fixed BERT + CAP + CNN: concatenate the clustering features and local features, and then feed them into the classifier.
5) Fine-tuning BERT + CAP + CNN: concatenate the clustering features and local features, and then feed them into the classifier, with all parameters trainable.
The evaluation results are shown in Table 2.
Table 2. Evaluation results of the four comparison models
Model ID | Positive P | Positive R | Positive F1 | Negative P | Negative R | Negative F1 | A
1 | 0.938 | 0.911 | 0.924 | 0.914 | 0.941 | 0.927 | 0.925
2 | 0.936 | 0.924 | 0.930 | 0.925 | 0.937 | 0.931 | 0.930
3 | 0.922 | 0.924 | 0.923 | 0.924 | 0.923 | 0.923 | 0.923
4 | 0.927 | 0.936 | 0.931 | 0.936 | 0.926 | 0.931 | 0.931
5 | 0.953 | 0.894 | 0.923 | 0.901 | 0.957 | 0.928 | 0.926
4.5 Ablation Experiment
This section compares the training speed and parameter scale of the models under different construction strategies and measures the average F1 value and accuracy of each model. The evaluation results are shown in Table 3.
Table 3. Evaluation results of models built by different strategies
Language model | Trainable layer | CAP | CNN | Max length | Each step | Trainable parameters | F1 | A
BERT | All | √ | √ | 256 | 1.31 s | 102.7 M | 0.925 | 0.926
 | | √ | √ | 256 | 0.45 s | 1.0 M | 0.931 | 0.931
 | | √ | | 256 | 0.36 s | 443.5 K | 0.931 | 0.930
 | | | √ | 256 | 0.43 s | 591.0 K | 0.923 | 0.923
 | All | | | 256 | 1.17 s | 101.6 M | 0.926 | 0.925
 | | | | 256 | 0.33 s | 1.5 K | 0.875 | 0.879
ALBERT | All | √ | √ | 256 | 1.16 s | 9.2 M | 0.924 | 0.925
 | | √ | √ | 256 | 0.43 s | 1.0 M | 0.930 | 0.930
 | | √ | √ | 768 | 0.43 s | 1.0 M | 0.941 | 0.941
 | | √ | | 768 | 0.35 s | 443.5 K | 0.937 | 0.938
 | | | √ | 768 | 0.41 s | 591.0 K | 0.933 | 0.922
4.6 Experiment Analysis
Comparing Model 2 and Model 3 in Table 2, we find that both CAP and CNN help extract advanced features. CAP is 0.7% higher than CNN, so it is more effective. Table 2 also shows that Model 4 achieves the best results and is 0.6% higher than Model 1 in F1, which means that fine-tuning the language model may be unnecessary. Comparing Model 4 and Model 5, we find that fine-tuning the language model even reduces the best result by 0.5%, which means that it may even be harmful to the extraction of high-level semantic features. Observing Model 1 and Model 5, we find that after fine-tuning the language model, the final model is more inclined to give unbalanced predictions. We then replaced the test corpus with corpus 2 and corpus 3. The evaluation on the transfer corpora is shown in Fig. 5. The results show that the model without fine-tuning has better generalization ability, with an increase in F1 of about 1%. Compared with the best classification accuracy of 0.915 reported for Dynamic Convolutional Neural Networks [19], the pre-trained language model represented by BERT significantly improves performance on text classification tasks.
Fig. 5. Results of the corpus transfer test
From the ablation experiment we find:
1) The final result of using BERT as the pre-trained model is better than that of the corresponding ALBERT, but the training speed is reduced by 10%.
2) The computing time and resources of models based on a pre-trained language model are mainly spent on computing the dynamic representation and updating its weights.
3) The position embedding extension proposed in this paper can greatly improve the accuracy. We measured the accuracy of Model 2 in each sample length interval with a maximum input length of 256 and of 768. The result is shown in Fig. 6. When the sample length is greater than 256, the model based on the extended position embedding has an obvious advantage.
Fig. 6. Results of the corpus transfer test
5 Conclusion and Future Work
At present, fine-tuning a pre-trained language model is the mainstream method for NLP tasks. This article discussed modeling strategies based on pre-trained language models. To deal with the huge number of parameters in a pre-trained language model (such as BERT), we find that using a compressed pre-trained model (such as ALBERT) significantly reduces the number of parameters with a minimal loss of accuracy. This paper also finds that changing the pre-trained model's weights may be unnecessary and can even hurt further feature extraction. Finally, this paper finds that using CAP and CNN to extract the clustering features and local features of the dynamic text representation gives better generalization ability and avoids retraining a massive number of parameters without affecting accuracy. For the long text classification task, this paper proposes an extended position embedding method so that the BERT model can support text input of any length. Although the model construction strategies proposed in this article achieve significant results, they also have some apparent shortcomings, in particular that they need more training rounds than traditional fine-tuning of the language model. In the future, we will study a compressed language model based on knowledge distillation to further accelerate training, and use the attention mechanism to accelerate model convergence.
References
1. Brown, P.F., Della Pietra, V.J., Desouza, P.V., Lai, J.C., Mercer, R.L.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–480 (1992)
2. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, vol. 26, pp. 3111–3119 (2013)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
4. Peng, H., et al.: Hierarchical taxonomy-aware and attentional graph capsule RCNNs for large-scale multi-label text classification. IEEE Trans. Knowl. Data Eng. 33(6), 2505–2519 (2019)
5. Safaya, A., Abdullatif, M., Yuret, D.: KUISAIL at SemEval-2020 Task 12: BERT-CNN for offensive speech identification in social media. In: Proceedings of the Fourteenth Workshop on Semantic Evaluation, pp. 2054–2059 (2020)
6. Wang, Z., et al.: A novel method for intelligent fault diagnosis of bearing based on capsule neural network. Complexity (2019)
7. Liu, Y.: Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318 (2019)
8. Rodrigues Makiuchi, M., Warnita, T., Uto, K., Shinoda, K.: Multimodal fusion of BERT-CNN and gated CNN representations for depression detection. In: Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pp. 55–63, October 2019
9. Peters, M.E., et al.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
10. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding by generative pre-training (2018). https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
11. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
12. Hinton, G.E., Krizhevsky, A., Wang, S.D.: Transforming auto-encoders. In: Honkela, T., Duch, W., Girolami, M., Kaski, S. (eds.) ICANN 2011. LNCS, vol. 6791, pp. 44–51. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-21735-7_6
13. Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. In: Advances in Neural Information Processing Systems, pp. 3856–3866 (2017)
14. Zhao, W., Ye, J., Yang, M., Lei, Z., Zhang, S., Zhao, Z.: Investigating capsule networks with dynamic routing for text classification. arXiv preprint arXiv:1804.00538 (2018)
15. Frosst, N., Hinton, G.: Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784 (2017)
16. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
17. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
18. Lin, Y., Lei, H., Wu, J., Li, X.: An empirical study on sentiment classification of Chinese review using word embedding. arXiv preprint arXiv:1511.01665 (2015)
19. Jia, X., Li, N., Jin, Y.: Dynamic convolutional neural network extreme learning machine for text sentiment classification. J. Beijing Univ. Technol. (01), 28–35 (2017)
Discovering Nonlinear Dynamics Through Scientific Machine Learning
Lei Huang1(B), Daniel Vrinceanu2, Yunjiao Wang2, Nalinda Kulathunga2, and Nishath Ranasinghe1
1 Department of Computer Science, Prairie View A&M University, Prairie View, TX 77446, USA [email protected]
2 Texas Southern University, Houston, TX 77004, USA https://computinglab.wixsite.com/computinglab
Abstract. Scientific Machine Learning (SciML) is a new multidisciplinary methodology that combines data-driven machine learning models with principle-based computational models to improve the simulation of scientific phenomena and to uncover new scientific rules from existing measurements. This article reports our experience of using the SciML method to discover nonlinear dynamics that may be hard to model, or unknown, in real-world scenarios. The SciML method solves the traditional principle-based differential equations by integrating a neural network to accurately model the nonlinear dynamics while respecting the scientific constraints and principles. The paper discusses the latest SciML models and applies them to oscillator simulations and an experiment. Besides a better capacity to simulate and to match the observations, the results also demonstrate a successful discovery of the hidden physics in the pendulum dynamics using SciML.
Keywords: Scientific machine learning · Scientific simulation · Computational science · Nonlinear dynamics
1 Introduction
Scientific Machine Learning (SciML) [1] has recently emerged as a new method for solving scientific computing problems with machine learning models. The method leverages the success of traditional scientific computational models and the advances in data-driven machine learning models to improve the efficiency and accuracy of scientific simulation and inversion. Moreover, it facilitates scientific discovery by modeling both well-known scientific rules and unknown patterns based on observed data. Traditional scientific computational models are mostly developed to simulate physical, chemical, biological, and other scientific phenomena by using various numerical methods, such as the finite difference or finite element methods, to solve a variety of differential equations. These methods can achieve highly accurate simulation results; however, they are also notoriously expensive in terms of computational resources. This is why scientific computing is typically conducted on supercomputers using complex programming models. Moreover, scientific computing depends heavily on our understanding of the theoretical principles, which do not fully represent the complexity and nonlinear dynamics of many real-world phenomena. Often, the parameters and other constraints are not well known, and simulation scientists have nothing better to rely on than educated guesses.
In theory, SciML combines deterministic scientific principles with the universal approximation capability of machine learning to provide a more efficient, yet reliable and explainable, model-based data-driven solution. SciML provides a sound theoretical foundation for unveiling new governing rules from a collection of data and models. For example, SciML may integrate a deep learning model into a partial differential equation (PDE) to fit the observed data: the PDE models the well-known principles, and the deep learning model captures the unknown portion, such as noise and friction.
The latest theoretical and practical advances in machine learning, especially deep learning, have dramatically increased the capacity and accuracy of the universal approximation of nonlinear functions. Despite this progress, learning a complex system with nonlinear dynamics or chaos using deep learning alone is still neither reliable nor explainable. Moreover, training a deep learning model to cover all possible features requires huge, often unrealistically large, data sets. Even if we can successfully train a deep learning-based surrogate model, the model's ability to extrapolate remains questionable. It would be much more reliable and explainable to embed the physical principles that determine the nonlinear dynamics and to leave only the unknown functions to machine learning.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 261–279, 2022. https://doi.org/10.1007/978-3-030-82193-7_17
SciML brings together the strengths of computational science and data science to improve the accuracy, performance, and interpretability of scientific simulation. Moreover, it may unveil scientific rules hidden inside the nonlinear dynamics learned by the machine learning models. In this paper, we apply four recent SciML models to a couple of physics experiments and report our experience with the benefits and limitations of the SciML method. Section 2 describes the state of the art of SciML models; Sect. 3 presents the physics experiments, simulations, and data collection; Sect. 4 discusses the results of the SciML models; and Sect. 5 concludes the paper.
2 Scientific Machine Learning Models
Scientific machine learning is developed to facilitate scientific computation, either by developing a surrogate model that replaces the numerical model or by combining the numerical model with a data-driven model to achieve better accuracy and performance. In this paper, we apply Physics-Informed Neural Networks (PINNs) [29], the Universal Differential Equation (UDE) method [27], Hamiltonian Neural Networks (HNNs) [11], and Neural Ordinary Differential Equations (NODE) [7] to learn the nonlinear dynamics in several physics experiments.
2.1 Physics-Informed Neural Networks
Physics-Informed Neural Networks (PINNs) are a SciML approach that solves differential equations by modeling the latent solution u(t, x) directly with a deep learning model, taking advantage of the automatic differentiation functionality [2] of machine learning (ML) software. The solution u(t, x) is replaced by a neural network or another machine learning model whose derivatives are required to satisfy the governing differential equation. For example, the harmonic pendulum differential equation is given in Eq. (1):
\[ \frac{d^2\theta}{dt^2} + \frac{g}{L}\sin(\theta) = 0 \tag{1} \]
where g is the gravitational acceleration, L is the length of the pendulum, and θ is the angle with respect to the vertical in radians. PINNs redefine the equation by substituting its solution θ(t) with a neural network N_p(t), where N is the neural network and p denotes its trainable parameters. The new equation is given in Eq. (2), in which θ stands for the network output N_p(t):
\[ \frac{d^2 N_p(t)}{dt^2} + \frac{g}{L}\sin(\theta) = 0 \tag{2} \]
By defining a loss function that minimizes the residual of Eq. (2), PINNs use the automatic differentiation capability of machine learning software to compute the second-order derivative of N_p(t) and to optimize the loss function. As a result, N_p(t) is an approximation of the solution of Eq. (1). Furthermore, PINNs can be used effectively to solve both forward and inverse problems with minimal modifications to the computational code [18]. Additionally, a Petrov-Galerkin [10] version of PINNs has been employed to solve the variational form of PDEs to reduce the training cost [14]. Likewise, modified versions of PINNs have been employed to solve fractional differential equations [23] and stochastic differential equations [34]. To address the lack of uncertainty quantification in PINNs, Zhang et al. [35] proposed using multiple deep neural networks to quantify the parametric uncertainty, and dropout to model the uncertainty stemming from the neural network approximation.
As an effort to develop a theoretical basis for PINNs, Shin et al. [31] studied the convergence of the sequence of minimizers generated by PINNs, corresponding to a sequence of neural networks, to the solution of the given PDE. They found that the sequence of minimizers converges strongly to the PDE solution in L2 space and that each minimizer satisfies both the initial and boundary conditions.
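The loss function built from Eq. (2) can be written out explicitly. The following is only a sketch of one common form, not necessarily the one used in this paper: it assumes N collocation times t_i in the training interval, initial values θ_0 and ω_0, and a penalty weight λ for the initial conditions. The symbols N, t_i, θ_0, ω_0, and λ are illustrative and do not appear in the original text.

```latex
\mathcal{L}(p) =
\frac{1}{N}\sum_{i=1}^{N}
\left(
  \frac{d^2 N_p}{dt^2}(t_i) + \frac{g}{L}\sin\bigl(N_p(t_i)\bigr)
\right)^{2}
+ \lambda\left[
  \bigl(N_p(0) - \theta_0\bigr)^{2}
  + \Bigl(\frac{dN_p}{dt}(0) - \omega_0\Bigr)^{2}
\right]
```

The first term penalizes the residual of the differential equation at the collocation times, with the derivatives of N_p supplied by automatic differentiation; the second term ties the network to the initial state, without which the trivial solution N_p ≡ 0 would also minimize the residual.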
2.2 Universal Differential Equations
The Universal Differential Equations (UDE) method has some similarities with the PINN method: both rely on scientific principles, represented as differential equations, to guide the computation and impose constraints. However, UDE is more flexible in modeling the unknown functions and combining them with existing scientific knowledge. UDE relies on numerical differential equation solvers to solve the problem while learning the unknown functions during the calculation. The pendulum equation using UDE is given in Eq. (3), which introduces a neural network N_p that represents the unknown function in the experiment, such as the friction and/or the external force. As a result, the UDE solution fits the observed experimental results described in Sect. 3 better.
\[ \frac{d^2\theta}{dt^2} + \frac{g}{L}\sin(\theta) + N_p(u) = 0 \tag{3} \]
The method designs a machine learning model that represents the unknown physical functions, while the ODE is computed numerically by the ODE solver. The benefit of the UDE method is that the neural network does not have to learn the full dynamics, which may be extremely hard or even impossible because of the nonlinear dynamics in chaotic systems. If we simply use a purely data-driven, statistics-based machine learning model, the approximation error of the neural network may cause the long-term predictions to diverge considerably. It is hard to believe that a neural network's universal approximation alone can be accurate enough for predicting nonlinear dynamics; the physical principles of the dynamics need to be honored during the calculation. The UDE method applies the powerful universal approximation capacity of machine learning while respecting physical constraints such as symmetry, invariance, and conservation.
2.3 Hamiltonian Neural Networks
In classical mechanics, Hamiltonian equations are widely adopted to model the continuous time evolution of dynamic systems with physically conserved quantities such as energy, and they can be used effectively to predict the phase space of dynamic systems using the current state of the generalized position and momentum. Additionally, Hamiltonian mechanics is smooth and time reversible, and it provides integral paths that conserve certain physical quantities such as energy. Greydanus et al. [11] introduced Hamiltonian neural networks (HNNs) by incorporating the Hamiltonian equations into the loss function of the neural network to learn the Hamiltonian of simple systems from noisy phase space data. Additionally, Toth et al. [33] used a generative model to infer the Hamiltonian of dynamic systems from high-dimensional data (pixels). Unlike other studies using the HNN method, Mattheakis et al. [20] embedded physical constraints into the structure of the neural network using the Hamiltonian equations. HNNs may also help reduce the expensive computational cost of solving scientific problems. The HNN method creates a neural network N_p that meets the following requirements:
\[ \frac{dx_1}{dt} = \frac{\partial H}{\partial x_2} = \frac{\partial N_p}{\partial x_2}, \qquad \frac{dx_2}{dt} = -\frac{\partial H}{\partial x_1} = -\frac{\partial N_p}{\partial x_1} \tag{4} \]
where (x_1, x_2) are the two inputs of the HNN and denote the position and momentum, respectively.
2.4 Neural Ordinary Differential Equations (Neural ODE)
Since it was first observed by Weinan E [8], the relation between ResNet [13] and dynamical systems has been widely explored and used to increase the capability and stability of deep networks [4–6,17,19,32]. Connecting deep networks with ODEs was largely inspired by the success of ResNet, whose network architecture is strikingly similar to the well-known Euler method for differential equations. With this observation, a natural idea is to generalize existing numerical methods to deep networks [19,36]. Neural ODE, proposed by Chen et al. [6], went one step further: it replaces deep networks such as ResNet with existing efficient ODE solvers. One key difference between solving ODEs and training deep networks lies in their goals. The goal of training a deep network is to find a function that fits the data, while the goal of a numerical ODE method is to approximate the solution of the ODE. The idea of the dynamical systems approach to deep networks is to tune the vector field so that its flow map can reproduce the nonlinear functions needed to fit the data [8]. More specifically, consider
\[ \frac{dz}{dt} = f(z, t, p), \qquad z(0) = x \tag{5} \]
Let z(t, x, p) be the solution of the initial value problem (5), let T > 0 be a fixed time, and let p be a set of parameters. The flow map x → z(T, x, p) defines a function from input to output, which is generally nonlinear [8]. Here f could be a neural network that models the vector field. Neural ODE uses existing solvers to solve the ODE (5) for a given set of parameter and input values. After the ODE is solved, the values of p are adjusted and the process is repeated to find the optimal values of p, so that the flow map fits the data best. As in regular optimization, this process requires computing the gradient of a designed loss function with respect to p. A notable benefit of the Neural ODE approach is that the computation of the gradient is easier and independent of the solver, and it can be carried out by the classical adjoint sensitivity method [26]. Another benefit is that Neural ODE can naturally be used for time-dependent data, such as the pendulum data discussed in this paper. A computational disadvantage is that the ODE solver often requires a larger number of function evaluations than a standard deep network, and this tends to get worse over the course of training [16].
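The solve, compare, adjust loop described for the initial value problem (5) can be illustrated with a very small example. The Python sketch below is not the Neural ODE implementation: it assumes a one-parameter linear vector field f(z, t, p) = p z instead of a neural network, a hand-written RK4 integrator instead of a library solver, and a central finite-difference gradient instead of the adjoint sensitivity method. All function names and the synthetic data are illustrative.

```python
import math

def f(z, t, p):
    """Illustrative one-parameter vector field dz/dt = p * z (an assumption)."""
    return p * z

def flow_map(x, p, T=1.0, steps=50):
    """Approximate z(T, x, p) for dz/dt = f(z, t, p), z(0) = x, with classical RK4."""
    h = T / steps
    z, t = x, 0.0
    for _ in range(steps):
        k1 = f(z, t, p)
        k2 = f(z + 0.5 * h * k1, t + 0.5 * h, p)
        k3 = f(z + 0.5 * h * k2, t + 0.5 * h, p)
        k4 = f(z + h * k3, t + h, p)
        z += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += h
    return z

def loss(p, data):
    """Mean squared mismatch between the flow map and the target outputs."""
    return sum((flow_map(x, p) - y) ** 2 for x, y in data) / len(data)

def fit(data, p=0.0, lr=0.05, iters=200, eps=1e-6):
    """Solve, compare, adjust p, repeat; the gradient is a central finite difference."""
    for _ in range(iters):
        grad = (loss(p + eps, data) - loss(p - eps, data)) / (2.0 * eps)
        p -= lr * grad
    return p

# Synthetic input/output pairs from the exact flow map x -> x * exp(p_true * T), T = 1.
p_true = 0.7
data = [(x, x * math.exp(p_true)) for x in (0.5, 1.0, 1.5, 2.0)]
p_fit = fit(data)
print(p_fit)
```

In a real Neural ODE, f would be a neural network with many parameters, which is why a finite-difference gradient becomes impractical and the adjoint method is used instead.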
3 Physical Experiments
3.1 Quadruple Spring Mass System
A quadruple-spring-mass system that is allowed to move on a 2-D frictionless surface (Fig. 1) can exhibit simple harmonic motion as well as non-linear dynamic motion. The motion depends on the initial conditions of the system and on the physical properties of the springs (i.e., the spring constants and the unstretched lengths of the springs). For simplicity, we only consider massless springs here. The time-independent Hamiltonian of the quadruple-spring-mass system can be written in terms of the generalized position (x, y) and momentum (p_x, p_y) as
\[ H = \frac{p_x^2 + p_y^2}{2m} + \frac{1}{2}\sum_{i=1}^{4} k_i (l_i - a_i)^2 \tag{6} \]
where a_i are the unstretched lengths of the springs, l_i are the stretched lengths of the springs, k_i are the spring constants, and m is the mass of the particle in the middle.
Fig. 1. Quadruple-springs-mass system
Here, $l_1 = \sqrt{(a_1 + x)^2 + y^2}$, $l_2 = \sqrt{(a_2 - x)^2 + y^2}$, $l_3 = \sqrt{(a_3 - y)^2 + x^2}$, and $l_4 = \sqrt{(a_4 + y)^2 + x^2}$. We simulate one instance of simple harmonic motion and one instance of non-linear dynamic motion of the quadruple-spring-mass system by solving the Hamiltonian equations (Eq. 4) with the Hamiltonian given in Eq. 6. The initial conditions for the simple harmonic motion and the non-linear dynamic motion are given in Table 1. The data were generated over a time span of 5π in both experiments.
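How the trajectories follow from the Hamiltonian can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' simulation code: the partial derivatives of H are approximated by central differences, the time integration uses a hand-written RK4 scheme with an assumed step count, and only the simple harmonic motion case (all a_i = k_i = 1, m = 1) is run. The function names are hypothetical.

```python
import math

def hamiltonian(x, y, px, py, a=(1.0, 1.0, 1.0, 1.0), k=(1.0, 1.0, 1.0, 1.0), m=1.0):
    """Eq. (6), with the stretched lengths l_i written as Euclidean distances."""
    lengths = (math.hypot(a[0] + x, y), math.hypot(a[1] - x, y),
               math.hypot(a[2] - y, x), math.hypot(a[3] + y, x))
    kinetic = (px ** 2 + py ** 2) / (2.0 * m)
    potential = 0.5 * sum(ki * (li - ai) ** 2 for ki, li, ai in zip(k, lengths, a))
    return kinetic + potential

def rhs(state, eps=1e-6):
    """Hamilton's equations; partial derivatives of H by central differences."""
    def dH(i):
        up, down = list(state), list(state)
        up[i] += eps
        down[i] -= eps
        return (hamiltonian(*up) - hamiltonian(*down)) / (2.0 * eps)
    # state = [x, y, px, py]:
    # dx/dt = dH/dpx, dy/dt = dH/dpy, dpx/dt = -dH/dx, dpy/dt = -dH/dy
    return [dH(2), dH(3), -dH(0), -dH(1)]

def rk4_step(state, h):
    def shifted(k, c):
        return [s + c * ki for s, ki in zip(state, k)]
    k1 = rhs(state)
    k2 = rhs(shifted(k1, 0.5 * h))
    k3 = rhs(shifted(k2, 0.5 * h))
    k4 = rhs(shifted(k3, h))
    return [s + h * (a + 2.0 * b + 2.0 * c + d) / 6.0
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)]

def simulate(state, t_end=5.0 * math.pi, n_steps=1600):
    """Integrate over a time span of 5*pi; n_steps is an assumed resolution."""
    h = t_end / n_steps
    trajectory = [list(state)]
    for _ in range(n_steps):
        state = rk4_step(state, h)
        trajectory.append(state)
    return trajectory

shm_start = [-0.2, -0.2, 0.1, 0.1]  # x0, y0, Px0, Py0 for the SHM case
trajectory = simulate(shm_start)
energy_drift = abs(hamiltonian(*trajectory[-1]) - hamiltonian(*trajectory[0]))
print(energy_drift)
```

Because H has no explicit time dependence, its value should stay nearly constant along the computed trajectory, which makes the energy drift a convenient sanity check. An HNN would take the place of `hamiltonian`, with automatic differentiation replacing the finite differences.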
Table 1. Initial conditions for the simple harmonic motion (SHM) and non-linear dynamic motion
Motion type | Unstretched length (a1, a2, a3, a4) | Spring const. (k1, k2, k3, k4) | Init. pos. (x0, y0) | Init. moment. (Px0, Py0) | Mass (m)
SHM | 1, 1, 1, 1 | 1, 1, 1, 1 | −0.2, −0.2 | 0.1, 0.1 | 1.0
Nonlinear dynamics | 1, 2, 3, 4 | 4, 3, 2, 1 | −0.2, 0.1 | 0.1, −0.2 | 1.0
3.2 Pendulum
The pendulum is a classical physical system whose dynamics have been studied for a long time. Figure 2 shows a simple gravity pendulum with angle θ, a slender rod of length L, a pendulum bob of mass m, and angular velocity ω = dθ/dt.
Fig. 2. Pendulum motion
The simple gravity pendulum [22] exhibits harmonic motion without any friction or external forces and is governed by the simple second-order differential equation (1). We may solve Eq. (1) numerically by specifying a small enough time step. The result is a sequence of angles θ and angular velocities dθ/dt, one pair for each time step of the pendulum simulation time span. To simulate a damped pendulum, the differential equation (1) is extended with an air resistance or friction term, which is linear or nonlinear depending on the scenario. Equation (7) shows linear air resistance/friction integrated into the motion to slow the pendulum down gradually:
\[ \frac{d^2\theta}{dt^2} + \mu\frac{d\theta}{dt} + \frac{g}{L}\sin(\theta) = 0 \tag{7} \]
where μ dθ/dt is the linear air resistance/friction. The air resistance/friction may also be nonlinear, represented as a polynomial function such as μ_2 (dθ/dt)^2 + μ_1 dθ/dt, in which case Eq. (7) becomes Eq. (8):
\[ \frac{d^2\theta}{dt^2} + \mu_2\left(\frac{d\theta}{dt}\right)^2 + \mu_1\frac{d\theta}{dt} + \frac{g}{L}\sin(\theta) = 0 \tag{8} \]
Besides gravity and resistance, an external force may interfere with the pendulum motion, which creates a non-harmonic oscillator. The external force f(θ) can come from a motor or wind and varies with the pendulum's angle. The differential equation (8) is then extended to the non-homogeneous differential equation (9), with the external force given in Eq. (10):
\[ \frac{d^2\theta}{dt^2} + \mu_2\left(\frac{d\theta}{dt}\right)^2 + \mu_1\frac{d\theta}{dt} + \frac{g}{L}\sin(\theta) = f(\theta) \tag{9} \]
and
\[ f(\theta) = \frac{6\cos(\theta)}{mL^2} \tag{10} \]
where m is the mass of the pendulum bob and f is the external force driven by wind or a motor. In Sect. 3.3, we assume that the external force f(θ) is independent of time. However, it could be time-dependent, f(t, θ), as in the example detailed in Sect. 3.4, or even stochastic.
3.3 Simulated Pendulum
It is challenging or even impossible to solve the nonlinear dynamics equations analytically, since the problem is hard to simplify or to divide and conquer. Fortunately, we can solve the problem approximately with numerical methods such as the finite difference method (FDM), the finite element method (FEM), or the finite volume method (FVM). The differential equations described in Sect. 3.2 can be solved using the numerical ODE solvers implemented in SciPy or Julia. These solvers offer many numerical algorithms for simulating the temporal behavior of the pendulum dynamics, that is, the pendulum's angle (θ) and its angular velocity (ω) over a time frame. This work uses the Julia [3] programming environment and its DifferentialEquations.jl package [28] to solve the ODEs for the pendulum simulations. Julia provides a high-level programming interface similar to Matlab/Python with scalable performance on both CPUs and GPUs. It includes a rich set of computational packages for differential equations (ODEs/PDEs), linear algebra, optimization, automatic differentiation, and dynamical systems, as well as data science packages such as its machine learning package Flux and Boltzmann Machines. The Julia code that defines the pendulum ODE and its initial values is listed in Fig. 3. The ODE solver uses the Tsit5 algorithm, the Tsitouras 5/4 Runge-Kutta method with a free fourth-order interpolant, which is efficient and accurate in solving the pendulum ODE. The code defines a non-homogeneous differential equation with an external force and polynomial friction. The initial value of θ is π/2 and the initial angular velocity ω is 0. The time span is set from 0 to 10 s with a time step of 0.1 s, which yields 101 samples of (θ, ω).
Fig. 3. Pendulum motion ODE code
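For readers without a Julia environment, the same setup can be sketched in plain Python. This is not the code of Fig. 3: it uses a hand-written classical RK4 integrator instead of Tsit5, and the values of g, L, m, μ1, and μ2 are assumptions chosen only for illustration, since the text does not state them. The torque amplitude 6 is read from Eq. (10), and the initial state and sampling follow the description in the text (θ = π/2, ω = 0, 0 to 10 s in steps of 0.1 s).

```python
import math

def pendulum_rhs(theta, omega, g=9.81, L=1.0, m=1.0, mu1=0.1, mu2=0.05, torque=6.0):
    """Eq. (9) with the forcing of Eq. (10), as a first-order system.

    g, L, m, mu1 and mu2 are assumed values, not taken from the paper.
    """
    forcing = torque * math.cos(theta) / (m * L ** 2)
    domega = forcing - mu2 * omega ** 2 - mu1 * omega - (g / L) * math.sin(theta)
    return omega, domega

def rk4_step(theta, omega, h):
    """One classical fourth-order Runge-Kutta step of size h."""
    k1t, k1w = pendulum_rhs(theta, omega)
    k2t, k2w = pendulum_rhs(theta + 0.5 * h * k1t, omega + 0.5 * h * k1w)
    k3t, k3w = pendulum_rhs(theta + 0.5 * h * k2t, omega + 0.5 * h * k2w)
    k4t, k4w = pendulum_rhs(theta + h * k3t, omega + h * k3w)
    theta += h * (k1t + 2.0 * k2t + 2.0 * k3t + k4t) / 6.0
    omega += h * (k1w + 2.0 * k2w + 2.0 * k3w + k4w) / 6.0
    return theta, omega

def simulate(theta0=math.pi / 2, omega0=0.0, t_end=10.0, dt=0.1, substeps=10):
    """Return (t, theta, omega) samples every dt seconds, including t = 0."""
    samples = [(0.0, theta0, omega0)]
    theta, omega = theta0, omega0
    n = round(t_end / dt)
    for i in range(1, n + 1):
        for _ in range(substeps):  # finer internal steps for accuracy
            theta, omega = rk4_step(theta, omega, dt / substeps)
        samples.append((i * dt, theta, omega))
    return samples

samples = simulate()
print(len(samples))
```

Plotting ω against θ from `samples` gives a phase space view of the motion, and plotting both against time gives the temporal view.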
Figure 4(a) shows the phase space of the pendulum dynamics in terms of θ and ω for 10 s of motion with nonlinear friction and external torque. Figure 4(b) illustrates how the pendulum angle θ and the angular velocity ω change over time during the 10 s of simulated pendulum motion. Due to the external torque and friction, the motion is non-harmonic and gradually leans toward the right-hand side.
3.4 Simulation of Wind Forced Pendulum
In this simulation, a quick air flow is used to start the oscillations of the pendulum. The goal of the experiment is to infer the time profile of the air flow pulse from the simulated measurements of the angular position of the pendulum (Fig. 5). We assume that the drag force that acts on the pendulum is proportional to the relative velocity of the pendulum with respect to the air, according to Stokes' law: $\vec{F}_d = b(\vec{u} - \vec{v})$
(a) Pendulum Phase Space in 10 Seconds (b) Pendulum Motion Simulation in 10 Seconds
Fig. 4. Pendulum motion simulation shown in the phase space
Fig. 5. A pendulum forced to oscillate by a quick air blow of wind
where the drag coefficient is b = 6πηr for a spherical object of radius r, and η is the viscosity. The air flow is assumed to be uniform and oriented along the x-axis, $\vec{u} = u(t)\hat{x}$. Under these assumptions, the differential equation for the pendulum is
\[ \frac{d^2\theta}{dt^2} = -\frac{g}{L}\sin\theta - \frac{b}{m}\frac{d\theta}{dt} + \frac{b}{mL}u(t)\cos\theta \tag{11} \]
with initial conditions θ(0) = dθ/dt(0) = 0. The solution θ(t) depends on the airflow profile u(t) and the drag coefficient b as external parameters, while the gravitational acceleration g and the length L of the pendulum are assumed to be known. Given a set of measurements of time and angular position ({t_k, θ_k}, k = 1, 2, ..., N), the unknown airflow profile, as well as the drag coefficient, can be inferred by minimizing the loss function
\[ L(u(t), b) = \frac{1}{2}\sum_{k=1}^{N}\left(\theta(t_k) - \theta_k\right)^2 \]
This can be achieved by using a Conjugate Gradient Descent method in which better candidates for b and u(t) are calculated at each iteration as
\[ b \to b' = b - \eta\frac{\partial L}{\partial b}, \qquad u(t_k) \to u'(t_k) = u(t_k) - \eta\frac{\partial L}{\partial u(t_k)} \]
where η is a small, appropriately chosen learning rate (not to be confused with the viscosity above), and the airflow profile is discretized at the same temporal points t_k for convenience. The partial derivatives of L with respect to b and u(t_k) are obtained in turn as
\[ \frac{\partial L}{\partial b} = \sum_{k=1}^{N}\left(\theta(t_k) - \theta_k\right)\frac{\partial\theta}{\partial b}(t_k) \]
and
\[ \frac{\partial L}{\partial u(t_k)} = \sum_{j=1}^{N}\left(\theta(t_j) - \theta_j\right)\frac{\partial\theta}{\partial u(t_k)}(t_j) \]
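The sensitivity ∂θ/∂b that enters these sums can be obtained by propagating derivatives through the time stepping itself. The Python sketch below illustrates one way to do this with a minimal hand-written dual-number class and Heun's method applied to Eq. (11). It is not the ForwardDiff-based Julia code used for the paper; the class `Dual`, the helper names, the step size, and the integration time are assumptions, while the pendulum and wind pulse values (1.0 kg, 1.0 m, b = 0.25 kg/s, Gaussian pulse of amplitude 4.0 m/s centered at 2.8 s with standard deviation 0.2 s) are those reported for the experiment.

```python
import math

class Dual:
    """Dual number a + b*eps with eps**2 = 0; the part b carries a derivative."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)
    def __add__(self, o):
        o = Dual.lift(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __sub__(self, o):
        o = Dual.lift(o)
        return Dual(self.a - o.a, self.b - o.b)
    def __rsub__(self, o):
        return Dual.lift(o) - self
    def __mul__(self, o):
        o = Dual.lift(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)
    __rmul__ = __mul__

def dsin(x):
    return Dual(math.sin(x.a), math.cos(x.a) * x.b)

def dcos(x):
    return Dual(math.cos(x.a), -math.sin(x.a) * x.b)

def wind(t, amp=4.0, center=2.8, width=0.2):
    """Gaussian air pulse u(t)."""
    return amp * math.exp(-0.5 * ((t - center) / width) ** 2)

def accel(theta, omega, t, b, g=9.81, L=1.0, m=1.0):
    """Right-hand side of Eq. (11); theta, omega and b are dual numbers."""
    return (-(g / L) * dsin(theta)
            - (b * (1.0 / m)) * omega
            + (b * (1.0 / (m * L))) * wind(t) * dcos(theta))

def solve(b_value, t_end=6.0, h=0.01):
    """Heun's method; returns theta(t_end) with d theta / d b in the dual part."""
    b = Dual(b_value, 1.0)               # seed: derivative of b with respect to b is 1
    theta, omega = Dual(0.0), Dual(0.0)  # initial conditions with zero dual part
    for i in range(round(t_end / h)):
        t = i * h
        a1 = accel(theta, omega, t, b)
        theta_p = theta + h * omega      # Euler predictor
        omega_p = omega + h * a1
        a2 = accel(theta_p, omega_p, t + h, b)
        theta, omega = (theta + 0.5 * h * (omega + omega_p),
                        omega + 0.5 * h * (a1 + a2))
    return theta

theta_end = solve(0.25)
print(theta_end.a, theta_end.b)          # angle and its sensitivity to b
```

Seeding the dual part of one discretized value u(t_k) instead of b would yield the sensitivity with respect to that airflow sample in the same way, at the cost of one integration per parameter.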
The sensitivities of the ODE solution, ∂θ/∂b and ∂θ/∂u(t_k), can be calculated in several ways [25]. For our example, we used the forward differentiation package ForwardDiff [30], which employs dual numbers [12] during the iterative calculation of the solution of Eq. (11). Each time step of the iteration is computed with Heun's modification of Euler's method [9]. At the start of the integration, all parameters are set as dual numbers with a zero dual part, except for the parameter whose sensitivity is required, whose dual part is set to 1. The solution obtained at the grid points t_k then consists of dual numbers whose main part represents the solution and whose dual part represents the sensitivity of the solution with respect to the chosen parameter. The advantage of this approach is that all calculations are done in place with modest memory requirements.
Figure 6 shows the results of our experiment. We simulated a 1.0 kg pendulum of length 1.0 m with a drag coefficient of b = 0.25 kg/s. The pendulum is forced to oscillate by a short Gaussian air pulse of amplitude 4.0 m/s, centered at t = 2.8 s, with a standard deviation of 0.2 s. Starting from random guesses for b and u(t), the procedure converges toward the anticipated values. At every time step, only positive values of u(t) are allowed. The convergence is slow, but it can be accelerated by using more refined strategies such as ADAM or RMSProp [15].
3.5 Physical Experimental Pendulum
Besides the simulation, we also recorded a one-minute video of the pendulum experiment shown in Fig. 7(a). In the experiment, we measured the mass of the pendulum bob and the length of the pendulum. The angle θ and the angular velocity ω were calculated using image processing algorithms. The friction is unknown in the experiment. To process the experiment video, we first extract the frames from the video, which was recorded at a frame rate of 60 frames per second. We then apply the blob detection algorithm from the scikit-image image processing package,
Fig. 6. Left panel: comparison between the exact and calculated airflow profiles, and angle vs. time (inset). Right panel: convergence of the loss function.
(a) Experimental Pendulum Video
(b) Labeled Experimental Pendulum Video
Fig. 7. Pendulum experiment recorded in video
which detects the coordinates of the pendulum bob and center. The Determinant of Hessian (DoH) algorithm for blob detection gives us the best performance and fewer false positives. Figure 7(b) shows the detected coordinates of the pendulum center and bob. The detected coordinates are used to compute the angle θ and the angular velocity ω based on the geometry and the prior state. The result is a collection of 3600 pendulum states (angle and angular velocity) over one minute, with a time step of 1/60 s.
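The step from detected coordinates to pendulum states can be sketched as follows. This is an illustration rather than the processing code used for the experiment: it assumes image coordinates in pixels with the y-axis pointing downward, a fixed pivot, and a backward difference with the previous frame for the angular velocity. The function names are hypothetical.

```python
import math

def angle_from_vertical(pivot, bob):
    """Angle of the rod from the downward vertical, in radians.

    Assumes image coordinates (x to the right, y downward).
    """
    dx = bob[0] - pivot[0]
    dy = bob[1] - pivot[1]
    return math.atan2(dx, dy)

def states_from_track(pivot, bobs, fps=60.0):
    """Turn a sequence of bob positions into (theta, omega) pairs.

    omega is a backward difference with the prior state; the first frame
    has no prior state, so its omega is set to 0.
    """
    dt = 1.0 / fps
    thetas = [angle_from_vertical(pivot, b) for b in bobs]
    states = []
    prev = thetas[0]
    for theta in thetas:
        states.append((theta, (theta - prev) / dt))
        prev = theta
    return states

# Tiny synthetic track: the bob hangs straight down, then swings to 45 degrees.
states = states_from_track((0, 0), [(0, 1), (1, 1)])
print(states)
```

With one position per frame of a one-minute video at 60 frames per second, the same function returns one state per frame.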
4 Learning the Nonlinear Dynamics with Scientific Machine Learning
Section 3 showed the simulation results of the pendulum's nonlinear dynamics under the assumption that the friction and external torque functions are known. Can SciML augment the scientific simulation by using the collected data set? In a real-world experiment, we may not know some of these functions, but we can collect the motion data (θ and ω) from experiments (Sect. 3.5). The question is whether a SciML model can learn the unknown nonlinear dynamics hidden in these systems.
4.1 What Do These SciML Models Learn?
In the pendulum study, we know that the motion of the simple harmonic pendulum is governed by Eq. (1). The initial conditions and parameters include the pendulum angle, the angular velocity, the length, the mass, and the constant gravity. All of them can be measured to determine the motion. In reality, what we do not know at the beginning is the friction and the external force in Eqs. (9) and (10). The SciML model therefore uses a neural network only for the friction and external force functions and learns these two functions from the recorded data, while the ODE solver calculates the harmonic motion.
\[ \frac{d^2\theta}{dt^2} + \frac{g}{L}\sin(\theta) = N_p\left(\frac{d\theta}{dt}, m, L\right) \tag{12} \]
Equations (9) and (10) are revised as the new Eq. (12), in which N_p(dθ/dt, m, L) is a neural network with four inputs and one output that learns the friction and external torque. N_p is a four-layer fully connected neural network with 4-64-64-1 neurons in its layers, and it uses the hyperbolic tangent (tanh) as its activation function. The software package used in the paper is DiffEqFlux, one of the SciML packages implemented in the Julia software stack. The neural network is trained on a small data set from the pendulum simulation with a time span of [0, 10] and a time step of 0.1, which gives us 101 samples. The training starts with the Adam optimizer for the first 100 iterations and then switches to the BFGS optimizer. Figure 8 shows the loss values obtained with these two optimizers. In this experiment, the BFGS optimizer learns the function faster than the Adam optimizer.
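To make the structure of Eq. (12) concrete, the following Python sketch builds a 4-64-64-1 fully connected network with tanh activations and uses it as the correction term on the right-hand side of the ODE. It is only a structural illustration, not the DiffEqFlux model: the weights are random and untrained, the output layer is assumed to be linear, and the fourth network input is assumed to be θ, since the text lists dθ/dt, m, and L but states that there are four inputs. All function names are hypothetical.

```python
import math
import random

def init_mlp(sizes=(4, 64, 64, 1), seed=0):
    """Random, untrained weights for a fully connected network."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((W, b))
    return layers

def forward(layers, x):
    """tanh on the hidden layers, linear output layer (an assumption)."""
    for idx, (W, b) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        if idx < len(layers) - 1:
            x = [math.tanh(v) for v in x]
    return x[0]

def n_params(layers):
    """Number of trainable weights and biases."""
    return sum(len(W) * len(W[0]) + len(b) for W, b in layers)

def ude_rhs(theta, omega, layers, g=9.81, L=1.0, m=1.0):
    """Eq. (12) as a first-order system: known physics plus learned correction."""
    correction = forward(layers, [theta, omega, m, L])  # theta as 4th input is assumed
    return omega, -(g / L) * math.sin(theta) + correction

layers = init_mlp()
print(n_params(layers))
print(ude_rhs(math.pi / 2, 0.0, layers))
```

During training, an ODE solver would integrate `ude_rhs` over [0, 10], the result would be compared with the 101 recorded samples, and the optimizer would adjust the weights in `layers`; the known gravity term itself is never learned.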
Fig. 8. Loss values during the UDE training
Once the loss value becomes small enough (
(S2), then the tuple is called a normal representation. If and there is some and some j ∈ p2^{−1}(i)
We call (f, p1, p2, p) an entangled representation of T. Any variator T : IS0 → IS1 evidently has an x-variator representation
That is, for any I ∈ IS0 we have
Corollary 1. Any given variator T is an x-variator. The representation of a variator as an x-variator is not unique.
Theorem 2 (Impossibility of Slicing Theorem). Given a variator T, if for any representation (f, p1, p2, p) of T, that the condition holds implies that the representation is entangled, then T is not sliceable.
W. Pan
Proof. It is clear.
Definition 9. We say a tensor variator T : IS1 → IS2 is weakly sliceable if and only if T has a normal representation (f, p1, p2, p), such that
Let's see another example. The variator
\[ E_3 = \left[ \begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} 0 & 0 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 1 & 1 \end{bmatrix} \right] \]
is weakly sliceable, because there are picks p_3_1 = [0], p_3_2 = [1], p_3 = [0 2 1] and a variator f_3 whose provision tensor is
Definition 10. Given a tensor variator T : IS0 → IS1 which has a normal representation (f, p1, p2, p), where p is a shuffle. Let F be the provisioner tensor of the variator f : Ip1(S0) → IS2, and let V be a tensor with shape S0; then (F, p1, p2, p, V) is called an x-sparse tensor representation, or simply an x-sparse tensor. An x-scattering is a pair e = (A, X), where A = (F, p1, p2, p, V) is an x-sparse tensor and X is a tensor. A result of the x-scattering e is a tensor B defined as follows: for any J, if J ∈ T(IS0), then there is some I ∈ IS0 such that J = p(J0), where J0 = p1(I) + p2(I), and B[J] = V(I); if J ∈ ISB and J ∉ T(IS0), then B[J] = X[J] holds. The result tensor of an x-scattering also cannot be determined with certainty. We implemented x-scattering in Python [7] and call it the scatterX API. C++ and CUDA code for x-scattering is also provided to move slices in parallel and to utilize CUDA streams.
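The two cases in the definition of the result tensor B can be illustrated with a plain-Python sketch. It deliberately ignores the decomposition (F, p1, p2, p) and assumes that the variator is given explicitly as a list of (source index, target index) pairs, with tensors represented as dictionaries from index tuples to values. The function name `scatter_result` is hypothetical and is not the scatterX API of [7].

```python
def scatter_result(pairs, V, X):
    """Return B with B[J] = V[I] for each (I, J) in pairs and B[J] = X[J] elsewhere."""
    B = dict(X)           # start from the background tensor X
    for I, J in pairs:
        B[J] = V[I]       # overwrite the targets reached by the variator
    return B

X = {(0,): 10, (1,): 11, (2,): 12}   # background tensor, indexed by tuples
V = {(0,): 1, (1,): 2}               # values to scatter
B = scatter_result([((0,), (2,)), ((1,), (0,))], V, X)

# Two sources mapped to the same target: the result depends on the processing order.
B_first = scatter_result([((0,), (1,)), ((1,), (1,))], V, X)
B_second = scatter_result([((1,), (1,)), ((0,), (1,))], V, X)
print(B, B_first[(1,)], B_second[(1,)])
```

The last two calls show why the result of a scattering need not be unique: when several source indices reach the same target index, the value that survives depends on the order in which the updates are applied, which matters once slices are moved in parallel.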
Tensor Data Scattering and the Impossibility of Slicing Theorem
7.3 Counting Sparsity and Analyzing Performance
A dense tensor E with shape S has an x-sparse tensor representation . Once we randomly remove a few elements from E and get a sparse tensor E′, then E′ has an x-sparse representation where indices is a provisioner of a variator and V is a one-dimensional tensor which contains the elements of E′. Thus, the inner variator in an x-sparse tensor indicates the efficiency of storing sparse indices.
Definition 11. Given an x-sparse tensor X = (F, p1, p2, p, V), the sparsity of the x-sparse tensor is defined as
Now we can count the sparsity of former examples:
A sparsity of 1 means that the x-sparse tensor can hardly be used in parallel. An x-sparse tensor with a smaller sparsity is more likely to be usable in parallel.
7.4 Mocking Current Scattering APIs
The counterpart scattering of the TensorFlow scatter API as in Sect. 6 has an x-scattering representation ((indices, p1, p2, p, updates), ts) where
The sparsity
where
It can be any number smaller than 1. In contrast, the counterpart scattering of the PyTorch scatter API as in Sect. 6 has an x-scattering representation ((Eindex, q1, q2, q, src), self) where
The sparsity
This means that the PyTorch scatter API is not sliceable. The key difference between these two kinds of APIs is how the variator in the scattering is formed.
8 Conclusion
Tensor data scattering is a kind of task for which it is difficult to exploit the hardware features of machine learning accelerators. This article analyzes the reasons for this difficulty theoretically and establishes a general theory and algorithm of tensor data scattering. Based on the theories and algorithms in this article, we will be able to implement algorithms that make better use of accelerator features. Moreover, a standard approach is proposed to represent sparse tensors, which can facilitate parallel computing and data transport in AI accelerators and which also provides a way to store the sparse indices of sparse tensors efficiently. A sparsity measuring formula is provided in the last section, which can effectively indicate the storage efficiency of a sparse tensor and the possibility of using it in parallel. More experiments and comparisons with APIs in other deep learning frameworks require more time. We will continue our work in this area and present the results in the GitHub project [7].
References
1. Soyata, T.: GPU Parallel Program Development Using CUDA. CRC Press, Boca Raton (2018)
2. Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. CoRR abs/1904.10509 (2019). http://arxiv.org/abs/1904.10509
3. TensorFlow API: tf.tensor_scatter_nd_update. https://www.tensorflow.org/api_docs/python/tf/tensor_scatter_nd_update
4. PyTorch Docs: torch.Tensor.scatter. https://pytorch.org/docs/stable/tensors.html?Highlight=scatter#torch.Tensor.scatter
5. Harris, C.R., Millman, K.J., van der Walt, S.J., et al.: Array programming with NumPy. Nature 585, 357–362 (2020)
6. Zhang, T., Liu, X., Wang, X., Walid, A.: cuTensor-tubal: efficient primitives for tubal-rank tensor learning operations on GPUs. IEEE Trans. Parallel Distrib. Syst. 31(3), 595–610 (2020)
7. Algebraic Tensor Project. https://github.com/wmpan/AlgebraicTensor
Scope and Sense of Explainability for AI-Systems
A.-M. Leventi-Peetz1(B), T. Östreich1, W. Lennartz2, and K. Weber2
1 Federal Office for Information Security, BSI, Bonn, Germany [email protected]
2 inducto GmbH, Dorfen, Germany [email protected]
Abstract. Certain aspects of the explainability of AI systems are critically discussed, with a particular focus on the feasibility of making every AI system explainable. Emphasis is given to difficulties related to the explainability of highly complex and efficient AI systems which deliver decisions whose explanation defies classical logical schemes of cause and effect. AI systems have demonstrably delivered unintelligible solutions which in retrospect were characterized as ingenious (for example move 37 of game 2 of AlphaGo). Arguments are elaborated in support of the notion that if AI solutions were to be discarded in advance because they are not thoroughly comprehensible, a great deal of the potential of intelligent systems would be wasted.
Keywords: Artificial Intelligence (AI) · Machine Learning (ML) · Explainable AI (XAI) · Chaos · Criticality · Attractors · Echo State Networks (ESN) · Time series · Causality
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 291–308, 2022. https://doi.org/10.1007/978-3-030-82193-7_19
1 Introduction
The next generation of AI systems is expected to extend into areas that correspond to human cognition, such as real-time contextual interpretation of events and autonomous system adaptation. AI solutions are mostly based on neural network (NN) training and inference developed on deterministic views of events that lack context and commonsense understanding. Many successful developments have been made in the direction of explainable AI algorithms, while further advancements in AI will still have to address novel situations and abstraction in order to automate ordinary human activities [15]. Various approaches already exist to explain the results of machine-learning systems (ML systems); there are methods and tools which can interpret and verify, for example, classification results and decisions produced by sophisticated, complex ML systems. The explanations vary with the task and with the method which ML systems employ to reach their results. The aim of this work is to give a short but not exhaustive report on known ambiguities, shortcomings, flaws and even mistakes which ML explainability methods entail, underlining the association of these problems
to the growing complexity of the systems which have to be interpreted. Furthermore, the necessity of taking chaos-theoretical approaches to ML into account will be discussed, together with some implemented examples which demonstrate the potential of this new direction. In conclusion, the doubt will naturally arise as to whether it is possible, or whether it makes sense, to pursue the intention of finding ways to make every ML system explainable. In the section following this introduction, the importance of making AI explainable is emphasized by reference to some prominent applications of AI systems which directly imply the necessity of understanding the reasoning behind machine-made decisions. In this context explainability is seen as a requirement of trustworthy AI. In Sect. 3, the difference between the explainability of rule-based systems of the first generation of AI and that of modern ML systems is emphasized. Technical aspects of feature-based explainability methods for advanced, dynamically adaptive deep learning systems are discussed in Sect. 4, with a focus on the evaluation of a number of recent improvements introduced to increase the reliability of explanations. In Sect. 5, examples are given to justify the comparison of the behavior of ML systems with the behavior of chaotic systems, whose results depend sensitively on their initial conditions. In the following section, the advantages of using echo state networks and reservoir computing as a computationally efficient and competitive alternative to deep learning methods are considered, especially with respect to their ability to simulate both deterministic and chaotic systems. In Sects. 6 and 7, the proof of causality in ML results is presented as an indispensable part of any sound explanation of ML-supported decisions.
At the same time, references are given to scientific work which asserts that the problem of assigning causation in observational data has not yet been solved. Some AI specialists describe XAI as brittle, easy to fool, unstable or wrong. The conclusions pose the question of whether it is absolutely necessary to make all AI systems interpretable in the first place. At the present state of development, interpretability does not necessarily contribute to the trustworthiness of AI systems.
2 Superhuman Abilities of AI
Of crucial importance is the application of AI in so-called safety- and security-critical systems, for example in transportation and medicine, where there is very little or zero tolerance for machine errors. For instance, the interpretation of ML models employed in computer-aided diagnosis (CADx) to support cancer detection on the basis of digital medical images often amounts to the recognition of certain patterns formed by pixels in the images [28]. These patterns are combinations of so-called features (for example gray levels, texture, shapes etc.) which the algorithm has extracted from the test image in order to infer a result. The term inference means “make a prediction on the basis of experience”, in this case the experience which the model has gained during its training phase by exploiting information stored in large labeled datasets. This is a case of supervised learning, which has taught the model to discern between pathological and normal forms. The increasing accuracy of imaging methods calls for an
increase in the accuracy and reliability of the algorithmic predictive mechanisms. Imaging examination no longer has only a qualitative and purely diagnostic character; it now also provides quantitative information on disease severity, as well as the identification of so-called biomarkers of prognosis and treatment response. ML systems are tasked with complementing diagnostic imaging and supporting the therapeutic decision-making process. There is a move toward the rapid expansion of the use of ML tools and leading radiology in the daily life of physicians, treating each patient as unique, in line with the concept of a multidisciplinary approach and precision medicine. The move from well-established predictive analysis to so-called prescriptive analysis, which expects systems to be even more efficient and in a way smarter, is gaining strength. The quality of these systems concerns not only the health sector but also industry and the economy, for example with regard to the emergence of smart factories and the approaching realization of the fourth industrial revolution (Industry 4.0) with the planning of self-organizing intelligent systems, that is, systems which can by themselves anticipate and find solutions for suddenly arising and most probably also unforeseen problems. This new generation of system automation will probably have an enormous social and economic impact worldwide. People and societies will have to rely on the decisions and the advice of machines to organize life. But can the advice and decisions of machine systems eventually be trusted completely, even without the final approval of a reviewing committee of human experts? Could they be regarded as reliable and secure? Could people perhaps trust these systems if their behavior becomes somehow explainable? In this case, could the development of adequate norms and criteria for what machine explanations should look like be enough to inspire trust? And who should be able to comprehend these explanations?
These questions have received a great deal of attention in recent years and will remain in the focus of research for a long time to come. Explainability has received special attention ever since AI algorithms managed to reach what are called superhuman abilities. People have realized that they can develop systems that are not only faster at solving problems but can also do better, because they can find solutions which no expert has been able to find so far. One may recall the famous creative and unique move 37 of game 2 of AlphaGo, which AlphaGo itself evaluated as having a probability of close to one in ten million of being played by a human [5]. Experts have been asked about the implications of this kind of creativity. Some of them attributed the move to clever programming and not to creativity of the software. In other articles, the advancement from AlphaGo to AlphaGo Zero (a program that can win a game without any use of information based on human experience) has been seen as an example of AI becoming self-aware and creating its own AI which is as smart as itself, if not smarter. Experience shows that experts in general cannot always make the explanations of their decisions understandable, not even to fellow experts. However, it is expected that the self-awareness of AI systems should enable them to explain their decisions to humans. In fact, much higher demands are made on AI systems than on humans when they have to make decisions. In autonomous driving, for example, it is expected that the
technology must be at least 100 times better than humans, according to Prof. Trapp of Fraunhofer ESK [29].
3 Forms of Explainability
The rule-based systems, or expert systems, of the first generation of AI were deterministic. Their intelligence was fixed, following a definite series of rules and instructions, and their inference was based on Boolean or classical logic. The explanation of the decisions of those systems was the demonstration of the inference rules that led to a decision. But these systems followed rules which were determined by humans. They were as causal, fair, robust, trustworthy and usable as their developers had made them. These systems would not change or update on their own, and they would not learn from mistakes. They simulated AI, but for many experts they were not true AI systems. The first so-called reason-tracing explanations said nothing about the system’s general goals or resolution strategy [12]. Knowledge of the problem to be solved offers advantages if it is expressed in a form that computers can handle; this fact motivated domain experts, so-called knowledge engineers, to encode experts’ advice in the form of associational (also referred to as heuristic or empirical) rules that mapped observable features (evidence) to conclusions. For a large portion of real-world problems it is significantly easier to collect data and identify a desirable behavior than to explicitly write a program, as Karpathy aptly stated (2017) [22]. ML systems, nowadays powered by NNs and deep learning, shifted the paradigm from one in which the programmer must provide rules and inputs in order to obtain results to one in which specialists and non-specialists can provide inputs and results in order to derive rules. The promise of this approach is that learned rules can be applied to many new inputs without requiring that the user have the expertise needed to derive results. This is sometimes also regarded as the democratization of AI.
The motivation in this respect is that representing knowledge in datasets is much easier than having to provide methods for encoding and manipulating symbolic knowledge, because updating and improving learning systems can then be done more smoothly as the datasets grow and evolve over time. Furthermore, rule-based systems are of no help in solving problems in complex domains, and there are many cases (e.g., cancer detection in medical images) where rules cannot be defined explicitly in a programmatic or declarative way. The hope of AI research is to implement general AI by creating autonomously learning systems. These systems should ultimately become unlimited in their ability to simulate intelligence, and they should be able to demonstrate all signs of an adaptively growing intelligence: previous knowledge should be modified, or eliminated if it is no longer needed, while new knowledge should be gained continuously. Hence, these systems should be able to build and update their rules actively on the fly. This is the difference between ML systems and rule-based ones. Neural networks are instances of learning systems. A learning system implements a utility function representing the difference between the system’s prediction
and reality, and this difference is minimized, for example, with the help of optimization techniques which change the system’s parameters. These optimization techniques (e.g., gradient descent, stochastic gradient descent) are in fact rule-based techniques, because they just compute the gradients needed to adjust the weights and biases in order to optimize the utility function. The approach to the calculation varies considerably (e.g., between supervised and unsupervised learning). The learning process is deterministic (including the statistical and probabilistic part of the method); however, it is practically impossible to describe the learning system with a model, because this would involve millions of dynamic parameters (e.g., weights, biases) which make the internal system processes untraceable. Their enormous complexity makes learned systems very hard to explain, so that they can hardly be understood by humans [32]. It cannot become entirely clear how trained systems make their decisions. That is the dark secret at the heart of learning systems according to Will Knight, Senior Editor of MIT Technology Review (footnote 1). According to Tommi Jaakkola (MIT, Computer Science; footnote 2), this is already a major problem for many applications: whether it is an investment decision, a medical decision, or a military decision, one does not want to rely on a black box alone. The European Union issued the EU General Data Protection Regulation (GDPR) [14], which practically establishes a right to explanation. Citizens are entitled to ask for an explanation of algorithmic decisions made about them. The question arises whether the GDPR will become a game changer for AI technologies. The consequences of this regulation are not yet really clear. It remains to be seen whether such a law is legally enforceable. It is not clear whether the law is more a right to be informed than a right to explanation. Therefore, the impact of the GDPR on AI is still under dispute.
For the explainability of NN models, a large body of work focuses on post-training feature visualization to qualitatively understand the dynamics of the NNs. The following properties are important for explanations:
– Causality
– Fairness
– Robustness and Reliability
– Usability
– Trust
There have been long discussions about biased decisions; the famous husky which was misidentified as a wolf because of the snow in the background of the picture is known to almost everybody. Bias in the data is a serious issue, especially because, as experts point out, algorithms tend to amplify existing biases: they actually learn from differences, and any difference can under certain circumstances become a bias in the process of learning. However, one cannot discard the possibility that even if all training datasets were balanced, so that no biases were possible, there could still exist some kind of bias in the opinion of the users who are meant to understand the algorithm’s interpretation and judge the algorithmic fairness. There are many subtleties involved in interpretation which should be of concern in parallel with the technical refinement of algorithms and software.
1 https://www.technologyreview.com/author/will-knight/
2 https://people.csail.mit.edu/tommi/
4 Complex Dynamical Systems
Learning setups cannot always be static. The necessity of learning in continuous time from continuous data streams, to which online learning also belongs, has established incremental learning strategies to account for situations in which training data become available in sequential order. The best predictor for future data is updated at each step, as opposed to batch learning techniques, which generate the best predictor by learning on the entire training dataset at once. Online learning algorithms are also known to be prone to so-called catastrophic interference, which is the tendency of an artificial NN to forget previously learned information completely and abruptly upon learning new information. This is the well-known stability-plasticity dilemma [16]. An algorithm has to adapt dynamically to new patterns in the data when the data itself is generated as a function of time, e.g., in stock price prediction. In time series forecasting, a model is employed to predict future values on the basis of a previously performed time series analysis and the values observed in it; that is, historical data is used to forecast the future. Such predictions are delivered together with confidence intervals (CI) that reflect the confidence level of the prediction. The size of the sample and its variability are among the factors which affect the width of the confidence interval, as is the confidence level, usually set at 95% [4]. A larger sample will tend to produce a better estimate of the population parameter when all other factors stay unchanged. However, NNs belonging to specific settings do not provide a unique solution, because their performance is determined by several factors, such as the initial values, usually chosen randomly from a distribution, the order of the input data during the training cycle, and the number of training cycles [10,19,27].
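The dependence of the confidence interval width on sample size and confidence level described above can be illustrated with a minimal sketch. It assumes a normal-approximation interval for a sample mean; the function name `mean_confidence_interval` and the synthetic Gaussian data are hypothetical choices made for illustration only, and forecasting tools may compute their intervals differently.

```python
import math
import random
import statistics


def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the mean of `sample`."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error shrinks with the square root of the sample size.
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value (about 1.96 for a 95% level).
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    return mean - z * std_err, mean + z * std_err


# Synthetic data (assumption): Gaussian values with mean 100 and std 15.
rng = random.Random(0)
population = [rng.gauss(100.0, 15.0) for _ in range(10_000)]

low_small, high_small = mean_confidence_interval(population[:50])
low_large, high_large = mean_confidence_interval(population[:5000])

width_small = high_small - low_small
width_large = high_large - low_large
print(f"n=50:   width {width_small:.2f}")
print(f"n=5000: width {width_large:.2f}")
```

With all other factors unchanged, the interval from the larger sample is considerably narrower, which is the effect the paragraph refers to.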
Other variables belonging to the mathematical attributes of a specific NN, such as the learning rate and the momentum, also affect the final state of a trained NN, which leads to a large number of possible combinations. Evolutionary algorithms have been proposed to find the most suitable design of NNs in order to allow a better prediction, given the large number of possible combinations of parameters. Many different NNs can also be trained independently with the same set of data, so that an ensemble of artificial NNs is created whose members have a similar average performance but a different predisposition to make mistakes at the level of their individual predictions [7]. If one needs to estimate a new patient’s individual risk, for example in cardiovascular disease prediction, or the riskiness of a single stock, or if one must classify the danger of some unknown data traffic pattern that might hide a cyber attack, a set of independent NN models acting simultaneously on the same problem should be of advantage. An ensemble of models typically performs better than any individual model because the various errors of the models average out; ensembles have therefore dominated recent ML competitions [8,17]. Using model ensembles also
requires a much longer training time than training only one model. Each model is either trained from scratch or derived from a base model in significant ways. All kinds of ensemble methods (concatenated, averaged, weighted etc.) have certain advantages and disadvantages, and a reported accuracy of up to 89% on test data. Explainability refers to the ability of a model or an ensemble of models to explain its decisions in terms of human-observable decision boundaries or features. Should the user get a proof that a different choice of ensemble weights would not have resulted in a different classification in his or her case? What do the decision boundaries look like that resulted in the decision concerning him or her? One can also develop ensembles during fine-tuning operations by dividing the procedure into subtasks. Incremental and active learning remain a field of research aiming at developing recognition and decision systems that are able to deal with new data from known or even completely new classes by performing learning in a continuous fashion. Active learning and active knowledge discovery are approaches which require continuously changing models. How continuous learning with a series of update steps should be performed robustly and efficiently is a question that still remains open. And how explainable are these models for the user? If one may assume that the parameters of the NN vary smoothly with the time-varying training dataset, one can apply warm-start optimization at each time step, using the parameters of the previous step as the initialization for the current parameters. In this case, network fine-tuning is performed under the assumption that the introduction of new categories is not necessary for the classification of the new data.
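The warm-start idea can be sketched on a deliberately simple stand-in for a network: a linear model fitted by full-batch gradient descent. Everything in the block (the synthetic data, the drift size, the step budget and the helper names `mse`, `gradient_steps` and `make_batch`) is an assumption made for illustration and is not taken from the cited studies.

```python
import numpy as np


def mse(X, y, w):
    """Mean squared error of the linear model X @ w."""
    residual = X @ w - y
    return float(residual @ residual) / len(y)


def gradient_steps(X, y, w_init, steps=5, lr=0.1):
    """Run a fixed budget of full-batch gradient steps on the MSE."""
    w = w_init.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


rng = np.random.default_rng(0)
n_features = 5
w_true = np.array([2.0, -1.0, 0.5, 3.0, -2.0])  # hypothetical ground truth


def make_batch(w):
    """Draw a synthetic batch generated by the linear relation w (plus noise)."""
    X = rng.normal(size=(200, n_features))
    y = X @ w + 0.05 * rng.normal(size=200)
    return X, y


# Time step t: train from a cold start until (near) convergence.
X_old, y_old = make_batch(w_true)
w_old = gradient_steps(X_old, y_old, np.zeros(n_features), steps=500)

# Time step t+1: the underlying relation has drifted only slightly.
X_new, y_new = make_batch(w_true + 0.01 * rng.normal(size=n_features))
w_warm = gradient_steps(X_new, y_new, w_old)                  # warm start
w_cold = gradient_steps(X_new, y_new, np.zeros(n_features))   # cold start

loss_warm = mse(X_new, y_new, w_warm)
loss_cold = mse(X_new, y_new, w_cold)
print(f"loss after 5 steps, warm start: {loss_warm:.4f}")
print(f"loss after 5 steps, cold start: {loss_cold:.4f}")
```

Under the smoothness assumption stated in the text, the same small update budget leaves the warm-started model much closer to the new optimum than the model initialized from scratch. The sketch says nothing about the case where new classes appear, which the following paragraph addresses.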
If, however, the new datasets have little or nothing in common with the datasets of the previous step, new classes (known or unknown) have to be added by means of additional nodes at the output layer of the network, together with some new parameters and a new normalization for this network. Questions of convergence under time limitations or data sparsity are in general open. How many layers must be adapted so that a robust solution can be found for real-world and real-time applications? How many SGD (stochastic gradient descent) iterations, for example, would be necessary for each update in order to achieve accuracy without overwhelming computational effort? There exist empirical studies which have investigated various factors, among others the fraction of older to new data to be considered during the SGD iterations so as to avoid overfitting. The dropout technique randomly changes the network architecture to minimize the risk that learned parameters do not generalize well. This method in essence simulates ensembles of models without creating multiple networks. To work well, the dropout technique requires the tuning of hyperparameters such as the learning rate, weight decay, momentum, max-norm and the number of units in a layer, and for a given network architecture and input data it requires experimentation with these hyperparameters. Dropout increases convergence time, and because one needs to train models with different combinations of the hyperparameters that affect model behavior, training time increases further [17]. However, dropout is detrimental to accuracy if used without normalization; therefore normalization techniques have been developed, some also
going beyond batch normalization to account for active learning. On top of this, wrong object labels (label noise) are not completely avoidable in real-world applications, which considerably degrades the accuracy of the results. Researchers have managed to spot changes in a continuously learning deep CNN (convolutional neural network) by visualizing the shift of the mainly attended image regions, for example when a new class is introduced, and by observing the strongest network-filter changes during a single learning step [2]. Visual explanations for DNNs, for example CAM (class activation mapping) or Grad-CAM [3,24], are post hoc: they work on a NN after training is complete and the parameters are fixed, even if only for a short time. The network produces a feature map at its last convolutional layer, and the weights of features, or the gradients with respect to the feature map activations, are calculated post hoc and plotted. The result is a class-discriminative localization map which determines the position of objects of a particular class. However, explainability is not interpretability, and therefore post hoc attention mechanisms, although perhaps helpful for following the reactions of agents in video games, may not be optimal for real-world decisions connected with high risk. Explaining how a model made its decision delivers a chain of results after a sequence of mathematical operations has been applied to the model; this can perhaps help to better understand the functionality of the model, but it does not provide any known rules of the natural world which would make sense to humans. Moreover, model rules do not always translate into unique or comparable decisions, so that finding a way to translate model rules (explainability) into natural-world rules (interpretability for humans) would not be the only problem that has to be solved.
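The core computation behind such localization maps can be sketched as a weighted sum of the last convolutional feature maps. In CAM the per-channel weights come from the final linear layer after global average pooling, while in Grad-CAM they are spatially averaged gradients of the class score; the sketch below simply takes the weights as given. The function name `class_activation_map`, the tiny synthetic feature maps and the final rescaling to [0, 1] are assumptions made for illustration.

```python
import numpy as np


def class_activation_map(feature_maps, class_weights):
    """Combine K feature maps of shape (K, H, W) into one (H, W) map in [0, 1]."""
    # Weighted sum over the channel axis.
    cam = np.tensordot(class_weights, feature_maps, axes=1)
    # Keep only positive evidence for the class (as done in Grad-CAM).
    cam = np.maximum(cam, 0.0)
    # Rescale for plotting (assumption: simple max normalization).
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam


# Hypothetical activations of three filters on a 4x4 grid.
feature_maps = np.zeros((3, 4, 4))
feature_maps[0, 0, 0] = 1.0   # filter 0 fires in the top-left corner
feature_maps[1, 3, 3] = 1.0   # filter 1 fires in the bottom-right corner
feature_maps[2, 1, 2] = 1.0   # filter 2 fires near the center

# Hypothetical class weights: filter 0 supports the class strongly,
# filter 1 weakly, filter 2 speaks against it.
class_weights = np.array([1.0, 0.2, -0.5])

cam = class_activation_map(feature_maps, class_weights)
peak = tuple(int(i) for i in np.unravel_index(cam.argmax(), cam.shape))
print(cam)
print("strongest evidence at", peak)
```

The map shows where the class evidence is located in the feature grid, but, as the text points out, it does not by itself supply rules that are meaningful to humans.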
For instance, studies have demonstrated that the overlap of the features which filters extract in high convolutional layers leads to poor model expressiveness in CNNs. Methods have been developed to remove redundancies and feature ambiguity by inducing bias in the training process and confining each filter’s attention to just one or a few classes. Methods have also been developed to disentangle the middle-layer representations of CNNs so that they correspond to objects and object-part features, in order to assign semantic meanings to filters [35]. Because there is a trade-off between explainability and performance, in real-world applications additional networks, so-called explainability networks, have been implemented and trained in parallel with the original performing networks, with the task of making the performing networks explainable. For the training and testing of explainable filters, benchmark datasets with ground-truth annotations have been employed. In a number of cases the majority of classifications could be attributed to these new filters, but there have also been cases where the performing network achieved better classifications than its corresponding explaining network. The additional computational effort and time associated with the process of feature disentanglement make the concept inapplicable to dense networks or to cases where a great number of features have to be recognized [26]. CNNs use pooling, which is the downsampling of the feature map, to ensure that the CNN recognizes the same object in images of different forms and also to reduce the memory requirements of the model. The pooling operation introduces spatial invariance in CNNs, which is also one of the
major weaknesses of CNNs. Max pooling, for instance, preserves the strongest features, and the feature map is flattened into a column vector to be processed in the NN for further computations. As a result of pooling, CNNs can lose features in images, and a very large amount of training data would be needed to compensate for this weakness. CNNs are also unable to recognize pose, texture and deformations in images or parts of images. CNNs lack equivariance; they rely instead on translational invariance, and can therefore, for example, detect a face in a picture once they have detected an eye, independently of the spatial location of the eye with respect to the other features which usually belong to a face. It seems that the results of a CNN cannot in general be interpreted on the basis of features alone. Capsule networks, or CapsNets, have been proposed as an alternative to CNNs [23]. Their neurons accept and output vectors, as opposed to the scalar values of CNNs. Features can be learned together with their deformations and viewing conditions. In capsule networks, each capsule is made up of a group of neurons, with each neuron’s output representing a different property of the same feature. The output of a capsule is the probability that a feature is present and is delivered together with the so-called instantiation parameters, which express the equivariance of the network, or its ability to keep its decision unchanged regardless of input transformations. The introduction of CapsNets is considered promising for solving real-life problems such as machine translation, intent detection, mood and emotion detection, traffic prediction on the basis of spatio-temporal traffic data expressed in images etc. Even though the training time for CapsNets is better than that of CNNs, it is still not acceptable for time-critical operations and highly unsuitable for online training. Research is currently ongoing in this area.
CapsNets are considered to be explainable by design, because during learning they construct relevance paths that reduce unrelated capsules without the necessity of a backward process for explanation.
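The loss of positional information through pooling, discussed earlier in this section as a weakness of CNNs, can be made concrete with a minimal sketch. It assumes non-overlapping 2x2 max pooling on a single-channel 4x4 input; the helper `max_pool_2x2` and the toy inputs are hypothetical.

```python
import numpy as np


def max_pool_2x2(image):
    """Non-overlapping 2x2 max pooling of an (H, W) array with even H and W."""
    h, w = image.shape
    # Split the image into 2x2 blocks and keep the maximum of each block.
    return image.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


# Two inputs with the same "feature" at different positions
# inside the same pooling window.
a = np.zeros((4, 4))
a[0, 0] = 1.0
b = np.zeros((4, 4))
b[1, 1] = 1.0

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)

print(np.array_equal(a, b))                 # the inputs differ
print(np.array_equal(pooled_a, pooled_b))   # the pooled outputs do not
```

After pooling, the two inputs are indistinguishable: the network is invariant to the small shift, but the exact position of the feature, and with it any information about pose, is no longer available to later layers.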
5 Stability and Chaos
An important issue concerning the trustworthiness of DNNs is their susceptibility to mistakes when adversarial examples are introduced as inputs, causing them to make wrong decisions. Examples intentionally designed to fool a model are adversarial attacks, which some call optical illusions for machines, as they mostly concern widely discussed examples of striking miscategorization of pictures. Quite famous is the case of a classification network which had been trained to distinguish between a number of image categories, with panda and gibbon being two of them. The classifier recognizes the image of a panda with 57.7% confidence. If a small perturbation is added to the picture, the classifier classifies the image as a gibbon with 99% confidence [18]. Research has shown that the output of deep neural networks (DNN) can easily be changed by adding relatively small perturbations to the input vector. There also exist successfully applied attacks designed with a one-pixel image perturbation, for example
based on what is called differential evolution (DE), which can fool more types of networks [30]. Reinforcement learning (RL) is the autonomous learning of agents who learn from experience how to carry out a designated task and discover the best policy of behavior, or the best actions to undertake, through interaction with their environment and evaluation of the corresponding collection of rewards and punishments. RL systems have also been shown to be prone to mistakes caused by adversarial attacks: it has been demonstrated that learning agents can be manipulated by adversarial examples. Research shows that widely used RL algorithms, such as DQN (deep Q-network), TRPO (trust region policy optimization), and A3C (asynchronous advantage actor-critic), are vulnerable to adversarial inputs. Performance degrades even in the presence of perturbations which are too subtle to be perceived by a human, and this can cause an agent to make wrong decisions [9,21]. ML systems are highly complex, and complexity makes a system highly dependent on initial conditions. The examples mentioned here, in which a small perturbation causes the system to make a jump in category space, present an analogy to the behavior known from chaotic systems: small changes in the starting state can generate a big difference in the dynamics of the system later on. The noise that had to be added to the panda picture in order to obtain the false classification was a so-called custom-made perturbation, especially generated by a GAN, a generative adversarial network, trained to fool models by exploiting chaos. Perturbations can be meticulously designed to serve certain purposes and make a DNN take a wrong decision; however, completely random perturbations, which can arise accidentally in very complex environments where ML is already applied or is planned to be applied in the near future, especially in systems with real-time requirements, can also cause serious mistakes with possibly catastrophic consequences.
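How a perturbation that is tiny in every input component can nevertheless flip a decision can be sketched with a linear classifier. The block below uses a sign-gradient perturbation on a hypothetical logistic model with 1000 inputs; it is an assumed toy illustration and not the procedure used in the cited attacks, and the weights, input and step size `eps` are all invented for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)
dim = 1000                                # hypothetical number of input "pixels"
w = rng.choice([-1.0, 1.0], size=dim)     # hypothetical fixed model weights

# Construct an input whose logit is exactly -2, i.e. clearly class 0.
x = rng.normal(size=dim)
x -= w * (w @ x + 2.0) / dim

# For a linear model the gradient of the logit with respect to the input
# is w, so the most damaging bounded perturbation is eps * sign(w).
eps = 0.01
x_adv = x + eps * np.sign(w)

logit_clean = float(w @ x)
logit_adv = float(w @ x_adv)
p_clean = sigmoid(logit_clean)
p_adv = sigmoid(logit_adv)

print(f"largest change per component: {np.max(np.abs(x_adv - x)):.3f}")
print(f"P(class 1) before: {p_clean:.3f}, after: {p_adv:.4f}")
```

Each component changes by at most 0.01, while typical component magnitudes are of order one, yet the many small contributions add up along the weight vector and move the logit from -2 to 8. This accumulation over a high-dimensional input is one way to picture the sensitivity that the text compares to chaotic behavior.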
In certain cases it can be difficult to discern between input signal and perturbation. There is a close relationship between complex systems research and ML, with a wide range of cross-disciplinary interactions. Exploring how ML works with respect to complexity is a subject of significant research which has to be considered also in the context of interpretation [31]. For time series classification problems (TSCP), features have to be ordered by time, unlike in traditional classification problems. CNNs have been applied to time series, automatically tailoring filters that represent repeated patterns and learning and extracting features from the raw data. Recurrent neural networks (RNNs) are a family of NNs used especially to address tasks which involve time series as input, and are therefore deployed in sequential data processing and continuous-time environments. They are capable of memorizing historic inputs: they possess dynamic memory, as they preserve in their internal state a nonlinear transformation of the input history. They are characterized by the presence of feedback connections in their hidden layer, which gives them short-term memory capability. However, their learning of short- and long-term dependencies is problematic when implemented by means of gradient descent (vanishing/exploding gradients), and their training with backpropagation through time is computationally intensive and often inefficient. The interpretability of the internal dynamics of RNNs is input-dependent and
almost infeasible given the complexity of the time and space dependent activity of their neurons.
6 Nonclassical Approaches, Training of Attractors
Nonclassical approaches have been developed to generate reproducible sequences of metastable states and attractors that explain RNN behavior. Some of them are based on heteroclinic networks, in which multiple saddle fixed points act as nodes that are connected by heteroclinic orbits as edges in the phase space of the learning system. For this task, known engineering methods have been extended to enable data-based inference of heteroclinic dynamics [34]. These approaches use reservoir computing (RC) and reservoirs, which can be employed instead of temporal kernel functions to avoid training-related challenges associated with RNNs (slow convergence, instabilities etc.). Echo state networks (ESNs) and liquid state machines (LSMs) have been proposed as possible RNN alternatives under the name of RC. Reservoirs, seen as generalizations of RNN architectures and ESNs, are far easier to train and have been mainly associated with the supervised learning underlying RNNs. They map input signals into higher-dimensional computational spaces through the dynamics of fixed nonlinear systems, the reservoirs. ESNs are considered suitable as universal approximators of arbitrary dynamic systems. Furthermore, the NN of the reservoir is randomly generated and only the readout has to be trained. The trained output layer delivers linear combinations of the internal states, interpreting the dynamics of the reservoir and its perturbations by external inputs. Reservoir computing can be applied for model-free, data-based predictions of nonlinear dynamic systems. Reservoirs can also be applied to physical systems that are continuous in space and/or time, allowing computations in situations where partially or completely unknown interactions or extreme variations of the input signal take place, which allow only very limited functional control and almost no predictability.
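The division of labor described above, a fixed random reservoir plus a trained linear readout, can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the setup of any cited work: the reservoir size, the weight ranges, the ridge parameter, the sine-wave task and all function names are hypothetical choices. Instead of computing the spectral radius, the sketch bounds it by rescaling the largest absolute row sum of the weight matrix.

```python
import math
import random

def make_reservoir(n, seed=0):
    """Random input and recurrent weights; these stay fixed, only the readout is trained."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    w = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    # Rescale so that the largest absolute row sum is 0.9. This bounds the
    # spectral radius by 0.9, a simple sufficient condition for fading memory.
    max_row_sum = max(sum(abs(v) for v in row) for row in w)
    w = [[0.9 * v / max_row_sum for v in row] for row in w]
    return w_in, w

def run_reservoir(w_in, w, inputs):
    """Drive the reservoir with the input sequence and collect its states."""
    n = len(w_in)
    state = [0.0] * n
    states = []
    for u in inputs:
        state = [math.tanh(sum(w[i][j] * state[j] for j in range(n)) + w_in[i] * u)
                 for i in range(n)]
        states.append(state + [1.0])  # constant bias feature for the readout
    return states

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def train_readout(states, targets, ridge=1e-6):
    """Ridge regression: solve (S^T S + ridge * I) w_out = S^T y."""
    k = len(states[0])
    gram = [[sum(s[i] * s[j] for s in states) + (ridge if i == j else 0.0)
             for j in range(k)] for i in range(k)]
    rhs = [sum(s[i] * y for s, y in zip(states, targets)) for i in range(k)]
    return solve(gram, rhs)

# Toy task: predict the next sample of a sine wave from the reservoir state.
signal = [math.sin(0.2 * t) for t in range(501)]
w_in, w = make_reservoir(20)
states = run_reservoir(w_in, w, signal[:-1])
targets = signal[1:]
washout, split = 50, 400  # discard the initial transient, hold out the tail for testing
w_out = train_readout(states[washout:split], targets[washout:split])
predictions = [sum(a * b for a, b in zip(w_out, s)) for s in states[split:]]
mse = sum((p - y) ** 2 for p, y in zip(predictions, targets[split:])) / len(predictions)
print(f"held-out mean squared error: {mse:.2e}")
```

Only `train_readout` involves learning, and it amounts to solving one linear system, which is why reservoirs avoid the gradient-related training problems of RNNs mentioned above.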
Andrea Ceni, Peter Ashwin, and Lorenzo Livi have investigated the possibility of exploiting transient dynamical regimes, and what they define as excitable network connections, to switch between different stable attractors of the model for classification purposes [6]. They demonstrated how to extract such excitable network attractors (ENAs) from ESNs: the preceding training induced bifurcations that generated fixed points in phase space, so that the trained system could move from one stable attractor to another under small input perturbations. The hope is that this can be exploited for classification problems that involve switching between a finite set of classes (attractors), and that it could be used instead of RNNs. Input-dependent excitability thresholds of excitable connections have also been defined to measure the minimum distance in phase space that is necessary for a solution to escape from a stable point and converge to another. The authors found that there exist local switching subspaces (LSS) in the vicinity of attractors, the dimensions of which relate directly to the activity of connections in the network when the ESN solves
a task, depending on the complexity of the input and its impact on the dynamics of the reservoir. This has to be assessed on a case-by-case basis. Finding fixed points of the system dynamics depends on the convergence of the optimization algorithm, and one can obtain similar solutions which, depending on the chosen tolerance, are numerically different. Excitability thresholds should be important for the robustness of the solutions. ESNs which yield network attractors with low excitability thresholds were found to be less robust to noise perturbations. But sensitivity and accuracy of the network do suffer under low excitability. Training the reservoir simply means tuning the readout parameters by comparing input and output data and using an autoregressive process to minimize the difference. The result of the training could be, for example, a classification system which, given a sequence of patterns, can recognize each pattern by itself. A trained reservoir should act as an autonomous dynamical system whose state evolution, given the initial conditions, represents the state evolution of the nonlinear dynamical system that has to be predicted (task system). The forecast horizon is used to estimate the quality of short-time predictions of such a trained system. It is defined as the time between the start of a prediction and the point where the prediction deviates from the test data by more than a fixed threshold. There have been investigations into how to choose training hyperparameters such as reservoir size, spectral radius, network connectivity, training sample size, training window and so on, in order to get reliable predictions. The latter must be compared with the typical time scales of the motion of the system, which are determined by the maximum Lyapunov exponent. However, the calculation of the Lyapunov exponent is complex and numerically unstable, and one needs to know the mathematical model of the system to calculate it.
This is not the case if one has only the time series data. The dynamics of a system can also be multiscale and noisy, which might sometimes lead to rare transition events. Some systems can also spend very long periods of time in various metastable states and then rarely, at apparently random times and due to some influencing signal, suddenly transform into a new, quantitatively different state. Such changes in the dynamical behavior of complex systems are also known as critical transitions and occur at so-called tipping points. Theories explain this behavior by a large separation between the time scales on which the system state and the signal evolve. Complex and multiscale data also have to be analyzed for predictions of system behavior. It is an open question how well events, including rare events, can be predicted in multiscale nonlinear dynamic systems when only the slow system state data are used for the training and the physics of the data-generating system is perhaps only partially known. In this context there are developments in the direction of so-called physics-informed ESNs, which are ESNs extended to represent solutions of ODE systems (ordinary differential equations), with the aim of introducing causality into ML. Physical information is imposed on the reservoir by means of special constraints derived from invariance principles. The ESN architecture should be represented by an ODE approximator, which implements a physics-informed training scheme for the reservoir computing model [13]. Jiang et al. (2019) [20] have
demonstrated for reservoir computing systems employed for model-free prediction of nonlinear dynamical systems that there exists an interval for their spectral radius within which the prediction error is minimized. The authors performed many experiments in which the many hyperparameters of the reservoir were kept fixed and only the edge weights were left free. A reservoir consisting of a complex network of N interconnected neurons is characterized by its adjacency matrix, an N × N weighted matrix whose largest absolute eigenvalue is the network’s spectral radius. The authors used ensemble-averaged predictions to show that the spectral radius of the reservoir plays a fundamental role in achieving correct predictions. They substantiated this finding by experimenting with a number of spatiotemporal dynamical systems known from physics: the nonlinear Schrödinger equation (NLSE), the Kuramoto-Sivashinsky equation (KSE), and the Ginzburg-Landau equation (GLE), for which they could compare the evolution of the true solution with the corresponding results delivered by trained ESNs. For all the examined systems, optimal intervals for the values of the spectral radius could be found, and it could be determined that the prediction error rises immensely when the radius lies outside this interval. This result remained valid independently of the rest of the network parameters. Even in a case where the calculations showed that only about 50 out of 100 ensemble realizations resulted in acceptable predictions, the spectral radius still had to be taken from the optimal interval in order to get reasonable results in terms of accuracy and time. Remarkably, the necessity of choosing the spectral radius from the optimal interval in order to get meaningful predictions remains valid even when the solution is chaotic.
Furthermore, it could be demonstrated that using a directed or an undirected network topology strongly influences the size of the spectral radius interval, the directed case leading to different spectral radius values and also to an absolute minimum of achievable prediction error [20]. While traditional methods for chaotic dynamical systems manage to make short-term predictions for about one Lyapunov time, model-free reservoir computing predictions based only on data demonstrate a prediction horizon of up to about half a dozen Lyapunov times [20]. It has also been discovered that the computational efficiency of ESNs is maximized when the network is at the border between a stable and an unstable dynamical regime, at the so-called edge of criticality or edge of chaos. This makes the state between ordered dynamics (where disturbances die out fast) and chaotic dynamics (where disturbances get amplified) especially interesting. The average sensitivity of a dynamical system to perturbations of its initial conditions allows one to decide whether it has ordered or chaotic dynamics. There seems to exist no standard recipe for designing an RNN or an ESN so that it operates steadily in its critical regime independently of task properties. Researchers suggest the development of mechanisms for self-organized criticality in ESNs [25,33]. Could a guarantee of a very low error in results finally substitute for the demand for explanations of the predictions of ML systems, so as to categorize them as trustworthy without case-dependent technicalities such as counterfactual explanations, feature-based explanations, adversarial perturbation-based explanations etc.? It is quite obvious that, using established XAI methods, the creation of
explanations would find it difficult to keep pace with the rate of production of results that need to be explained (dynamical systems, online learning, IIoT etc.).
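The forecast horizon and the Lyapunov time used in this section can be illustrated with a short sketch. It assumes a stand-in for a trained predictor: the "prediction" is simply a second orbit of the logistic map started from a slightly perturbed initial condition, so the example shows sensitivity to initial conditions rather than the behavior of a real reservoir. The map parameter r = 4, for which the largest Lyapunov exponent is known to be ln 2, the threshold and the function names are illustrative choices.

```python
import math

def forecast_horizon(predicted, reference, threshold, dt=1.0):
    """Time from the start of the prediction until it first deviates from the
    reference by more than the threshold (full length if that never happens)."""
    for step, (p, r) in enumerate(zip(predicted, reference)):
        if abs(p - r) > threshold:
            return step * dt
    return min(len(predicted), len(reference)) * dt

def logistic_orbit(x0, steps, r=4.0):
    """Orbit of the logistic map x -> r * x * (1 - x), chaotic for r = 4."""
    orbit = [x0]
    for _ in range(steps - 1):
        orbit.append(r * orbit[-1] * (1.0 - orbit[-1]))
    return orbit

reference = logistic_orbit(0.3, 200)
predicted = logistic_orbit(0.3 + 1e-9, 200)  # stand-in for an imperfect predictor

horizon_steps = forecast_horizon(predicted, reference, threshold=0.1)
lyapunov_exponent = math.log(2.0)  # known value for the logistic map at r = 4
horizon_in_lyapunov_times = horizon_steps * lyapunov_exponent

print(f"forecast horizon: {horizon_steps:.0f} steps, "
      f"about {horizon_in_lyapunov_times:.1f} Lyapunov times")
```

Expressing the horizon in Lyapunov times, that is in units of the inverse of the largest Lyapunov exponent, makes predictions comparable across systems with different intrinsic time scales, which is how the figures of one and about half a dozen Lyapunov times quoted above are to be read.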
7 Causality of Results?
It is plausible that it is difficult to trust the results of ML and deep learning, or to interpret them comprehensibly, unless causality regarding the production of these results can be established as a basis for the interpretation. Causality implies a temporal notion, in the sense that there is a direction in time which dictates how a past causal event in one variable produces a future event in some other variable. This leads to a natural spatiotemporal definition of causal effects that can be used to detect arrows of influence in real-world systems [1]. Mechanistic models which are fitted to predict results in complicated dynamical systems represent simplified, versatile descriptions of scientific hypotheses, and their parameters are interpretable because they have a correspondence in the physical world. It is different with causation inference from data, the so-called observational inference, which constitutes a challenging problem for complex dynamical systems, from theoretical foundations to practical computational issues [2]. Granger’s causality formulation describes a form of influence on predictability (or the lack of predictability): from time-dependent observations of a free complex system, without any probing of it, it examines whether the knowledge of one time series is useful in forecasting another time series, in which case the former can be seen and interpreted as potentially causal for the latter. The question of causation is fundamental for problems of control, policy decisions and forecasts, and there can probably be no decision explanation without revealing the causation inference of the decision-supporting system. Measures based on Shannon entropy, an information-theoretic approach, allow for a very general characterization of dependencies in complex and dynamical systems, from symbolic to continuous descriptions.
In analogy to Wiener-Granger causality for linear systems, transfer entropy is a way to consider questions of pairwise information transfer between nonlinear dynamical systems. However, several works have shown its limitations in measuring dependence and causation. Some researchers examine the causation problem with respect to dynamical attractors and the concept of generalized synchronization. Convergent cross-mapping tests implement the examination of the so-called closeness principle. Within the framework of structural causal models (SCMs), conditions have been examined under which nonlinear models can be identified from observational data. However, this method does not always deliver unique solutions.
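The predictability idea behind Granger's formulation can be sketched as follows. This is a deliberately reduced illustration, assuming a single lag, linear models without intercept and synthetic zero-mean data; a proper Granger test uses several lags and a statistical test on the reduction of the residual variance. All names and coefficients are hypothetical.

```python
import random

def rss_own_history(target, own_lag):
    """Residual sum of squares of the model target[t] ~ a * own_lag[t]."""
    a = sum(t * o for t, o in zip(target, own_lag)) / sum(o * o for o in own_lag)
    return sum((t - a * o) ** 2 for t, o in zip(target, own_lag))

def rss_with_source(target, own_lag, source_lag):
    """Residual sum of squares of target[t] ~ a * own_lag[t] + b * source_lag[t],
    obtained by solving the 2x2 normal equations in closed form."""
    soo = sum(o * o for o in own_lag)
    sss = sum(s * s for s in source_lag)
    sos = sum(o * s for o, s in zip(own_lag, source_lag))
    sto = sum(t * o for t, o in zip(target, own_lag))
    sts = sum(t * s for t, s in zip(target, source_lag))
    det = soo * sss - sos * sos
    a = (sto * sss - sts * sos) / det
    b = (sts * soo - sto * sos) / det
    return sum((t - a * o - b * s) ** 2
               for t, o, s in zip(target, own_lag, source_lag))

def predictive_gain(source, target):
    """Fraction of the one-step prediction error of `target` that disappears
    when the previous value of `source` is added to the target's own history."""
    now, own_lag, source_lag = target[1:], target[:-1], source[:-1]
    return 1.0 - rss_with_source(now, own_lag, source_lag) / rss_own_history(now, own_lag)

# Synthetic data in which x drives y with a delay of one step, but not vice versa.
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(500)]
y = [0.0]
for t in range(1, 500):
    y.append(0.5 * y[-1] + 0.8 * x[t - 1] + rng.gauss(0.0, 0.1))

gain_x_to_y = predictive_gain(x, y)
gain_y_to_x = predictive_gain(y, x)
print(f"gain from adding x to the forecast of y: {gain_x_to_y:.3f}")
print(f"gain from adding y to the forecast of x: {gain_y_to_x:.3f}")
```

A large gain in one direction and a negligible gain in the other is what would be read as x being potentially causal for y. The sketch also hints at the limitations mentioned above: a hidden common driver or a nonlinear coupling would not be handled correctly by this linear, pairwise comparison.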
8 Conclusions
ML algorithms and their implementations are inherently highly complex systems and the quality of their predictions under real-world operation conditions cannot be safely quantified. To explain the functionality of a deep-learning system
under the influence of an arbitrary input from the domain for which the system has been designed and trained is generally considered to be impossible. NN-based ML systems will be explained mostly through observations of the magnitude of network activations along paths connecting their neurons, followed back to the network input. Especially popular are XAI visualizations for interpretability, which highlight those parts of an image that are most strongly correlated with the classification result (attention-based explanations). Such explanations are not always unambiguous, and they are not intuitive, repeatable or unique. Arun Das et al. (2020) [11] write about the “inability of human-attention to deduce XAI explanation maps for decision-making and the unavailability of a quantitative measure of completeness and correctness of explanation maps”. The authors recommend further developments if visualization techniques are to be used for mission-critical AI applications. Returning to causality, it has to be emphasized that causal inference from observational data is an open issue and still a subject of research. Attempts to create explainable surrogate models, for example using ODE systems (for instance neural ODE architectures for sequential data processing) whose equation parameters are adapted with the help of ML, are subject to uncertainty and errors. If dynamical systems could be endowed with some kind of self-awareness, that is, if they could maintain an inbuilt mechanism of internal active control able to instantaneously evaluate whether the system’s state is ordered, critical or chaotic, this would empower them to even ask for human intervention.
However, the time scale on which systems undergo phase transformations and the duration of their stay in new states are beyond control, so that a request might no longer be relevant before a human specialist can react, let alone prevent undesired system decisions by forcing some alternative decision or even stopping the system. Such an option would be a contradiction in itself, because AI systems are developed and employed to produce decisions correctly and fast based on data alone, as they are intended for tasks which no human experts can perform efficiently. This of course applies to the cases in which the AI systems operate as desired by their developers. Another matter is the significance and the priority of explanations, for example when a new, unforeseen and therefore not assessable decision has been delivered. Returning to the creative and unique move 37 of game 2 of AlphaGo, which would have been chosen with a probability close to one in ten million, how could it ever have been possible to explain this move to someone and convince them in advance that this was indeed the right move to make in order to win the game? The trend is towards a growing need for creative and unique decisions generated by AI systems for a world of increasing complexity, to open the way to new perceptions and novel concepts. For example, could AI prevent a disaster by predicting unforeseen threats in time? In this sense many AI systems may have to stay unpredictable in order to deal with unpredictable and even chaotic circumstances, which call for unexpected solutions that inherently lack explanations built upon previous experience and already discovered knowledge.
References
1. Bianco-Martinez, E., Baptista, M.S.: Space-time nature of causality. Chaos 28, 075509 (2018). https://doi.org/10.1063/1.5019917
2. Bollt, E.M., Sun, J., Runge, J.: Introduction to focus issue: causation inference and information flow in dynamical systems: theory and applications. Chaos 28, 075201 (2018). https://doi.org/10.1063/1.5046848
3. Buhrmester, V., Münch, D., Arens, M.: Analysis of explainers of black box deep neural networks for computer vision: a survey. arXiv e-print (2019). https://arxiv.org/abs/1911.12116
4. Brownlee, J.: Confidence intervals for machine learning. Tutorial at Machine Learning Mastery (2019). https://machinelearningmastery.com/confidence-intervals-for-machine-learning/
5. Canaan, R., Salge, C., Togelius, J., Nealen, A.: Leveling the playing field - fairness in AI versus human game benchmarks. arXiv e-print (2019). https://arxiv.org/abs/1903.07008
6. Ceni, A., Ashwin, P., Livi, L.: Interpreting recurrent neural networks behaviour via excitable network attractors. Cogn. Comput. 12(2), 330–356 (2020). https://doi.org/10.1007/s12559-019-09634-2
7. Cerliani, M.: Neural networks ensemble. Posted on towards data science (2020). https://towardsdatascience.com/neural-networks-ensemble-33f33bea7df3
8. Makhijani, C.: Advanced ensemble learning techniques. Posted on towards data science (2020). https://towardsdatascience.com/advanced-ensemble-learning-techniques-bf755e38cbfb
9. Chen, T., Liu, J., Xiang, Y., Niu, W., Tong, E., Han, Z.: Adversarial attack and defense in reinforcement learning-from AI security view. Cybersecurity 2(1), 1–22 (2019). https://doi.org/10.1186/s42400-019-0027-x
10. Cui, Y., Ahmad, S., Hawkins, J.: Continuous online sequence learning with an unsupervised neural network model. Neural Comput. 28, 2474–2504 (2016). https://numenta.com/neuroscience-research/research-publications/papers/continuous-online-sequence-learning-with-an-unsupervised-neural-network-model/
11. Das, A., Rad, P.: Opportunities and challenges in explainable artificial intelligence (XAI): a survey. arXiv e-print (2020). https://arxiv.org/abs/2006.11371
12. David, J.M., Krivine, J.P., Simmons, R.: Second generation expert systems: a step forward in knowledge engineering. In: David, J.M., Krivine, J.P., Simmons, R. (eds.) Second Generation Expert Systems, pp. 3–23. Springer, Heidelberg (1993). https://doi.org/10.1007/978-3-642-77927-5_1
13. Doan, N.A.K., Polifke, W., Magri, L.: Physics-informed echo state networks for chaotic systems forecasting. In: Rodrigues, J.M.F., et al. (eds.) ICCS 2019. LNCS, vol. 11539, pp. 192–198. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-22747-0_15
14. General Data Protection Regulation. https://gdpr-info.eu/
15. Intel Labs: Neuromorphic Computing - Next Generation of AI. https://www.intel.com/content/www/us/en/research/neuromorphic-computing.html
16. French, R.M.: Catastrophic forgetting in connectionist networks. Trends Cogn. Sci. 3(4), 128–135 (1999). https://doi.org/10.1016/S1364-6613(99)01294-2
17. Garbin, C., Zhu, X., Marques, O.: Dropout vs. batch normalization: an empirical study of their impact to deep learning. Multimed. Tools Appl. 79, 12777–12815 (2020). https://doi.org/10.1007/s11042-019-08453-9
18. Goodfellow, I., Shlens, J., Szegedy, C.: Explaining and harnessing adversarial examples. In: International Conference on Learning Representations (2015). http://arxiv.org/abs/1412.6572
19. Grossi, E.: How artificial intelligence tools can be used to assess individual patient risk in cardiovascular disease: problems with the current methods. BMC Cardiovasc. Disord. 6 (2006). Article number: 20. https://doi.org/10.1186/1471-2261-6-20
20. Jiang, J., Lai, Y.-C.: Model-free prediction of spatiotemporal dynamical systems with recurrent neural networks: role of network spectral radius. Phys. Rev. Res. 1(3), 033056-1–033056-14 (2019). https://doi.org/10.1103/PhysRevResearch.1.033056
21. Ilahi, I., et al.: Challenges and countermeasures for adversarial attacks on deep reinforcement learning. arXiv e-print (2020). https://arxiv.org/abs/2001.09684
22. Karpathy, A.: Software 2.0. medium.com (2017). https://medium.com/@karpathy/software-2-0-a64152b37c35
23. Patrick, M.K., Adekoya, A.F., Mighty, A.A., Edward, B.Y.: Capsule networks - a survey. J. King Saud Univ. Comput. Inf. Sci. 1319–1578 (2019). https://doi.org/10.1016/j.jksuci.2019.09.014
24. Montavon, G., Binder, A., Lapuschkin, S., Samek, W., Müller, K.-R.: Layer-wise relevance propagation: an overview. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller, K.-R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. LNCS (LNAI), vol. 11700, pp. 193–209. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6_10
25. Pathak, J., et al.: Using machine learning to replicate chaotic attractors and calculate Lyapunov exponents from data. Chaos 27, 121102 (2017). https://doi.org/10.1063/1.5010300
26. Raffin, A., Hill, A., Traoré, R., Lesort, T., Díaz-Rodríguez, N., Filliat, D.: Decoupling feature extraction from policy learning: assessing benefits of state representation learning in goal based robotics. In: SPiRL Workshop ICLR (2019). https://openreview.net/forum?id=Hkl-di09FQ
27. Richter, J.: Machine learning approaches for time series. Posted on dida.do (2020). https://dida.do/blog/machine-learning-approaches-for-time-series
28. Singh, A., Sengupta, S., Lakshminarayanan, V.: Explainable deep learning models in medical image analysis. J. Imaging 6(6), 52 (2020). https://doi.org/10.3390/jimaging6060052
29. Strehlitz, M.: Wir können keine Garantien für das Funktionieren von KI geben. Interview with Prof. Dr. habil. Mario Trapp, director of Fraunhofer IKS (2019). https://barrytown.blog/2019/06/25/wir-koennen-keine-garantien-fuer-das-funktionieren-von-ki-geben/
30. Su, J., Vargas, D.V., Sakurai, K.: One pixel attack for fooling deep neural networks. IEEE Trans. Evol. Comput. 23(5), 828–841 (2019). https://doi.org/10.1109/TEVC.2019.2890858
31. Tang, Y., Kurths, J., Lin, W., Ott, E., Kocarev, L.: Introduction to focus issue: when machine learning meets complex systems: networks, chaos, and nonlinear dynamics. Chaos 30, 063151 (2020). https://doi.org/10.1063/5.0016505
32. Tricentis: AI In Software Testing. AI Approaches Compared: Rule-Based Testing vs. Learning. https://www.tricentis.com/artificial-intelligence-software-testing/ai-approaches-rule-based-testing-vs-learning/
33. Verzelli, P., Alippi, C., Livi, L.: Echo state networks with self-normalizing activations on the hyper-sphere. Sci. Rep. 9, 13887 (2019). https://doi.org/10.1038/s41598-019-50158-4
34. Voit, M., Meyer-Ortmanns, H.: Dynamical inference of simple heteroclinic networks. Front. Appl. Math. Stat. (2019). https://doi.org/10.3389/fams.2019.00063
35. Zhang, Q., Wu, Y.N., Zhu, S.: Interpretable convolutional neural networks. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, pp. 8827–8836 (2018). https://doi.org/10.1109/CVPR.2018.00920
Use Case Prediction Using Deep Learning
Tinashe Wamambo(B), Cristina Luca, Arooj Fatima, and Mahdi Maktab-Dar-Oghaz
Anglia Ruskin University, Cambridge Campus, East Road, Cambridge CB1 1PT, UK
[email protected], {cristina.luca,arooj.fatima,mahdi.maktabdar}@aru.ac.uk
Abstract. Research into utilising text classification to analyse product reviews from e-commerce websites has increased tremendously in recent years. Machine Learning and Deep Learning classifiers have been utilised to organise, categorise and classify product reviews, enabling the identification of polarity and sentiment within product reviews. In this paper, we propose a methodology to classify product reviews using machine learning and deep learning with the intention to identify and predict the activity (use case) in which the consumer used the product they have reviewed.
Keywords: Natural Language Processing · Text classification · Machine Learning · Deep Learning
1 Introduction
In the modern world the internet is the most valuable resource for learning, getting ideas, and buying and selling products and services. E-commerce retail websites such as Amazon, eBay, etc. experience a large amount of internet traffic as consumers purchase products from their websites. Every day millions of reviews are generated by consumers as they provide feedback about products and services and their experience using them. The increased popularity of e-commerce websites and the explosion of product reviews in record numbers have led to increased research into sentiment analysis and text classification. Sentiment analysis (also known as opinion mining) is the process of analysing text documents to extract and understand the sentiments expressed in the text. A combination of natural language processing (NLP) with a machine learning capability (also known as text classification) is utilised to determine the polarity of a text document, i.e. identifying whether the opinion expressed in a text document is positive, negative or neutral. Text classification is also utilised to determine a text document’s sentiment orientation, i.e. identifying whether the opinion expressed in a text document is subjective or objective. Product reviews are packed full of subjective opinions because consumers provide feedback on their experience using a product or service. Millions of product reviews have been generated by consumers who have purchased a product and have had experience using and benefiting from that product.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 309–317, 2022. https://doi.org/10.1007/978-3-030-82193-7_20
New consumers looking to purchase the same or similar products utilise these product reviews as part of their decision making process to make sure they make informed purchasing decisions. However, given the large number of reviews generated for a given product, the average consumer will most likely not read all of the reviews to gain a holistic view of the product and of the sentiments that consumers who have already bought and used the product have towards it. Currently the quickest and easiest decision making element tied to product reviews is the star rating: the higher the star rating, the more likely a consumer is to purchase that product. However, a star rating is ambiguous and prone to grade inflation, e.g. a 4.8 star rating does not mean the product is exceptional, and the difference between a product with a 4.5 star rating and one with a 4.8 star rating could be massive, which makes it very difficult for consumers to differentiate between OK products and very good ones. Another issue with star ratings is that they are shallow: they do not provide summarised information about the sentiments expressed by consumers or why consumers expressed such sentiments in the product reviews. Star ratings also leave a lot of room for assumptions to be made about a product’s suitability for certain tasks, e.g. a pair of shoes with a high star rating may be suitable for running but not for hiking. New consumers looking to purchase that pair of shoes would not be provided with this information in a quick and easy manner. This research aims to identify the activities other consumers used a product for alongside the polarity and sentiment they expressed in the product reviews using sentiment analysis, text classification and machine learning.
The intention is to provide new consumers with valuable summarised information that they can use to decide if a product is suitable for the activity they intend to perform. For this research, the activity a consumer used a product for is known as a ‘use case’, i.e. an action or activity performed using the product as described in a product review. An example of this is “I bought these boots for walking my dogs around the local park, they have been fit for purpose thus far”. The use case in this example is “walking”. The rest of the paper is organised as follows: Sect. 2 describes the related work, Sect. 3 presents the proposed approach to detect and predict a use case, Sect. 4 discusses the experiments performed and the results, and Sect. 5 draws the conclusions.
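The notion of a use case can be made concrete with a toy extractor applied to the example review above. This is only an illustrative heuristic and not the methodology proposed in this paper: it assumes that a use case appears as an "-ing" word directly after the word "for", and the function name and the regular expression are hypothetical.

```python
import re

def extract_use_cases(review):
    """Return words ending in 'ing' that directly follow 'for',
    e.g. 'for walking' yields 'walking'. A crude stand-in for verb detection."""
    matches = re.findall(r"\bfor\s+(\w+ing)\b", review, flags=re.IGNORECASE)
    return [m.lower() for m in matches]

review = ("I bought these boots for walking my dogs around the local park, "
          "they have been fit for purpose thus far")
use_cases = extract_use_cases(review)
print(use_cases)
```

Such a pattern misses use cases phrased differently ("I hike in these boots") and picks up non-activities ("for spring"), which illustrates why a learned classifier over richer features is a more promising direction than hand-written rules.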
2 Related Work
This research is motivated by advancements in machine learning techniques, sentiment analysis and text classification. In many reviews, users express their opinions towards a product’s features and their user experience. So, aspect/feature based sentiment analysis is a suitable direction to pursue. Action words/terms (verbs) are of particular importance for this research, so Parts of Speech
(POS) tagging will be considered for feature extraction. Many machine learning approaches have been implemented over the years; Support Vector Machines (SVM) and Naive Bayes are particularly popular for performing text classification effectively with high accuracy and for dealing with large datasets.
2.1 Parts of Speech
Part of Speech (POS) tagging has been used for feature extraction. Devi, et al. [1] performed sentence level classification to detect words tagged as nouns because aspects/features are usually described by nouns or noun phrases. Alfrjani, et al. [2] applied POS tagging to determine if tokens in reviews were nouns, verbs, adjectives, adverbs, etc. with the intention of extracting the POS tags as token features. Hemmatian and Sohrabi [7] utilised frequent based nouns as well as order and similarity based filtering to improve feature extraction using POS tagging. Devi, et al. [1] used POS tagging to extract product features from product reviews, however Alfrjani, et al. [2] used POS tagging to categorise words in a review as part of an integration process between a semantic domain ontology and natural language processing. Likewise, Hemmatian and Sohrabi [7] used POS tagging to extract features described as nouns or noun phrases. However, their approach included word frequency, whilst Devi, et al. [1] identified grammatical dependencies between words in sentences. Feature extraction strategies used by Devi, et al. [1] and Hemmatian and Sohrabi [7] have been considered for this research because product features are expected to be described as nouns or noun phrases, so POS tagging is vital.
2.2 Deep Learning
Deep learning is a branch of machine learning that aims to enable machines to learn and evolve, similar to the way humans learn from their memories and experiences throughout their lifetime. Instead of using predefined equations, deep learning sets up “basic parameters about the data and trains the computer to learn on its own by recognizing patterns” [14] using neural networks. A neural network consists of “multiple hidden layers that can learn increasingly abstract representations of the input data” [5] using weights that are adjusted during training [3] to produce better predictions. Parvathi and Jyothis [11] proposed a text classification strategy that involved using a Convolutional Neural Network (CNN) to identify relevant text to determine the category a document belonged to. The proposed strategy used “layers of neurons and a bag of words approach” [11] to analyse text documents. In their conclusions, Parvathi and Jyothis [11] reported accurate and positive results. However, they noted that deep learning models have a limitation of learning through observation which means they only contain knowledge provided in the training data instead of learning in a generalised manner.
Parwez, et al. [12] highlighted that traditional machine learning models suffered from a limitation of “relying on the bag of words representation of documents to generate features in which word order and context are ignored” [12], which could cause data sparsity problems. They proposed a neural network architecture that used Convolutional Neural Networks (CNN) to classify short text documents, e.g. tweets. The CNN models used a combination of generic and domain specific word embeddings to predict class labels, whilst considering the contextual information within text documents. Results showed that CNN models outperformed traditional machine learning models in terms of validation accuracy and optimal feature generation, which was used to analyse unlabelled text documents. Parwez, et al. [12] concluded that the proposed approach could be used to perform social media surveillance focused on predicting disease outbreaks.
Kolekar and Khanuja [8] compared machine learning algorithms and neural networks to find the better approach to classifying the polarity of tweets. They applied the Term Frequency and Inverse Document Frequency (TF-IDF) word embedding technique to the tweets and fed the features to Support Vector Machines (SVM), Naive Bayes (NB) and a Convolutional Neural Network (CNN) developed using Keras and TensorFlow. Results showed that “using deep learning approach has given better result compare to traditional machine learning technique like SVM and NB” [8].
Subramani, et al. [15] implemented a neural network based topic modelling architecture to analyse text documents and cluster highly similar documents together. They called this the Neural Topic Modelling approach. Their architecture used Latent Dirichlet Allocation (LDA), Keras and TensorFlow.
According to the researchers, the approach provides a scalable and unsupervised learning framework that accurately discovers topics for a text corpus by considering the “semantic meanings of the words ensuring the usefulness of the topics” [15]. Testing with short and large text documents produced positive results, leading Subramani, et al. [15] to conclude that their proposed topic modelling approach had real world applications, e.g. movie recommendations and news clustering.
To identify similar documents based on the semantic meaning of their text, Mo and Ma [10] built DocNet, a clustering system that combined word embedding vectors, a deep neural network and Euclidean distance. The expectation was for similar documents to have “small distances among their vectors while distinct document have large distances” [10]. Results showed this to be true, with DocNet performing better than traditional clustering techniques, i.e. TF-IDF and K-means clustering. However, Mo and Ma [10] stated that DocNet’s performance heavily depended on the similarity of “data distribution between data fed to DocNet and data to classificaton or clustering” [10], which means the DocNet system will most likely perform poorly on new datasets.
At the Google I/O conference in 2019, Sara Robinson [13], a developer advocate at Google, presented a text classification model that predicted the topic of a Stack Overflow question. As part of pre-processing the train/test dataset,
words that specified the topic of the Stack Overflow question within the text were replaced with the word avocado. This was done to prevent the machine learning model from using signal words to perform classification, and instead force it to generalise by finding patterns within the dataset, because some Stack Overflow questions may not specify the topic. A bag of words approach was used to encode words into matrices, applying a multi-hot encoding technique to convert the Stack Overflow questions into vocabulary size arrays. The training labels were also converted into multi-hot arrays because the model was designed to predict multiple labels. A deep learning neural network was developed that took the bag of words matrices as input data and fed them into hidden layers. The output layer of the neural network used a sigmoid activation, which returns a value between zero and one for each label, corresponding to the probability of the label being associated with the Stack Overflow question. To develop the deep learning text classification neural network, Sara Robinson [13] used Pandas, Scikit-learn and Keras to pre-process the data. An 80/20 train/test split was applied to the dataset and the neural network model architecture was built using TensorFlow. The model had 95% accuracy.
The techniques applied in various research in sentiment analysis and text classification have primarily focused on improving the classification of text to extract features, polarity, opinions and emotions expressed towards products by consumers. This has been valuable for understanding the sentiment expressed by consumers. However, it has not been able to identify why consumers hold those sentiments towards products, or the activities consumers have used the products for, which is a major limitation.
This research aims to resolve this limitation by understanding why consumers express particular opinions towards products, through identifying the activities consumers have used the products for.
3
Proposed Approach
The proposed approach is a deep learning neural network that has been developed to perform text classification on product reviews to detect and predict a use case. This approach is similar to the one presented by Sara Robinson [13] at the 2019 Google I/O conference. However, key signal words that specify the use case within the product review text have not been replaced with the word avocado. Multi-hot encoding has been applied to the product reviews and their labels, producing a dataset of matrices. The product reviews (extracted from Amazon) used for this research have only one label each and, therefore, the model does not classify on multiple labels as in [13], where the Stack Overflow questions being classified had multiple labels. A 90/10 train/test split has been applied to the dataset for this iteration, whereas Sara Robinson [13] applied an 80/20 train/test split to her dataset. In similar fashion to Sara Robinson [13], TensorFlow has been utilised to build, train and test a deep learning neural network model.
This approach focuses on the use cases listed below:
– Run
– Walk
– Hike
– Climb
– Swim
– Unknown.
30,000 product reviews have been extracted from Amazon to train and test the proposed model. They consist of 5000 reviews for each of the use cases in focus to make sure the train/test dataset is balanced, which gives the neural network model a greater chance of learning from balanced classes. Product reviews have been pre-processed and analysed: spelling mistakes have been corrected and stop words have been removed using NLTK’s natural language processing capabilities. TextBlob, a Python library that sits on top of NLTK, has been utilised to extract features that have been used to create multi-hot encoded bag of words matrices. The MultiLabelBinarizer class, which is part of the Scikit-learn library, has been used to multi-hot encode the labels. Part of Speech (POS) tagging has been utilised to identify and extract nouns, noun phrases and verbs as features, which is different to the approach proposed by Devi, et al. [1], who only extracted nouns. The verbs that have been extracted as features describe the actions performed as denoted in the review text, and the activity (use case) in which the consumer used the product is expected to be described by the verbs in some way.
NLTK is referred to as “The Conqueror” in EliteDataScience [4]. It is a “leading platform for building Python programs to work with human language” [16] that provides an easy to use interface to a suite containing a variety of text processing libraries. Many breakthroughs have been made in the field of analysing and processing text using NLTK as it is “responsible for conquering many text analysis problems” [4]. TextBlob, referred to as “The Prince” in [4], is a text processing library that “sits on the mighty shoulders of NLTK” [4]. It provides a “simple API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification” [9], and it also has a “gentle learning curve while boasting a surprising amount of functionality” [4]. TextBlob also allows the use of NLTK tools alongside its own tools, enabling access to the NLTK tool kit and all of its benefits.
The deep learning neural network has been created using the Sequential class, which is part of the Keras library. Three Dense layers, which are also part of the Keras library, have been added to the neural network; they are used to classify a data matrix in chunks spread across the hidden layers. The multi-hot encoded bag of words matrices and the multi-hot encoded labels have been provided as input to the neural network. The neural network trains using the training data matrices over five epochs, meaning that it goes through the entire training dataset five times.
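The multi-hot encoding step described in this section can be illustrated with a minimal, library-free sketch. It is not the authors' implementation (which relies on TextBlob and Scikit-learn's MultiLabelBinarizer); the example reviews, the helper names `build_vocabulary` and `multi_hot`, and the label ordering are assumptions made purely for illustration.

```python
from typing import Dict, List

# Use-case labels taken from the list in this section; the ordering is an assumption.
USE_CASES = ["run", "walk", "hike", "climb", "swim", "unknown"]


def build_vocabulary(token_lists: List[List[str]]) -> Dict[str, int]:
    """Map every distinct token to a fixed column index (sorted for stability)."""
    tokens = sorted({tok for toks in token_lists for tok in toks})
    return {tok: idx for idx, tok in enumerate(tokens)}


def multi_hot(items: List[str], index: Dict[str, int]) -> List[int]:
    """Return a 0/1 vector with a 1 in each column whose item is present."""
    vector = [0] * len(index)
    for item in items:
        if item in index:  # items outside the index are ignored
            vector[index[item]] = 1
    return vector


# Hypothetical, already pre-processed reviews (extracted nouns and verbs).
reviews = [
    ["shoes", "run", "marathon"],
    ["boots", "hike", "trail"],
    ["shoes", "walk"],
]
labels = [["run"], ["hike"], ["walk"]]

vocab = build_vocabulary(reviews)
label_index = {name: idx for idx, name in enumerate(USE_CASES)}

x = [multi_hot(tokens, vocab) for tokens in reviews]   # bag of words matrix
y = [multi_hot(label, label_index) for label in labels]  # encoded labels
```

Each row of `x` has one column per vocabulary word and each row of `y` has one column per use case; because every review carries a single label here, each row of `y` contains exactly one 1.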
4
Experiments and Results
4.1
Datasets Description
Amazon product reviews have been used as training/testing data for this research because Amazon holds a large number of free text product reviews as a result of its vast product range and user base. An extensive dataset¹ containing millions of Amazon product reviews has been used. Approximately 5 million reviews from the clothing and shoes categories have been retrieved.
4.2
Metrics
Table 1. Neural network metrics
Accuracy    Precision    Recall
90%         96%          44%
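As a reminder of how the three metrics in Table 1 relate to each other, the following sketch computes them from confusion-matrix counts. The counts are hypothetical and were chosen only so that the results land close to the reported values; they are not the authors' data, and the function name `classification_metrics` is an assumption.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        # share of all predictions that are correct
        "accuracy": (tp + tn) / total,
        # share of positive predictions that are correct
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # share of actual positives that the model finds
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts: 100 actual positives among 600 examples.
metrics = classification_metrics(tp=44, fp=2, fn=56, tn=498)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With these illustrative counts, accuracy stays high because true negatives dominate, precision is high because there are few false positives, and recall is low because more than half of the actual positives are missed.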
Table 1 shows the accuracy, precision and recall of the neural network model. As shown in the table, the model has high accuracy and high precision, but unfortunately it has low recall. A precision of 96% means that more than 9 out of 10 positive predictions are correct, whereas a recall of 44% means that only about 4 out of 10 actual positive cases are identified by the model. As a result, the model may not generalise well and has a high probability of producing a significant number of incorrect predictions. According to Google Developers [6], “improving precision typically reduces recall and vice versa” because of the tension that is present between precision and recall. The good news is that the neural network is not biased towards positive predictions, because such behaviour would show up as low precision combined with very high recall.
Table 2. Neural network performance
Accurately predicted use cases    Inaccurately predicted use cases
61%                               39%
Table 2 shows the performance of the neural network classifier when it is exposed to 1200 brand new product reviews extracted from the extensive dataset described in Sect. 4.1.
¹ https://nijianmo.github.io/amazon/index.html.
Even though the neural network classifier has a very high classification accuracy, as shown in Table 1, it did not accurately predict as large a share of use cases as expected. Nevertheless, the classifier accurately predicts the use case for the majority (61%) of the 1200 product reviews. This is a positive reflection of the neural network and its performance, given that it managed to classify completely new product reviews and accurately predict the use cases for a relatively large number of them, even though it recorded a low recall. The most likely cause for the neural network failing to accurately predict the use case for a larger number of product reviews is that the classifier does not generalise well enough.
5
Conclusions
This paper proposes an approach that develops a deep learning neural network to classify product reviews and predict the activity (use case) in which the consumer used the product. Natural language processing techniques, text classification and machine learning have been applied to develop the neural network. As shown by the metrics, the neural network has high accuracy and high precision. However, the low recall showed that results have a high probability of containing false negatives, i.e. use cases that the model fails to detect. This was evidenced by exposing the neural network to completely new product reviews it had never seen. The neural network accurately predicted the use case for the majority of the product reviews; however, approximately 40% of the product reviews were incorrectly classified, which is a significant number of reviews for a neural network to misclassify.
The aim of this research is to classify product reviews and accurately predict the activity (use case) a consumer used the product for, as described in the review text. Results and metrics from testing the neural network show that text classification recall needs to be improved to limit the prevalence of false predictions and make sure that accurate predictions are reliably produced as output. This will be the focus as further work is undertaken within this research.
References
1. Devi, D.V.N., Kumar, C.K., Prasad, S.: A feature based approach for sentiment analysis by using support vector machine. In: 2016 IEEE 6th International Conference on Advanced Computing (IACC) (2016). https://doi.org/10.1109/IACC.2016.11
2. Alfrjani, R., Osman, T., Cosma, G.: A new approach to ontology-based semantic modelling for opinion mining. In: 2016 UKSim-AMSS 18th International Conference on Computer Modelling and Simulation (UKSim) (2016). https://doi.org/10.1109/UKSim.2016.15
3. Allibhai, E.: Building A Deep Learning Model using Keras (2018). https://towardsdatascience.com/building-a-deep-learning-model-using-keras-1548ca149d37
4. EliteDataScience: 5 Heroic Python NLP Libraries (2017). https://elitedatascience.com/python-nlp-libraries
5. EliteDataScience: Keras Tutorial: The Ultimate Beginner’s Guide to Deep Learning in Python (2018). https://elitedatascience.com/keras-tutorial-deep-learning-in-python
6. Google Developers: Machine Learning Crash Course (2020). https://developers.google.com/machine-learning/crash-course/classification/precision-and-recall
7. Hemmatian, F., Sohrabi, M.K.: A survey on classification techniques for opinion mining and sentiment analysis. In: Artificial Intelligence Review (2017). https://link.springer.com/article/10.1007/s10462-017-9599-6
8. Kolekar, S.S., Khanuja, H.K.: Tweet classification with convolutional neural network. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA) (2018). https://doi.org/10.1109/ICCUBEA.2018.8697397
9. Loria, S.: TextBlob: Simplified Text Processing (2018). https://textblob.readthedocs.io/en/dev/
10. Mo, Z., Ma, J.: DocNet: a document embedding approach based on neural networks. In: 2018 24th International Conference on Automation and Computing (ICAC) (2018). https://doi.org/10.23919/IConAC.2018.8749095
11. Parvathi, P., Jyothis, T.S.: Identifying relevant text from text document using deep learning. In: 2018 International Conference on Circuits and Systems in Digital Enterprise Technology (ICCSDET) (2018). https://doi.org/10.1109/ICCSDET.2018.8821192
12. Parwez, MD., A., Abulaish, M., Jahiruddin: Multi-label classification of microblogging texts using convolution neural network. In: IEEE Access, vol. 7 (2019). https://doi.org/10.1109/ACCESS.2019.2919494
13. Robinson, S.: Live Coding A Machine Learning Model from Scratch (Google I/O’19), Google Cloud Platform (2019). https://www.youtube.com/watch?v=RPHiqF2bSs
14. SAS Institute: Deep Learning What it is and why it matters (2020). https://www.sas.com/en_us/insights/analytics/deep-learning.html
15. Subramani, S., Sridhar, V., Shetty, K.: A novel approach of neural topic modelling for document clustering. In: 2018 IEEE Symposium Series on Computational Intelligence (SSCI) (2018). https://doi.org/10.1109/SSCI.2018.8628912
16. The NLTK Project: Natural Language Toolkit (2017). https://www.nltk.org/
VAMDLE: Visitor and Asset Management Using Deep Learning and ElasticSearch
Viswanathsingh Seenundun1, Balkrishansingh Purmah1, and Zahra Mungloo-Dilmohamud2(B)
1 Accenture Technology, Ebene, Mauritius
2 Department of Digital Technologies, FoICDT, University of Mauritius, Reduit, Moka, Mauritius
[email protected]
Abstract. Visitor management and asset management are crucial in restricted places. This paper focuses on how artificial intelligence and image recognition can drive innovation in both visitor and asset management spaces by employing the latest advances in technology. The proposed solution, VAMDLE, is an Android application which uses deep transfer learning and Elasticsearch to facilitate the registration of visitors as well as the management of borrowed assets. TensorFlow was used to train a pre-trained model for assets image recognition and the new model was integrated into the Android application with the aid of the TFLite library. A restful web API was developed with the aid of Spring Boot to manage all the data used by the client application. The unique identifiers of the assets and of the employees were read and recognized using text recognition and regular expressions and Elasticsearch was used to automatically fill in forms. The use of these various tools and technologies resulted in an app with a simple interface, a very good classification accuracy and good average response time. The proposed system was able to register a classification accuracy of up to 97%.
Keywords: Asset management · Visitor management · Deep learning · ElasticSearch
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 318–329, 2022. https://doi.org/10.1007/978-3-030-82193-7_21
1 Introduction
Visitor management in protected areas requires information about the visitors such as who they are, where they are on the premises and how many there are. When there are a large number of visitors, the need to capture this information as quickly as possible becomes imperative. Asset management systems require information about the assets such as the type of equipment, the date purchased, the availability of the asset and who the asset is currently assigned to. Often these two systems are disparate, but in cases where these tasks are to be performed at the reception desk or kiosk, and as quickly as possible, it is much easier and more efficient to have a single application managing both. The use of technology coupled with machine learning and open source search and analytics engines can aid in speeding up these processes.
Deep learning, a subcategory of machine learning, has been used in many diverse fields. It learns by using multiple layers to gradually extract higher level features from the raw input [1]. One of the areas where deep learning excels is image classification [2]. Examples where deep learning has been used for the classification of images include crack detection in civil engineering structures [3], image detection in autonomous driving [4], visual search, where someone scans a product with a mobile phone after seeing it in a store or in a magazine and instantly orders it, and medical image analysis [5]. Deep transfer learning, a more recent technique, is also used extensively in image analysis, but here the learning process does not start from scratch; it starts from patterns which have been obtained while solving a different problem. This model is then applied to a new field.
In this research work, the use of Deep Learning together with ElasticSearch for a single Android app handling both visitor and asset management was investigated. Although Deep Learning has been used extensively for image classification, it has not been used in the context of an app for both visitor and asset management. The relevant literature, the design of the system and the resulting app are detailed in the sections that follow.
2 Background
2.1 Visitor Management and Asset Management
Visitor management software systems electronically track and record the usage of any public or private building or site [6]. They are used to increase security, to ensure that visitors comply with the site’s rules and regulations, to know who is on site and, finally, to impress visitors by enhancing the company image. They make the visitor sign-in process more efficient, accurate, and consistent. This software is usually installed on a self-service kiosk or on a device such as a PC or tablet at the reception desk. Some examples of visitor management systems are Envoy [7], proxyclick.com [8], swipedOn [9] and trackforce Valiant [10]. Although they provide many advantages, they do not allow for asset management.
Asset management systems, also known as enterprise asset management systems, electronically track and record equipment and inventory that are used in a company. Examples include Asset Panda [11], EZOfficeInventory [12], GoCodes Asset Management [13] and UpKeep [14]. However, most asset management systems are intended for advanced features rather than simple tasks like the borrowing and returning of assets. Furthermore, asset management tools do not have visitor management features.
2.2 CNN and MobileNet
Convolutional Neural Networks (CNNs) are deep learning algorithms which can take in an image as input, assign importance (learnable weights and biases) to various aspects/objects in the image and differentiate one from the other (Fig. 1). The architecture of a CNN is different from that of regular neural networks, which transform an input by passing it through several hidden layers (each consisting of a set of neurons), where each layer is fully connected to all neurons of the layer before and the final output layer represents the predictions. In a CNN, the layers are organized in 3 dimensions (width, depth and height) [15].
Fig. 1. The different CNN layers [16]
MobileNet is a CNN architecture model which has been designed for mobile and embedded vision applications [16]. MobileNets are based on a restructured architecture that uses depthwise separable convolutions, which are made up of two layers: depthwise convolutions and pointwise convolutions. A standard convolution filters and combines inputs into a new set of outputs in a single step, while the depthwise separable convolution splits this layer into two separate layers, one for filtering and one for combining. This results in significantly reduced computation and model size [16]. MobileNet has been pre-trained on the ImageNet dataset [17], which has more than 14 million images to date [18].
Deep learning neural networks can be customized by setting some variables which are known as hyper-parameters. These determine the network structure and how the network is trained. Hyper-parameters include the choice of activation functions, loss functions, probability distribution function, the number of epochs for training, the number of training iterations, the learning rate, optimizers and regularization, and they are tuned to avoid the problems of overfitting and underfitting [19]. These values are set before training and before optimizing weights and biases. Although the use of CNNs and deep learning has resulted in improved accuracies for many tasks, these models also depend on the availability of very large computational resources, which in turn require substantial energy consumption. Hence, these models are costly to train and develop, both financially and environmentally [20].
2.3 Deep Transfer Learning
Deep transfer learning is a deep learning method where a model, initially built for a specific task, is reused as the base for another model performing a different task [21]. Many state-of-the-art deep learning models exist and may be repurposed for other tasks, in other domains.
The use of transfer learning is widespread since models can be built more quickly while providing good accuracy. Using pre-trained models also results in lower hardware requirements compared to those needed to train a model from scratch. There are several factors which need to be considered when determining the best approach to repurpose a model and these include the available computational power and the size and similarity of the dataset.
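To make the computational saving of the depthwise separable convolutions described in Sect. 2.2 concrete, the sketch below counts multiply-add operations for one layer using the cost expressions given in the MobileNet paper [16]. The layer sizes are arbitrary illustrative values, and stride 1 with an output feature map of the same size as the input is assumed.

```python
def standard_conv_cost(k: int, m: int, n: int, f: int) -> int:
    """Multiply-adds of a k x k standard convolution with m input channels,
    n output channels and an f x f feature map."""
    return k * k * m * n * f * f


def depthwise_separable_cost(k: int, m: int, n: int, f: int) -> int:
    """Multiply-adds of a depthwise convolution followed by a pointwise one."""
    depthwise = k * k * m * f * f  # filtering: one k x k filter per input channel
    pointwise = m * n * f * f      # combining: 1 x 1 convolution across channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 feature map.
k, m, n, f = 3, 64, 128, 56
ratio = depthwise_separable_cost(k, m, n, f) / standard_conv_cost(k, m, n, f)
print(f"separable / standard cost: {ratio:.3f}")
```

The ratio simplifies to 1/n + 1/k², so for 3 × 3 kernels the separable layer needs roughly 8 to 9 times fewer multiply-adds than the standard one.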
Figure 2 shows the different steps when using Transfer Learning. As can be seen from the figure, deep learning models are first trained on another dataset for another problem. In this case, the ImageNet dataset which contains 1.2 million images was used. The weights learned during the previous training are transferred to a new training process.
Fig. 2. Overview of transfer learning [29]
2.4 ElasticSearch
ElasticSearch is a real-time distributed search and analytics engine built on Apache Lucene, which is the most advanced, high-performance, and fully featured search engine library [22]. It is used for full-text search, structured search, analytics, or any combination of these. It is used by Wikipedia to provide full-text search with highlighted search snippets, by GitHub to query billions of lines of code and by e-commerce websites for personalization [23]. Although full-text search, analytics systems and distributed systems are not new, ElasticSearch successfully combines them and is easy to use [22]. ElasticSearch can be accessed through a simple RESTful API.
2.5 High Performance Computing
High Performance Computing (HPC) refers to the ability to process data and perform complex calculations at high speeds. HPC solutions have three main components, which are compute, network and storage. A supercomputer is an example of an HPC solution. An HPC architecture comprises compute servers which are networked together into clusters, with each server being called a node. The nodes work in parallel to boost the processing speed. Each cluster is connected to a data storage where the output is captured. For the purpose of this research, only some assets have been tested, but in a real-life deployment many more images will need to be processed and hence high computational resources will be needed. Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks like image rendering. However, the last few years have seen a major development in HPC, with GPUs having evolved from simple graphics cards into a platform for HPC. GPUs have shown much better performance than both multi-core and single-core CPUs for different kinds of applications [24, 25].
3 Design
3.1 Architectural Design
The architecture of the proposed system, VAMDLE, is shown in Fig. 3. First, images from the training dataset are fed to TensorFlow for training using the MobileNet architecture. The output of the training is a file in ‘.pb’ format. This file is then converted to the ‘.tflite’ format, which is faster and has a relatively small model size without any noticeable loss in accuracy. This file is then exported to the Android application, where it is used for real time image recognition. Text detection and recognition are performed using Google’s ML Kit library. The Android application communicates with the Ngrok server, where the REST API is hosted, through HTTPS requests and the response is obtained in JSON format.
Fig. 3. Architectural design of VAMDLE.
3.2 UI and UX Design
Nowadays when designing a system, both user interface (UI) and user experience (UX) have to be considered. According to the ISO standards [26], UX is defined as “the combined experience of what a user feels, perceives, thinks, and physically and mentally reacts to before and during the use of a product or service”. The interface has therefore been designed using all recommended UI and UX standards. Balsamiq WireFrames [27] was used to design all user interfaces and two such wireframes are shown in Fig. 4.
Fig. 4. WireFrames of UI while scanning an asset and displaying the results
4 Implementation and Evaluation
VAMDLE is expected to be used to scan the National ID (NID) of visitors to the building or office, or to scan the asset that users are borrowing or returning at the reception desk of the same office or building. The app should allow receptionists to log in, to register visitors by scanning their NID and having all fields filled in automatically, to check out visitors when they are leaving, to view the time the visitor was registered, to view all visitors who are currently on the premises, to assign an asset to an employee and to record returned assets. The system should be able to identify an asset based on its text label.
4.1 Dataset
For asset management, the assets most commonly borrowed at the reception desk of a company were identified. These are HDMI converters, USB transmitters, meeting room speakers and projectors. Therefore, 30 images of each of these assets were taken from different angles and used to build the dataset which was used for training the transfer learning model.
4.2 Image Pre-Processing and Data Augmentation
Abundant and high quality data are crucial to the successful implementation of different deep learning models [28] and are considered as important as the algorithms themselves. Hence data augmentation is often used in deep learning. Data augmentation refers to the application of one or more deformations to an annotated dataset, which results in new, additional training data [29]. Data augmentation significantly increases the diversity of the data available without actually having to collect new data or take new photos.
This technique can help in resolving data imbalance and can increase the overall accuracy of a model. Therefore, an initial step in this work was data augmentation. There exist different possible augmentation techniques, as shown in Fig. 5, and for the purpose of this project the basic image manipulations shown in Table 1 have been used. These image augmentation techniques were not applied individually on the images; instead, a random combination of these techniques was applied on each image with the objective of generating different images each time.
Table 1. Summary of applied data augmentation techniques.
#    Technique             Parameters
1    Image Flip            Horizontal
2    Random Rotation       Range: 18°
3    Random Translation    Range: −10 to 10 pixels
4    Shear: Right          Range: −10 to 10 pixels
5    Shear: Bottom         Range: −10 to 10 pixels
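The idea of applying a random combination of transformations to each image can be sketched as follows. This is only an illustration under simplifying assumptions: images are represented as nested lists of grey-level values, only the flip and translation rows of Table 1 are covered (rotation and shear are omitted for brevity), and the function names are hypothetical rather than taken from the authors' code.

```python
import random
from typing import List

Image = List[List[int]]  # grey-scale image stored as rows of pixel values


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [list(reversed(row)) for row in img]


def translate(img: Image, dx: int, dy: int, fill: int = 0) -> Image:
    """Shift the image dx pixels right and dy pixels down, padding with fill."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = img[y][x]
    return out


def augment(img: Image, rng: random.Random, max_shift: int = 10) -> Image:
    """Apply a random combination of flip and translation to one image."""
    if rng.random() < 0.5:  # flip roughly half of the time
        img = flip_horizontal(img)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return translate(img, dx, dy)


sample = [[1, 2, 3], [4, 5, 6]]
variants = [augment(sample, random.Random(seed), max_shift=1) for seed in range(5)]
```

Calling `augment` repeatedly with different random states yields several variants of the same source image, which is how a small set of photographs can be expanded into a larger training set.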
Once the dataset was ready, it was split into training and testing datasets. The training dataset was used to train the model and the testing dataset was used to validate and optimize the model by adjusting the hyper-parameters.
4.3 Deep Transfer Learning Model
The training of the deep learning model was done in Ubuntu (Linux), where a virtual environment was created and the TensorFlow pip package was installed together with Python 3. The pre-trained MobileNet was retrained to adapt it to the problem of recognizing images of different assets. Since transfer learning is being used, only the final layer of the neural network was trained. The MobileNet architecture can be configured to produce optimum results by changing the values of its parameters. The parameters are the input image resolution, which can be 128, 160, 192 or 224 px, the relative model size (for example 1.0, 0.75, 0.50 or 0.25) and the number of training steps. The MobileNet was tested with different values of these parameters. Predictably, using higher resolution images results in higher processing times with better performance. However, a balance has to be struck between processing time and performance. The classification accuracy, the area under the receiver operating characteristic curve (AUROC) and the Mean Absolute Error (MAE) were used as performance metrics to assess the deep learning model. It was found that choosing an image resolution of 224, a relative model size of 0.5 and 4000 training steps gave optimal results, with an acceptable amount of time to retrain the model and very accurate results. Figure 6 shows how TensorFlow was integrated into the whole system.
Fig. 5. Image augmentation techniques [30]
4.4 Android Application
Since both visitor and asset management are sensitive and need to be secure, the application can only be accessed with uniquely allocated credentials. Once the user has been authenticated, the various functionalities of the system can be accessed. Visitors are registered by using the mobile device's camera to scan the text on an official identity document of the visitor, which can be a passport or national ID card. Using regular expressions, the name and ID number of the visitor are extracted from the scanned text and used to auto-fill the registration form. Assigning and returning an asset was also implemented with the text reader, which scans the text label on the asset; spelling errors are handled by ElasticSearch in the backend. Because assets are used a lot, their text labels are often damaged or no longer readable. Therefore, an alternative was provided in which the system identifies the asset by its shape using TensorFlow and deep transfer learning. With this technique, there is no need to change the asset's label or buy additional hardware. Employees borrowing assets need to be identified, and this was done by scanning the text on the employee's card and reading the name. Once again, spelling mistakes are handled by ElasticSearch in the backend. This results in a rapid process, since the employee simply needs to show the card to the camera and the system processes the information automatically.

The mobile application was implemented using the Android Studio Integrated Development Environment (IDE). The CameraSource library was used to manage the camera on the phone or tablet running the app. The camera works with a detector which receives frames from the camera at a specified rate. The preview frames are processed as quickly as possible to minimize any lag in the application. The model trained using deep transfer learning is executed by the Android application using the org.tensorflow:tensorflow-lite library. This library attaches a score and a label to each preview frame, which determines the asset that best matches the real-time image. Google Mobile Vision, which is part of the Google ML Kit, was used since it offers a reliable and robust system for reading text from real-time images or video streams. Each time text was detected, the Text Recognition API was used to determine the corresponding text in each block and to segment it into lines and words.

The captured data is sent to the backend as JSON objects through HTTP requests. The Retrofit library was used to ensure a safe connection between the Android application and the REST API. Gson was used to convert Java objects into JSON before sending a request to the API, and to convert the JSON response from the API back into Java objects for use by the application. Spring Boot was used to implement the REST API. ElasticSearch was integrated, and it asynchronously updates its cluster every 24 h to make sure that any CRUD operations performed in the database are reflected in the cluster. The REST API communicates with both the ElasticSearch cluster and the MySQL database to obtain the required information. Spring Tool Suite 3 (STS), an extension of the Eclipse IDE, was used to implement the backend. MySQL-connector was used to connect to the MySQL database at runtime, and the Java Persistence API (JPA) was used to ease the management of data between the Java objects and the database.
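The regex-based extraction of the visitor's name and ID number, and the tolerance to spelling errors on asset labels, can be illustrated with a small sketch. It is written in Python for brevity (the app itself is Java/Android), the field layout `Name:` / `ID No:` and the asset names are hypothetical, and `difflib` from the standard library is used here as a stand-in for the fuzzy matching that the paper delegates to ElasticSearch.

```python
import re
from difflib import get_close_matches

# Hypothetical field layout; real identity documents vary by country.
NAME_RE = re.compile(r"\bName\s*:\s*([A-Za-z][A-Za-z' -]*)")
ID_RE = re.compile(r"\bID(?:\s+No\.?)?\s*:\s*([A-Z0-9]{6,14})\b")

def extract_visitor(ocr_text):
    """Pull the name and ID number out of text returned by an OCR step."""
    name = NAME_RE.search(ocr_text)
    id_no = ID_RE.search(ocr_text)
    return {
        "name": name.group(1).strip() if name else None,
        "id": id_no.group(1) if id_no else None,
    }

def match_asset(label, known_assets):
    """Return the known asset closest to a possibly misread label, or None."""
    hits = get_close_matches(label.lower(), known_assets, n=1, cutoff=0.6)
    return hits[0] if hits else None

text = "NATIONAL IDENTITY CARD\nName: Jane Doe\nID No: A1234567\n"
visitor = extract_visitor(text)
asset = match_asset("Pr0jector-12", ["projector-12", "laptop-07", "tablet-03"])
print(visitor)
print(asset)
```

Returning `None` for a missing field lets the registration form fall back to manual entry instead of auto-filling a wrong value.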
Fig. 6. Integration of Tensorflow into the System
4.5 Evaluation of the Proposed System
The implemented app was compared with some of the most widely used visitor and asset management tools currently on the market. The results are shown in Table 2. The proposed system can be used for both visitor and asset management at the various reception desks or registration kiosks at a site.

Table 2. Evaluation of the Proposed System with Existing Visitor and Asset Management Tools
[Table 2: feature matrix with check marks. Tools compared (visitor management apps and asset management apps): VAMDLE, Envoy, Proxyclick, SwipedOn, TractionGuest, Assetpanda, EZOfficeInventory, GoCodes Asset Management, UpKeep. Features assessed: Visitor Check-in, Dashboard Analytics, Asset Registration, Asset Status Tracking, Authentication.]
5 Conclusion
In this work, a novel visitor and asset management system, VAMDLE, which makes use of deep transfer learning and Elasticsearch, has been presented. An in-depth literature survey was conducted on the state of the art in the field, the findings were analyzed, and an architecture for an efficient model was proposed. The system, which consists of the deep transfer learning model, the text analyzer using ElasticSearch and the mobile application, was then implemented. The effect of different parameter values on the performance metrics and on the time taken to complete a task was compared. An image resolution of 224, a relative model size of 0.5 and 4000 training steps gave the best results, with an acceptable balance of size, speed and accuracy.

As with any system, the proposed system can be improved. First of all, the RESTful web service can be deployed on a more robust and secure platform such as Azure or Amazon Web Services. The Android application can be further improved so that whenever a new asset is to be added, the system retrains itself to include the asset without any manual intervention from the user. The exhaustive literature review has shown that no such solution exists to date. Moreover, the system can be trained to identify fake user IDs and passports. This will help in thwarting fraud and the use of fake documents by visitors.
References
1. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
2. Rawat, W., Wang, Z.: Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput. 29(9), 2352–2449 (2017)
3. Dung, C.V., Anh, L.D.: Autonomous concrete crack detection using deep fully convolutional neural network. Autom. Constr. 99, 52–58 (2019)
4. Fujiyoshi, H., Hirakawa, T., Yamashita, T.: Deep learning-based image recognition for autonomous driving. IATSS Res. 43(4), 244–252 (2019)
5. Fourcade, A., Khonsari, R.H.: Deep learning in medical image analysis: a third eye for doctors. J. Stomatol. Oral Maxillofac. Surg. 120(4), 279–288 (2019)
6. Zejda, D., Zelenka, J.: The concept of comprehensive tracking software to support sustainable tourism in protected areas. Sustainability 11(15), 4104 (2019)
7. Envoy Visitor, Deliveries, and Rooms Management | Envoy. https://envoy.com/. Accessed 14 Feb 2021
8. Proxyclick | Enterprise Visitor Management System. https://www.proxyclick.com/. Accessed 14 Feb 2021
9. Visitor Management System | In and Out Board | Best Sign In App USA. https://www.swipedon.com/. Accessed 14 Feb 2021
10. Visitor Management. https://info.trackforce.com/en-za/visitor-management-software. Accessed 14 Feb 2021
11. Easy and Flexible Asset Tracking Software - Asset Panda. https://www.assetpanda.com/. Accessed 14 Feb 2021
12. Asset Tracking and Management Software - EZOfficeInventory. https://www.ezofficeinventory.com/. Accessed 14 Feb 2021
13. Home - Asset & Inventory Tracking Software. https://gocodes.com/. Accessed 14 Feb 2021
14. CMMS Software by UpKeep CMMS | Try Free. https://www.onupkeep.com/. Accessed 14 Feb 2021
15. O'Shea, K., Nash, R.: An Introduction to Convolutional Neural Networks (2015)
16. Howard, A.G., et al.: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (2017)
17. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
18. ImageNet. http://www.image-net.org/. Accessed 14 Feb 2021
19. G Inc: A Tutorial on Deep Learning Part 1: Nonlinear Classifiers and the Backpropagation Algorithm (2015)
20. Strubell, E., Ganesh, A., McCallum, A.: Energy and policy considerations for deep learning in NLP. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 3645–3650 (2019)
21. Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., Liu, C.: A survey on deep transfer learning. In: Kůrková, V., Manolopoulos, Y., Hammer, B., Iliadis, L., Maglogiannis, I. (eds.) ICANN 2018. LNCS, vol. 11141, pp. 270–279. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01424-7_27
22. Gormley, C., Tong, Z.: Elasticsearch: The Definitive Guide: A Distributed Real-time Search And Analytics Engine, 1st edn. O'Reilly Media, Beijing, p. 724 (2015)
23. Vavliakis, K.N., Katsikopoulos, G., Symeonidis, A.L.: E-commerce Personalization with Elasticsearch. Procedia Comput. Sci. 151, 1128–1133 (2019)
24. Gupta, S., Babu, M.R.: Generating performance analysis of GPU compared to single-core and multi-core CPU for natural language applications. IJACSA 2(5), 108 (2011)
25. Amich, M., Luca, P.D., Fiscale, S.: Accelerated implementation of FQSqueezer novel genomic compression method. In: 2020 19th International Symposium on Parallel and Distributed Computing (ISPDC), pp. 158–163 (2020)
26. ISO - ISO 9241-210:2019 - Ergonomics of human-system interaction - Part 210: Human-centred design for interactive systems. https://www.iso.org/standard/77520.html. Accessed 14 Feb 2021
27. Balsamiq Wireframes - Industry Standard Low-Fidelity Wireframing Software | Balsamiq. https://balsamiq.com/wireframes/. Accessed 14 Feb 2021
28. Sajjad, M., Khan, S., Muhammad, K., Wu, W., Ullah, A., Baik, S.W.: Multi-grade brain tumor classification using deep CNN with extensive data augmentation. J. Comput. Sci. 30, 174–182 (2019)
29. Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 24(3), 279–283 (2017)
30. Shorten, C., Khoshgoftaar, T.M.: A survey on image data augmentation for deep learning. J. Big Data 6(1), 1–48 (2019). https://doi.org/10.1186/s40537-019-0197-0
Wind Speed Time Series Prediction with Deep Learning and Data Augmentation
Anibal Flores, Hugo Tito-Chura, and Victor Yana-Mamani
Universidad Nacional de Moquegua, Moquegua, Peru
Abstract. This paper presents a hybrid model based on the recurrent neural networks known as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) for the prediction of daily wind speed time series in the Moquegua region of Peru. The proposed model, called GRU LSTM GRU LSTM + DA, consists of an architecture of 4 hybrid sequential layers and works on a normalization scale of −1, +1 instead of the traditional 0, +1 scale. It also uses data augmentation (DA) to improve the training process and the prediction results of the model. The results of the proposal are compared with 4 benchmark models (GRU GRU GRU GRU, LSTM LSTM LSTM LSTM, GRU LSTM and GRU LSTM GRU LSTM), showing that the proposal far exceeds the benchmark models in terms of RMSE, RRMSE and MAPE. In the same way, the results achieved in terms of RMSE are compared with the results of related work, showing the superiority of the model proposed in this study.
Keywords: Deep learning · Data augmentation · Wind speed prediction · Time series scaling
1 Introduction
Climate change [1] is a severe threat to the future of humanity, and the consumption of fossil fuels has contributed enormously to this problem; hence the use of renewable energies has become an excellent alternative to mitigate its effects. Renewable energies include wind, geothermal, hydroelectric, tidal, solar, wave power, biomass, and biofuels. In the case of wind energy, electricity is generated with the force of the wind. The windmills in wind farms are connected to electricity generators that transform the wind that turns their blades into electrical energy. Here, the analysis of time series related to wind speed and wind direction plays an important role. Peru has great potential to exploit wind energy, especially along the coastal region that borders the Pacific Ocean. In this study, a hybrid four-layer model, GRU LSTM GRU LSTM, is proposed for the prediction of wind speed time series; the same model produced good prediction results with solar radiation time series in [2]. Some updates are included in the pre-processing stage of the time series, such as the inclusion of a data augmentation phase and, in the normalization or scaling phase, a change of the min/max scale range to −1, +1 instead of 0, +1.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 330–343, 2022. https://doi.org/10.1007/978-3-030-82193-7_22
Regarding data augmentation, it is commonly used to overcome the overfitting problem presented by deep learning models in the training phase, as can be seen in works such as [3–6]. In other cases, data augmentation also helps to overcome the underfitting problem caused by a reduced amount of training data, as in the work of Flores et al. [7]. In this study, the benchmark models GRU GRU GRU GRU, LSTM LSTM LSTM LSTM, GRU LSTM, and GRU LSTM GRU LSTM do not present overfitting problems, much less underfitting. However, the aim of using data augmentation on the wind speed time series is to enrich them in order to achieve better training and therefore better prediction results. Regarding the normalization phase, many works such as [2, 8–12] use the 0, +1 min/max scale. In this work, however, the −1, +1 min/max scale is proposed, because in various experiments it was observed that, for wind speed time series, the −1, +1 scale improves the precision of the predictions of deep learning models such as recurrent neural networks.

This work is structured as follows. Section 2 briefly describes the state-of-the-art works that served as a starting point for the proposal made in this study. Section 3 describes the theoretical background necessary for a better understanding of the paper. Section 4 describes the process followed to implement the proposal. Section 5 describes the results achieved. Section 6 discusses these results, comparing them with similar works of the state of the art. Finally, the conclusions of the study and the improvements that can be made in future work are presented.
2 Related Work
Some related works, arranged chronologically, are briefly described below. Zhang et al. [13] propose a hybrid model based on the Wavelet Transform Technique (WTT), the Seasonal Adjustment Method (SAM) and a Radial Basis Function Neural Network (RBFNN) to predict daily wind speed time series; the results show an RMSE of 0.88. Qureshi et al. [14] propose a model based on a Deep Neural Network (DNN) that implements Meta Regression and Transfer Learning (MRT); the best RMSE achieved for hourly wind speed time series is 0.0953. Bokde et al. [15] propose Ensemble Empirical Mode Decomposition (EEMD) and Pattern Sequence Forecasting (PSF) for the prediction of hourly wind speed time series; the best RMSE achieved is 0.36. Mezaache et al. [16] propose a two-block architecture: in the first block they use an AutoEncoder (AE) network to reduce the dimensionality of the wind speed time series, and in the second block they experiment with an Extreme Learning Machine (ELM) and an Elman Neural Network (ENN); the best RMSE achieved is 3.0506 using AE + ENN. Khodayar et al. [17] propose Interval Probability Distribution Learning (IPDL) to decrease the uncertainties in wind data, together with Restricted Boltzmann Machines (RBM) and a Rough Set Theory neural network to capture unsupervised temporal features from wind speed time series. A 10-min wind speed time series is used for experimentation, with data from 2004 to 2005 for training and data from 2006 for testing; the best RMSE is 11.126 for a 3-h forecast horizon. Li et al. [18] propose a hybrid model based on Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) for the prediction of short-term wind speed time series with feature extraction, using 3500 data points for training and 500 for testing; the best RMSE is 3.0012. Liu et al. [19] propose the use of the Gated Recurrent Unit (GRU) recurrent neural network to predict daily wind speed time series, with 811 items for the training phase and 372 for the testing phase; the best RMSE achieved is 0.9899. Deng et al. [20] propose a bidirectional deep learning architecture based on the Gated Recurrent Unit (GRU) to predict wind speed and direction time series; the best RMSE achieved is 6.75. Wang et al. [21] propose a model based on Wavelet Transformation-Kullback-Leibler to predict hourly wind speed data; the best RMSE achieved is 1.07. Jiang et al. [22] propose a model based on Variable Weights Combined to predict wind speed time series; the results reach an RMSE of 0.2557. Cheng et al. [23] propose an approach called Multi-Objective Salp Swarm Optimizer (MSSO) for 10-min wind speed time series prediction; the results show very good precision (RMSE: 0.3002), surpassing other state-of-the-art techniques. Altan et al. [24] propose a hybrid model based on Long Short-Term Memory (LSTM), Decomposition Methods (DM) and the Gray Wolf Optimizer (GWO) to predict 10-h wind speed time series; the best RMSE is 0.1878. Yan et al. [25] propose a model based on Improved Singular Spectrum Decomposition (ISSD), Long Short-Term Memory (LSTM) and a Deep Belief Network (DBN) to predict hourly wind speed time series; the best RMSE achieved by the proposal is 1.0156. Noman et al. [26] propose a model called NARX (Nonlinear Auto-Regressive Exogenous), which reached an RMSE of 0.3590 in the prediction of 10-min wind speed time series. Luo et al. [27] propose two types of approaches, the Decomposition Ensemble (DE) and Multi-Objective Optimization (MOO); the best RMSE achieved is 0.2348 when predicting 10-min wind speed time series.
3 Background
3.1 Recurrent Neural Networks
A recurrent neural network (RNN) is a type of artificial neural network within Deep Learning [8]. It integrates feedback loops: connections that feed the outputs of a layer back in and combine them with the input data, so that information persists across the steps of a sequence. This makes RNNs applicable to problems such as handwriting recognition, speech recognition, and time series prediction. The best-known recurrent neural network is the Long Short-Term Memory (LSTM), from which different variants were derived, including the Gated Recurrent Unit (GRU).

Long Short-Term Memory (LSTM)
LSTM is a recurrent neural network (RNN) that was designed to address the vanishing gradient problem [8] encountered when training classic RNNs. Unlike standard feedforward neural networks, LSTM has feedback connections. It can process not only individual data points but also streams or sequences of data. For example, LSTM is applicable to tasks such as handwriting and speech recognition [28]. According to Fig. 1, a common LSTM unit is composed of a cell and three gates (input, output and forget). The cell remembers values over arbitrary time intervals, and the gates regulate the flow of information into and out of the cell. LSTM networks are suitable for classification and regression tasks.

Fig. 1. LSTM architecture

From Fig. 1, the cell state C_t can be calculated with the following equations:

\[ f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \tag{1} \]
\[ i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \tag{2} \]
\[ \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \tag{3} \]
\[ C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{4} \]

And the output h_t with the following equations:

\[ o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \tag{5} \]
\[ h_t = o_t \odot \tanh(C_t) \tag{6} \]
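As a worked illustration of Eqs. (1)–(6), the following minimal sketch performs LSTM steps for a scalar input and a scalar state (hidden size 1), so that every product is an ordinary multiplication. The function name `lstm_step`, the parameter layout and the weight values are illustrative assumptions, not taken from the paper.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step for scalar input/state, following Eqs. (1)-(6).

    p maps each gate name to (w_h, w_x, b): the weights applied to h_{t-1}
    and x_t (the two parts of the concatenation [h_{t-1}, x_t]) and the bias.
    """
    def pre(gate):
        w_h, w_x, b = p[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("f"))               # Eq. (1): forget gate
    i_t = sigmoid(pre("i"))               # Eq. (2): input gate
    c_tilde = math.tanh(pre("c"))         # Eq. (3): candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde    # Eq. (4): new cell state
    o_t = sigmoid(pre("o"))               # Eq. (5): output gate
    h_t = o_t * math.tanh(c_t)            # Eq. (6): new hidden state
    return h_t, c_t

# Illustrative weights only (not from the paper).
params = {gate: (0.5, 0.5, 0.0) for gate in ("f", "i", "c", "o")}
h, c = 0.0, 0.0
for x in [0.2, -0.1, 0.4]:
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

With vector-valued states, the scalar products become matrix-vector products and the gate multiplications become element-wise, exactly as in the equations.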
Gated Recurrent Unit (GRU)
The GRU network is a variant of LSTM that combines the forget and input gates into a single update gate. GRUs are a gating mechanism in RNNs [29]; they are simpler than standard LSTM models and have been growing increasingly popular. The main difference from LSTM is the smaller number of parameters, since the GRU does not have an output gate. GRUs have been shown to exhibit even better performance on certain smaller datasets. However, the LSTM is stronger than the GRU, since it can easily perform unbounded counting, while the GRU cannot [30]. Figure 2 shows the architecture of the Gated Recurrent Unit (GRU) network.
Fig. 2. GRU architecture
From Fig. 2, the following equations emerge to calculate h_t:

\[ r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \tag{7} \]
\[ z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \tag{8} \]
\[ \tilde{h}_t = \tanh\left(W x_t + U (r_t \odot h_{t-1}) + b\right) \tag{9} \]
\[ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{10} \]

Where:
W_z, W_r, W, U_z, U_r, U: parameter matrices
b_r, b_z, b: parameter vectors
σ: element-wise sigmoid function
⊙: element-wise multiplication
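Equations (7)–(10) can be traced with a scalar sketch (hidden size 1). The function name `gru_step`, the parameter layout and the weight values below are illustrative assumptions rather than the configuration used in the paper.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x_t, h_prev, p):
    """One GRU step for scalar input/state, following Eqs. (7)-(10).

    p maps "r", "z" and "h" to (w, u, b): input weight, recurrent weight, bias.
    """
    w_r, u_r, b_r = p["r"]
    w_z, u_z, b_z = p["z"]
    w, u, b = p["h"]
    r_t = sigmoid(w_r * x_t + u_r * h_prev + b_r)           # Eq. (7): reset gate
    z_t = sigmoid(w_z * x_t + u_z * h_prev + b_z)           # Eq. (8): update gate
    h_tilde = math.tanh(w * x_t + u * (r_t * h_prev) + b)   # Eq. (9): candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde             # Eq. (10): new state

# Illustrative weights only (not from the paper).
params = {name: (0.5, 0.5, 0.0) for name in ("r", "z", "h")}
h = 0.0
for x in [0.2, -0.1, 0.4]:
    h = gru_step(x, h, params)
print(h)
```

Compared with the LSTM step, there is no separate cell state and no output gate, which is where the reduction in parameters comes from.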
3.2 Data Augmentation
Data augmentation is the artificial generation of synthetic data through perturbations or transformations of the original data [7]. It increases both the size and the diversity of the training data. In computer vision, this technique has become a standard form of regularization, and it is also used to improve performance and overcome the overfitting problem in Convolutional Neural Networks (CNNs). In time series classification, augmentation techniques have generally been created to solve overfitting problems; they include time-warping, rotation, scaling, permutation, jittering, etc. [4]. Some of these techniques can also be applied to time series forecasting; for example, time-warping and jittering are used in [5]. Figure 3 shows some of these techniques applied to wind speed time series.
Fig. 3. Basic data augmentation techniques for time series classification. a) Raw Data, b) Jittering, c) Scaling, d) Rotation, e) Time-Warping
4 Methodology
4.1 Time Series Selection
The daily wind speed time series corresponds to the coastal province of Ilo in the Moquegua Region of Peru, which has considerable potential for the generation of wind energy. The daily time series of wind speed at 50 m covers the period from 1981-01-01 to 2020-12-31 (14610 items) and was obtained from the NASA repository¹ through the POWER Data Access Viewer web tool, at latitude −17.6851 and longitude −71.3515. The data for model training corresponds to the years 1981–2016 (13149 items) and the data for model testing to the years 2017–2020 (1461 items). Figure 4 shows the location.
Fig. 4. Location of Ilo Province in Moquegua Region of Perú.
1 https://power.larc.nasa.gov/data-access-viewer/.
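The item counts quoted for the full series and for the training/testing split follow directly from the calendar, as the short check below shows. The helper name `n_days` is hypothetical; the date boundaries are those given in Sect. 4.1.

```python
from datetime import date

def n_days(first, last):
    """Number of daily observations between two dates, both inclusive."""
    return (last - first).days + 1

total = n_days(date(1981, 1, 1), date(2020, 12, 31))
train = n_days(date(1981, 1, 1), date(2016, 12, 31))
test = n_days(date(2017, 1, 1), date(2020, 12, 31))
print(total, train, test)  # 14610 13149 1461
```

The split is chronological, so the test years are never seen during training, and it corresponds to a 90/10 division of the series.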
4.2 Time Series Imputation
The time series did not present missing data or NA values, so it was not necessary to apply any imputation technique.

4.3 Data Augmentation
In this stage, the algorithm proposed by Flores et al. [7] was used, with a block size of 6 and a sub-block size of 3. The technique proposed in [7] combines two data augmentation techniques used for time series classification: time-warping and jittering. Time-warping increases the length of the time series, but because the synthetic data it generates is linear, it can introduce bias; hence it is combined with jittering. Jittering introduces noise so that the synthetic data is not linear: the synthetic values are generated randomly, taking the start and end values of each sub-block as limits. Figure 5 partially shows the results of this process.
Fig. 5. Wind Speed time series with real and augmented values.
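The combination of time-warping and jittering can be illustrated with a deliberately simplified sketch. It is not the algorithm of [7]: the block-size and sub-block-size parameters are not modelled, and the function name `warp_and_jitter` is hypothetical. The sketch only shows the core idea of lengthening the series with synthetic points that are drawn at random between neighbouring real values instead of being linearly interpolated.

```python
import random

def warp_and_jitter(series, rng):
    """Insert one random synthetic value between each pair of consecutive points.

    Lengthening the series plays the role of time-warping; drawing the new value
    uniformly between its two neighbours (rather than taking their midpoint)
    plays the role of jittering.
    """
    if not series:
        return []
    out = [series[0]]
    for prev, nxt in zip(series, series[1:]):
        lo, hi = min(prev, nxt), max(prev, nxt)
        out.append(rng.uniform(lo, hi))  # synthetic value bounded by the neighbours
        out.append(nxt)                  # real value is kept unchanged
    return out

wind = [3.1, 3.6, 2.9, 3.3, 4.0, 3.8]   # illustrative daily wind speeds (m/s)
augmented = warp_and_jitter(wind, random.Random(0))
print(len(wind), len(augmented))  # 6 11
```

Because the real observations are preserved at the even positions, the augmented series can always be reduced back to the original one.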
4.4 Scaling
For the proposed model, the −1, +1 min/max scale was used at this stage instead of the 0, +1 min/max scale that is commonly recommended or used in many of the state-of-the-art works mentioned above. Min/max scaling is based on Eq. (11):

\[ O_s = \frac{O_i - O_{min}}{O_{max} - O_{min}} \tag{11} \]

Where:
O_s: the scaled value between the min and max values.
O_i: the vector element to be scaled.
O_min: the smallest element in the vector.
O_max: the largest element in the vector.

After 20 runs of the GRU LSTM GRU LSTM model without data augmentation, an average RMSE of 0.5179 was obtained for the −1, +1 scale, while an average RMSE of 0.5192 was obtained for the 0, +1 scale, indicating slightly better precision for the scale proposed in this study. Figure 6 shows a graphical comparison between these two scales.
Fig. 6. Comparison of min/max Scales: 0, + 1 vs −1, + 1.
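Equation (11) as written maps values to the 0, +1 range. A −1, +1 range can be obtained from it by the usual affine extension, new_min + O_s · (new_max − new_min), which is assumed in the sketch below since the paper does not spell it out. The function name `min_max_scale` and the sample values are illustrative.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Scale values with Eq. (11), then map the result to [new_min, new_max]."""
    o_min, o_max = min(values), max(values)
    span = o_max - o_min
    if span == 0:
        raise ValueError("a constant series cannot be min/max scaled")
    return [new_min + (v - o_min) / span * (new_max - new_min) for v in values]

wind = [2.0, 3.0, 4.0, 6.0]  # illustrative wind speeds (m/s)
scaled_01 = min_max_scale(wind)               # traditional 0, +1 scale
scaled_11 = min_max_scale(wind, -1.0, 1.0)    # scale proposed in this study
print(scaled_01)  # [0.0, 0.25, 0.5, 1.0]
print(scaled_11)  # [-1.0, -0.5, 0.0, 1.0]
```

In practice the minimum and maximum would be computed on the training data only and reused for the test data, so that the predictions can be mapped back to m/s with the same constants.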
4.5 Modelling
The GRU LSTM GRU LSTM architecture of the proposed model was implemented with the hyperparameters shown in Table 1.

Table 1. Training and testing data for proposal and benchmark models

Model: GRU LSTM GRU LSTM + DA

Hyperparameters        Values
Hidden neurons         160
Epochs                 100
Optimizer              adam
Drop rate              0.2
Activation function    ReLu
Layers 1, 2, 3 and 4   (40,40,40,40)
Batch size             40
Similar parameters were used for the 4-layer benchmark models: LSTM LSTM LSTM LSTM, GRU GRU GRU GRU and GRU LSTM GRU LSTM. For the two-layer GRU LSTM model, the number of hidden neurons was only 80, since 40 neurons were used for each hidden layer.

4.6 Evaluation
The proposed model is evaluated through the Root Mean Squared Error (RMSE), the Relative RMSE (RRMSE), and the Mean Absolute Percentage Error (MAPE). These metrics are estimated through Eqs. (12), (13), and (14), respectively.

\[ RMSE = \sqrt{\frac{\sum_{i=1}^{n} (P_i - O_i)^2}{n}} \tag{12} \]
\[ RRMSE = \frac{RMSE}{\frac{1}{n}\sum_{i=1}^{n} O_i} \times 100 \tag{13} \]
\[ MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{O_i - P_i}{O_i} \right| \times 100 \tag{14} \]

Where:
n: number of predicted values.
P_i: vector of predicted values.
O_i: vector of observed values (testing data).
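The three metrics of Eqs. (12)–(14) can be computed as in the following sketch. The function names and the sample values are illustrative; MAPE is computed with absolute values, as is standard for this metric, and it assumes that no observed value is zero.

```python
import math

def rmse(pred, obs):
    """Eq. (12): root mean squared error."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def rrmse(pred, obs):
    """Eq. (13): RMSE relative to the mean observed value, in percent."""
    return rmse(pred, obs) / (sum(obs) / len(obs)) * 100.0

def mape(pred, obs):
    """Eq. (14): mean absolute percentage error (observed values must be non-zero)."""
    return sum(abs((o - p) / o) for p, o in zip(pred, obs)) / len(obs) * 100.0

observed = [4.0, 5.0, 2.0, 5.0]    # illustrative wind speeds (m/s)
predicted = [3.0, 5.0, 3.0, 5.0]
print(round(rmse(predicted, observed), 4))   # 0.7071
print(round(rrmse(predicted, observed), 4))  # 17.6777
print(round(mape(predicted, observed), 4))   # 18.75
```

RMSE is expressed in the units of the series, while RRMSE and MAPE are percentages, which is what allows the RRMSE-based quality classes used in the results section.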
5 Results
According to Table 2, the GRU LSTM GRU LSTM + DA model is clearly superior to the benchmark models in the three metrics used for evaluation. The benchmark models, all based on recurrent neural networks, present very similar results; the best of them is GRU LSTM GRU LSTM, which was used as the basis of the proposed model. In terms of average RMSE, the proposed model (RMSE: 0.0876) outperforms the best benchmark model (RMSE: 0.5242) by 0.4366. In terms of average RRMSE, the proposed model outperforms the best benchmark model by 13.0129. Likewise, according to [31] and [32], a model's precision level is excellent if RRMSE < 10%, good if 10% < RRMSE < 20%, fair if 20% < RRMSE < 30%, and poor if RRMSE > 30%. Thus, the proposed model (GRU LSTM GRU LSTM + DA), with an average RRMSE of 2.6185 ± 0.2359 (Table 2), qualifies as an excellent model, surpassing good models such as the benchmark ones.
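The RRMSE-based precision levels quoted from [31] and [32] can be written as a small helper. The function name `rrmse_class` is hypothetical, and because the source leaves the exact boundary values (10%, 20%, 30%) unassigned, placing them in the less favourable class is an assumption of this sketch. The two inputs are the average RRMSE values reported in Table 2.

```python
def rrmse_class(rrmse_percent):
    """Precision level of a model from its RRMSE (in percent)."""
    if rrmse_percent < 10:
        return "excellent"
    if rrmse_percent < 20:
        return "good"
    if rrmse_percent < 30:
        return "fair"
    return "poor"

average_rrmse = {
    "GRU LSTM GRU LSTM": 15.6314,        # best benchmark model (Table 2)
    "GRU LSTM GRU LSTM + DA": 2.6185,    # proposed model (Table 2)
}
levels = {model: rrmse_class(value) for model, value in average_rrmse.items()}
print(levels)
```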
Table 2. Results of benchmark models and proposal model

                                  Predicted days
Model                   Metric  30       50       100      250      500      1000     1461     Avg
GRU GRU GRU GRU         RMSE    0.5218   0.4828   0.5680   0.5761   0.5181   0.5190   0.5146   0.5286 ± 0.0325
                        RRMSE   17.9351  14.9025  16.5702  16.5926  14.7703  14.8961  14.7443  15.7730 ± 1.2630
                        MAPE    16.3485  13.0355  14.1669  13.9296  12.1245  12.3141  12.2761  13.4564 ± 1.5133
LSTM LSTM LSTM LSTM     RMSE    0.5033   0.4748   0.5711   0.5824   0.5224   0.5224   0.5180   0.5277 ± 0.0374
                        RRMSE   17.2994  14.6557  16.6608  16.7721  14.8812  14.9946  14.8436  15.7296 ± 1.1266
                        MAPE    15.7064  0.4748   13.9701  13.8863  12.0400  12.1644  12.1151  11.4795 ± 5.0357
GRU LSTM                RMSE    0.5367   0.4948   0.5850   0.5806   0.5202   0.5196   0.5156   0.5360 ± 0.0342
                        RRMSE   18.4472  15.2745  17.0658  16.7207  14.8190  14.9127  14.7740  16.0019 ± 1.4288
                        MAPE    17.1685  13.5276  14.7068  14.2251  12.2939  12.3979  12.3672  13.8124 ± 1.7658
GRU LSTM GRU LSTM       RMSE    0.5078   0.4728   0.5591   0.5792   0.5191   0.5181   0.5138   0.5242 ± 0.0349
                        RRMSE   17.4528  14.5948  16.3113  16.6807  14.7883  14.8699  14.7220  15.6314 ± 1.1599
                        MAPE    16.0820  12.7549  13.9166  14.0231  12.0868  12.1724  12.1431  13.3112 ± 1.4731
GRU LSTM GRU LSTM + DA  RMSE    0.0895   0.0826   0.0930   0.0935   0.0842   0.0844   0.0865   0.0876 ± 0.0043
                        RRMSE   3.0748   2.5492   2.7139   2.6922   2.3982   2.4221   2.4797   2.6185 ± 0.2359
                        MAPE    2.9858   2.3531   2.4360   2.3315   2.0335   2.0458   2.0849   2.3243 ± 0.3342
In terms of average MAPE, the proposed model outperforms the best benchmark model by 10.9869, which confirms what the other metrics indicated in the preceding paragraphs. Figure 7 shows a graphical comparison of the three metrics analyzed in this work. Figure 8 shows the original wind speed time series and the predictions of the GRU LSTM GRU LSTM and GRU LSTM GRU LSTM + DA models for a prediction horizon of 100 days. The predictions of the proposed model (red line) fit the original values better. It is also worth highlighting how the deviations of the base model are reduced by the data augmentation of the proposed model.
Fig. 7. Metrics comparison of benchmark models and proposal model.
Fig. 8. Comparison of best benchmark model and proposal model for 100 predicted days.
6 Discussion
Table 3 summarizes the RMSEs obtained by different state-of-the-art models in the prediction of wind speed time series. Among the related works, the best RMSE is 0.0953, achieved by Qureshi et al. [14]. The proposed model GRU LSTM GRU LSTM + DA, with an RMSE of 0.0876, improves on this value. It can therefore be affirmed that the model proposed in this study has considerable potential for the prediction of wind speed time series, being able to obtain better results than the state-of-the-art models.
Table 3. Results of related work and proposal model

Work                   Technique                 Frequency  Train      Test       RMSE
Zhang et al. [12]      WTT + SAM + RBFNN         Daily      696        48         0.88
Bokde et al. [14]      EEMD + PSF                Hourly     2160       720        0.36
Mezaache et al. [15]   AE + ENN                  10-min     26000      11000      3.0506
Khodayar et al. [16]   RBM + IPDL                10-min     105120     52560      11.126
Li et al. [17]         CNN + LSTM                15-min     3500       500        3.0012
Liu et al. [18]        GRU                       Daily      811        372        0.9899
Deng et al. [19]       Bi-GRU                    -          400 (a)               6.75
Jiang et al. [21]      VWC                       -          2304       576        0.2557
Wang et al. [20]       EWT + KLD                 Hourly     14016      3504       1.07
Qureshi et al. [13]    DNN + MRT                 Hourly     -          -          0.0953
Yan et al. [24]        ISSD + LSTM-GOADBN        Hourly     600        100        1.0156
Cheng et al. [22]      MSSO                      10-min     2880       720        0.3002
Altan et al. [23]      DM + LSTM + GWO           10-h       4397       775        0.1878
Noman et al. [25]      NARX                      10-min     Data 2017  Data 2018  0.3590
Luo et al. [26]        DE + MOO                  10-min     3200       800        0.2348
Proposal Model         GRU LSTM GRU LSTM + DA    Daily      13149      1461       0.0876

A dash marks a value not given in the table. (a) A single data size is given for this work; whether it refers to training or testing data is not indicated.
7 Conclusion and Future Work
According to the results obtained in this study, it can be concluded that the −1, +1 scale increases the precision of the predictions of the hybrid four-layer model GRU LSTM GRU LSTM. Likewise, data augmentation, which is widely used to overcome the overfitting problem, can also be used to improve the precision of models that do not present overfitting. In this study, according to the average MAPE, it improved the precision of the GRU LSTM GRU LSTM model by 10.9869%, turning a good model into an excellent one according to the RRMSE obtained. The main advantage of the approach proposed in this study for the prediction of wind speed time series lies in the precision of the results achieved. On the other hand, the increase in data for the training phase of the model raises the computational cost considerably. For future work, based on what was observed during the experiments, it is recommended to experiment with larger values of the block-size parameter of the data augmentation technique. It is very likely that values greater than six (6) will yield better results for the GRU LSTM GRU LSTM model.
Likewise, experimentation with data with different frequencies and other architectures based on recurrent neural networks is pertinent.
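As an illustration of the two quantities the conclusion relies on, the following minimal Python sketch scales a series to the range −1, +1 and computes MAPE. The function names and the sample wind speeds are hypothetical, and the paper does not state how its scaling bounds were chosen, so the series minimum and maximum are assumed here.

```python
import numpy as np

def scale_to_unit_range(series, lo=None, hi=None):
    """Linearly map values to [-1, +1]; lo/hi default to the series min/max (assumption)."""
    series = np.asarray(series, dtype=float)
    lo = series.min() if lo is None else lo
    hi = series.max() if hi is None else hi
    return 2.0 * (series - lo) / (hi - lo) - 1.0

def inverse_scale(scaled, lo, hi):
    """Map values from [-1, +1] back to the original units."""
    return (np.asarray(scaled, dtype=float) + 1.0) * (hi - lo) / 2.0 + lo

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean(np.abs((actual - predicted) / actual))

# Made-up wind speeds, used only to exercise the functions.
wind = np.array([2.0, 4.0, 6.0, 10.0])
scaled = scale_to_unit_range(wind)
restored = inverse_scale(scaled, wind.min(), wind.max())
example_mape = mape([2.0, 4.0], [1.0, 5.0])
```

Predictions made on the scaled series would normally be mapped back with `inverse_scale` before error metrics such as MAPE are computed in the original units.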
References
1. McMichael, A.J., Lindgren, E.: Climate change: present and future risks to health, and necessary responses. J. Intern. Med. 270(5), 401–413 (2011)
2. Flores, A., Tito, H., Centty, D.: Comparison of hybrid recurrent neural networks for univariate time series forecasting. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications. IntelliSys 2020. Advances in Intelligent Systems and Computing, vol. 1250, pp. 375–387. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-55180-3_28
3. Yeomans, J., Thwaites, S., Robertson, W.S.P., Booth, D., Ng, B., Thewlis, D.: Simulating timeseries data for improved deep neural network performance. IEEE Access 7, 131248–131255 (2019)
4. Rashid, K.M., Louis, J.: Times-series data augmentation and deep learning for construction equipment activity recognition. Adv. Eng. Inform. 42, 1–12 (2019)
5. Iwana, B.K., Uchida, S.: Time series data augmentation for neural networks by time warping with a discriminative teacher. arxiv.org (2020)
6. Rashid, K.M., Louis, J.: Window-warping: a time series data augmentation of IMU data for construction equipment activity identification. In: 36th International Symposium on Automation and Robotics in Construction (ISARC 2019), Banff, Canada (2019)
7. Flores, A., Tito, H., Apaza-Alanoca, H.: Data augmentation for short-term time series prediction with deep learning. In: Computing Conference, London, United Kingdom (2021, in press)
8. Flores, A., Tito, H., Centty, D.: Recurrent neural networks for meteorological time series imputation. Int. J. Adv. Comput. Sci. Appl. 11(3), 482–487 (2020)
9. Flores, A., Tito, H., Centty, D.: Improving gated recurrent unit predictions with univariate time series imputation techniques. Int. J. Adv. Comput. Sci. Appl. 10(12), 710–714 (2019)
10. Huynh, A.N.-L., Deo, R.C., An-Vo, D.-A., Ali, M., Raj, N., Abdulla, S.: Near real-time global solar radiation forecasting at multiple time-step horizons using the Long Short-Term Memory network. Energies 13(3517), 11–30 (2020)
11. Gürel, A.E., Ağbulut, Ü., Biçen, Y.: Assessment of machine learning, time series, response surface methodology and empirical models in prediction of global solar radiation. J. Clean. Prod. 277, 122353 (2020)
12. Che, Z., Purushotham, S., Cho, K., Sontag, D., Liu, Y.: Recurrent neural networks for multivariate time series with missing values. Sci. Rep. 8(6085), 1–12 (2018)
13. Zhang, W., Wang, J., Wang, J., Zhao, Z., Tian, M.: Short-term wind speed forecasting based on a hybrid model. J. Appl. Soft Comput. 92(106294), 1–20 (2013)
14. Qureshi, A.S., Khan, A., Zameer, A., Usman, A.: Wind power prediction using deep neural network based meta regression and transfer learning. Appl. Soft Comput. 58, 742–755 (2017)
15. Bokde, N., Feijoo, A., Kulat, K.: Analysis of differencing and decomposition preprocessing methods for wind speed prediction. Appl. Soft Comput. 71, 926–938 (2018)
16. Mezaache, H., Bouzgoud, H.: Auto-encoder with neural networks for wind speed forecasting. In: IEEE International Conference on Communications and Electrical Engineering, El Oued, Algeria (2018)
17. Khodayar, M.I., Wang, J., Manthouri, M.: Interval deep generative neural network for wind speed forecasting. IEEE Trans. Smart Grid 10(4), 3974–3989 (2019)
18. Li, G., Wang, T.F., Hu, F.X., Liu, T.C.: Algorithm considering multi-dimensional meteorological feature extraction in short-term wind speed prediction. In: IEEE Information Technology, Networking, Electronic and Automation Control Conference, Chengdu, China (2019)
19. Liu, M., Qiu, P., Wei, K.: Research on wind speed prediction of wind power system based on GRU deep learning. In: IEEE Conference on Energy Internet and Energy System Integration, Changsha, China (2019)
20. Deng, Y., Jia, H., Li, P., Tong, X., Qiu, X., Li, F.: A deep learning methodology based on bidirectional gated recurrent unit for wind power prediction. In: IEEE, Xi’an, China (2019)
21. Wang, J., Li, Y.: An innovative hybrid approach for multi-step ahead wind speed prediction. Appl. Soft Comput. J. 78, 296–309 (2019)
22. Jiang, P., Liu, Z.: Variable weights combined model based on multi-objective optimization for short-term wind speed forecasting. Appl. Soft Comput. J. 82(105587), 1–19 (2019)
23. Cheng, Z., Wang, J.: A new combined model based on multi-objective salp swarm optimization for wind speed forecasting. Appl. Soft Comput. J. 92(106294), 1–20 (2020)
24. Altan, A., Karasu, S., Zio, E.: A new hybrid model for wind speed forecasting combining long short-term memory neural network, decomposition methods and grey wolf optimizer. Appl. Soft Comput. 100, 106996 (2020)
25. Yan, X., Liu, Y., Xu, Y., Jia, M.: Multistep forecasting for diurnal wind speed based on hybrid deep learning model with improved singular spectrum decomposition. Energy Convers. Manage. 225(113456), 1–22 (2020)
26. Noman, F., et al.: Multistep short-term wind speed prediction using nonlinear auto-regressive neural network with exogenous variable selection. Alexand. Eng. J. 60(1), 1221–1229 (2020)
27. Luo, L., Li, H., Wang, J., Hu, J.: Design of a combined wind speed forecasting system based on decomposition-ensemble and multi-objective optimization approach. Appl. Math. Model. 89, 49–72 (2021)
28. Xiangang, L., Xihong, W.: Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition. arxiv.org (2014)
29. Kyunghyun, C., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. 1–15 (2014). arxiv.org
30. Gail, W., Yoav, G., Eran, Y.: On the practical computational power of finite precision RNNs for language recognition. 1–9 (2018). arxiv.org
31. Huynh, A.N.-L., Deo, R.C., An-Vo, D.-A., Ali, M., Raj, N., Abdulla, S.: Near real-time global solar radiation forecasting at multiple time-step horizons using the long short-term memory network. Energies 13(14), 3517 (2020)
32. Liu, M., Qiu, P., Wei, K.: Research on wind speed prediction of wind power system based on GRU deep learning. In: IEEE Conference on Energy Internet and Energy System Integration, Changsha, China (2019)
Evaluation for Angular Distortion of Welding Plate
Shigeru Kato1(B), Shunsaku Kume1, Takanori Hino1, Fujioka Shota1, Tomomichi Kagawa1, Hironori Kumeno1, and Hajime Nobuhara2
1 National Institute of Technology, Niihama College, Niihama, Japan
[email protected]
2 University of Tsukuba, Tsukuba, Japan
[email protected]
Abstract. Welding is essential to daily life, and nurturing welding skills is a pressing need in Japan today. Experts have to evaluate the welds of many beginners, which places a heavy burden on them, so a computational assistant for evaluating beginners’ welding is required. This paper describes a simple system for evaluating plates welded by beginners. The authors consider four typical beginners’ defects: lack of welding metal, linear misalignment, welding metal unevenness, and angular distortion. To capture these defects simultaneously, the authors propose original equipment for photographing the welding plates. The computer extracts only the welding plate region using color markers, and a CNN (Convolutional Neural Network) evaluates the defects. As a first step, the authors address only angular distortion, one of the typical failures of beginners. In the experiment, the authors validated the CNN. The conclusion discusses the experimental results and future work.
Keywords: Welding joint · Angular distortion · Image processing · CNN
1 Introduction
Welding is essential to the infrastructure of daily life [1], such as buildings, vehicles, and water pipes. It is therefore crucial to retain people with high welding skills in Japan, and researchers have developed educational systems for nurturing welding skills [2, 3]. A welding simulator for training has also been proposed [4]. However, as illustrated in Fig. 1(a), an expert must judge numerous welded plates to decide whether to grant a welding license. In addition, judgments are subjective and differ among experts, so the evaluations can differ, as in Fig. 1(b). Such differences sometimes cause disputes between the evaluators and the examinee. Therefore, in our previous study [5], we began to develop a computational system that evaluates welding plates, as shown in Fig. 2.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 344–354, 2022. https://doi.org/10.1007/978-3-030-82193-7_23
Fig. 1. Problems in human subjective evaluation.
Fig. 2. Welding evaluation system in our previous study.
In our previous system, shown in Fig. 2, a convolutional neural network (CNN) [6] evaluates the welding joint as good or bad. We confirmed that the proposed CNN worked well. However, a human had to extract the welding area by hand, as depicted in Fig. 2. We therefore adopted R-CNN (Region-Based Convolutional Neural Network) to automatically extract only the welding joint area, as in Fig. 3 [7]. Although R-CNN could capture the welding joint area appropriately, it required a large number of training images and did not capture the area stably. We therefore decided to find another, more stable method than R-CNN for extracting only the welding plate area from the image.
Fig. 3. R-CNN could extract welding joint area automatically.
2 Equipment and CNN
Figure 4 shows an example of a good welding joint without defects.
Fig. 4. Good welding joint plate with no defects.
Experts evaluate a welding plate mainly with respect to four defects: insufficient metal on the joint line, as in Fig. 5(a); linear misalignment, as in Fig. 5(b); joint metal unevenness, as in Fig. 5(c); and angular distortion, as in Fig. 5(d). These defects are typical failures of beginners.
Fig. 5. Typical welding joint defects by beginners.
To capture the defects shown in Fig. 5 simultaneously, we photograph the welding plate from a point at a 30° angle above the bottom line, as shown in Fig. 6.
Fig. 6. Equipment to take welding plate picture.
Instead of R-CNN detection, we employ a stable method to extract only welding plate area images from the background using pink color markers, as shown in Fig. 7.
Fig. 7. Welding plate is placed along with the pink markers.
In the present paper, we focus on evaluating angular distortion [8] as the first step. Angular distortion is illustrated in Fig. 8. In Fig. 8(a), the plate is bent largely (Bad). In Fig. 8(b), the plate is bent slightly (Neutral). In contrast, in Fig. 8(c), the plate is flat (Good).
Fig. 8. Angular distortion.
To keep the lighting conditions constant, the equipment is enclosed in a box, as shown in Fig. 9. An LED light is attached to the ceiling inside the box.
Fig. 9. Equipment is inside of the black box.
Figure 10 illustrates the automatic plate extraction process and the CNN configuration. First, the metal welding plate area is extracted from the picture, as shown in Fig. 10(a) and (b). The extracted welding plate image is then resized to 227-by-227 pixels, as shown in Fig. 10(c). The resized image is input to the proposed CNN in Fig. 10(d). We employ AlexNet [9] to evaluate the angular distortion of the welding plate.
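The extraction and resizing steps can be sketched in Python with NumPy as follows. This is only an illustrative sketch, not the authors’ implementation: the paper does not give the marker colour thresholds or the cropping rule, so the RGB thresholds, the bounding-box crop between the markers, the nearest-neighbour resize, and all function names are assumptions.

```python
import numpy as np

def marker_mask(image, min_red=200, max_green=150, min_blue=120):
    """Boolean mask of pixels that look pink under assumed RGB thresholds."""
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    return (r >= min_red) & (g <= max_green) & (b >= min_blue)

def crop_between_markers(image):
    """Crop the image to the bounding box spanned by all marker pixels."""
    mask = marker_mask(image)
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        raise ValueError("no marker pixels found")
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def resize_nearest(image, size=227):
    """Nearest-neighbour resize to size x size (227 matches the input size in the text)."""
    h, w = image.shape[:2]
    row_idx = np.arange(size) * h // size
    col_idx = np.arange(size) * w // size
    return image[row_idx][:, col_idx]

# Synthetic photo: grey background with two pink markers at opposite corners.
photo = np.full((100, 200, 3), 128, dtype=np.uint8)
photo[20:25, 30:35] = (255, 105, 180)
photo[80:85, 170:175] = (255, 105, 180)

plate = crop_between_markers(photo)
cnn_input = resize_nearest(plate)
```

A fixed colour threshold is workable here mainly because the enclosure keeps the lighting constant; under varying light, thresholds in a hue-based colour space would likely be more robust.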
Fig. 10. Welding plate extraction and CNN configuration.
Transfer learning [10] is adopted to tune the connection weights between fc7 and fc8. The proposed CNN outputs one of three angular distortion levels: “Good” (i.e., flat), “Neutral” (slightly bent), or “Bad” (largely bent).
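Tuning only the weights between fc7 and fc8 amounts to training a three-class softmax layer on top of frozen features. The paper does this with AlexNet in MATLAB; the NumPy toy sketch below only illustrates the idea. Random 16-dimensional vectors stand in for the fc7 activations, and the function names, feature size, and learning rate are assumptions rather than the authors’ settings.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(W, b, X, y):
    """Mean cross-entropy of the linear softmax layer on (X, y)."""
    probs = softmax(X @ W + b)
    return -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))

def train_last_layer(X, y, n_classes=3, lr=0.01, momentum=0.9,
                     epochs=50, batch=8, seed=0):
    """SGD with momentum on the final layer only; the features X stay frozen."""
    rng = np.random.default_rng(seed)
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    vW, vb = np.zeros_like(W), np.zeros_like(b)
    for _ in range(epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            idx = order[start:start + batch]
            grad = softmax(X[idx] @ W + b)
            grad[np.arange(len(idx)), y[idx]] -= 1.0  # d(loss)/d(logits)
            vW = momentum * vW - lr * (X[idx].T @ grad) / len(idx)
            vb = momentum * vb - lr * grad.mean(axis=0)
            W += vW
            b += vb
    return W, b

# Synthetic stand-in for frozen fc7 features of 56 training images.
rng = np.random.default_rng(1)
labels = rng.integers(0, 3, size=56)          # 0: Good, 1: Neutral, 2: Bad
centers = rng.normal(size=(3, 16))
features = centers[labels] + 0.1 * rng.normal(size=(56, 16))

loss_before = cross_entropy(np.zeros((16, 3)), np.zeros(3), features, labels)
W, b = train_last_layer(features, labels)
loss_after = cross_entropy(W, b, features, labels)
```

Freezing the earlier layers keeps the number of trainable parameters small, which suits a data set of only a few dozen images.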
3 Experiment
First, we took the eight photographs shown in Fig. 11 at a local welding school in our city.
Fig. 11. Photographs of welding plates taken in welding school.
On another day, we photographed other welding plates in our laboratory, where we have 29 welding plates. To obtain more training images, we photographed each of the 29 plates twice, swapping its front and tail, as shown in Fig. 12. Consequently, we took 58 (29 × 2) pictures in our laboratory.
Fig. 12. Same welding plates rotating front and tail.
In total, we therefore obtained 66 (8 + 58) welding plate pictures. As displayed in Fig. 13, Good (flat), Bad (angularly distorted), and Neutral (slightly distorted) welding images are extracted and resized correctly.
Fig. 13. Example of images extracted and resized.
The welding plate was extracted automatically and successfully from all 66 pictures using the pink markers. We classified the images into Good (21 images), Bad (34 images), and Neutral (11 images), as shown in Fig. 14.
Fig. 14. Example of extracted images: (a) Good, (b) Bad, and (c) Neutral.
4 Validation of CNN
To validate the proposed CNN in Fig. 10(d), we used the 66 images (21 Good, 34 Bad, and 11 Neutral). Four-fold cross-validation [11, 12] is conducted using Data Sets (1), (2), (3), and (4), as illustrated in Fig. 15. In the training phase, transfer learning [10] is employed. Table 1 lists the conditions for transfer learning of the CNN and the numbers of training and test data for every validation data set. We used the Deep Learning Toolbox [13] of MATLAB. For each validation data set (1) to (4) in Fig. 15, 10 images (5 Good, 5 Bad) not used for training were input to the trained CNN. The accuracies were 80%, 100%, 90%, and 60%, respectively, and the mean accuracy was 82.5%. Figure 16 shows the confusion matrix for each validation data set. “Good” or “Bad” plates are misjudged as “Neutral.” Since the “Neutral” evaluation involves fuzziness [14] between “Good” and “Bad,” the CNN may become confused much as a human would. Several studies have attempted to evaluate welding joint defects using CNNs [15]. However, they deal with joints welded by professionals or machines, whereas our study focuses on typical failures of beginners.
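The reported mean can be checked directly from the four per-set accuracies. The short Python snippet below does so and also converts each accuracy into a count of correctly judged test images, given the 10 test images per data set stated above; the variable names are illustrative only.

```python
# Accuracies reported for validation data sets (1) to (4), in percent.
fold_accuracy_percent = [80, 100, 90, 60]
test_images_per_fold = 10

# Mean accuracy over the four validation data sets.
mean_accuracy = sum(fold_accuracy_percent) / len(fold_accuracy_percent)

# Number of correctly judged test images implied by each accuracy.
correct_per_fold = [a * test_images_per_fold // 100 for a in fold_accuracy_percent]
total_correct = sum(correct_per_fold)
total_tested = test_images_per_fold * len(fold_accuracy_percent)
```

Because every data set has the same number of test images, the mean of the four accuracies equals the pooled accuracy over all 40 test judgments.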
Fig. 15. CNN validation data sets.
Table 1. Transfer learning and CNN test settings for all validation data set.
Parameter | Value/Condition
Solver | sgdm
Learning rate | 0.0001
Max epochs | 50
Mini batch size | 8
Total iterations | 350 = 50 * 56/8
Number of train data | 56
Number of test data | 10
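The total iteration count in Table 1 follows from the other settings: each epoch processes the 56 training images in mini-batches of 8. A small Python check is given below; the helper name is hypothetical, and it assumes that any incomplete final batch would be dropped, which makes no difference here because 56 is divisible by 8.

```python
def total_iterations(n_train, batch_size, max_epochs):
    """Iterations = epochs x full mini-batches per epoch (partial batches dropped)."""
    return max_epochs * (n_train // batch_size)

iterations_per_epoch = 56 // 8
iterations = total_iterations(n_train=56, batch_size=8, max_epochs=50)
```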
Fig. 16. Confusion matrix for each validation data set.
5 Conclusions
This paper described a simple system for evaluating beginners’ welding. We constructed equipment to photograph the welding plate and developed a stable method that automatically extracts the welding plate area from the background using pink color markers. The extracted images are evaluated by a CNN, which could evaluate angular distortion properly. In the present paper, we deal only with “angular distortion.” However, it is necessary to evaluate other defects such as “less metal on joint” as in Fig. 5(a),
“joint step” as in Fig. 5(b), and “joint metal unevenness” as in Fig. 5(c). In future work, we will evaluate these other failures in addition to angular distortion.
Acknowledgments. This research is supported by a grant from the Japan Welding Engineering Society.
References
1. Niles, R.W., Jackson, C.E.: Weld thermal efficiency of the GTAW process. Weld. J. 54, 25–32 (1975)
2. Asai, S., Ogawa, T., Takebayashi, H.: Visualization and digitation of welder skill for education and training. Weld. World 56, 26–34 (2012)
3. Hino, T., et al.: Visualization of gas tungsten arc welding skill using brightness map of backside weld pool. Trans. Mat. Res. Soc. Japan 44(5), 181–186 (2019)
4. Byrd, A.P., Stone, R.T., Anderson, R.G., Woltjer, K.: The use of virtual welding simulators to evaluate experimental welders. Weld. J. 94(12), 389–395 (2015)
5. Kato, S., Hino, T., Yoshikawa, N.: Fundamental study on evaluation system of beginner’s welding using CNN. In: Lecture Notes in Networks and Systems, vol. 96, pp. 821–827 (2019)
6. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
7. Kato, S., Hino, T., Kumeno, H., Kagawa, T., Nobuhara, H.: Automatic detection of beginner’s welding joint. In: Proceedings of 2020 Joint 11th International Conference on Soft Computing and Intelligent Systems and 21st International Symposium on Advanced Intelligent Systems, pp. 465–467 (2020)
8. Mochizuki, M., Okano, S.: Effect of welding process conditions on angular distortion induced by bead-on-plate welding. ISIJ Int. 58(1), 153–158 (2018)
9. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems (NIPS 2012), pp. 1097–1105 (2012)
10. Shin, H.C., et al.: Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 35(5), 1285–1298 (2016)
11. Priddy, K.L., Keller, P.E.: Artificial Neural Networks - An Introduction, Chapter 11 Dealing with Limited Amounts of Data, pp. 101–102. SPIE Press, Bellingham (2005)
12. Wong, T.-T.: Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recogn. 48(9), 2839–2846 (2015)
13. MathWorks: Transfer Learning Using AlexNet. https://jp.mathworks.com/help/deeplearning/ug/transfer-learning-using-alexnet.html;jsessionid=10ee690544b1eb830e5dc2412cf0?lang=en. Accessed 27 Feb 2021
14. Zadeh, L.: Fuzzy sets. Inf. Control 8, 338–353 (1965)
15. Park, J.-K., An, W.-H., Kang, D.-J.: Convolutional neural network based surface inspection system for non-patterned welding defects. Int. J. Precis. Eng. Manuf. 20(3), 363–374 (2019)
A Framework for Testing and Evaluation of Operational Performance of Multi-UAV Systems
Mrinmoy Sarkar1, Xuyang Yan1, Shamila Nateghi1, Bruce J. Holmes2, Kyriakos G. Vamvoudakis3, and Abdollah Homaifar1(B)
1 North Carolina A&T State University, 1601 East Market Street, Greensboro, NC 27401, USA {msarkar,xyan}@aggies.ncat.edu, {snateghiboroujeni,homaifar}@ncat.edu
2 Alakai Technologies Corporation, Hopkinton, USA [email protected] https://www.skai.co/
3 Georgia Institute of Technology, 270 Ferst Drive, NW, Atlanta, GA 30332-0150, USA [email protected]
Abstract. In this paper, we propose a data-driven testing and evaluation framework for multi-UAVs to evaluate their performance in executing missions in the physical world. Seven micro-behaviors, termed here as modes of operation, are leveraged to describe the autonomous functionalities of the UAVs. These functionalities are then used to design five scenarios for model training, validation and testing of the proposed framework. Each scenario includes a distinct sequence of behaviors for the UAVs in order for the different autonomous functionalities to be evaluated. We develop and implement a simulation environment using the Robot Operating System (ROS), Gazebo, and the Pixhawk autopilot to generate synthetic data for the training of a classification model. This trained model is then utilized to evaluate the behaviors of the UAVs while performing real-world missions. Finally, the proposed framework is tested using synthetic data generated from a simulation environment and validated using real-world data.
Keywords: Test and evaluation · Multi-UAV testing · Autonomous behavioral testing · Cognitive systems · Physical flight testing · Bi-LSTM · ROS
1 Introduction
Recent developments in Unmanned Aerial Systems (UASs) introduce new challenges for the safety, verification and validation, and efficiency of Advanced Air Mobility (AAM) operating capabilities, and serve as technical foundations for the concept of Urban Air Mobility (UAM) [13]. In contrast to legacy air transportation systems, UAM aims to enable the growth of increasing traffic congestion.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 355–374, 2022. https://doi.org/10.1007/978-3-030-82193-7_24
The UAM initiatives for safe airspace services seek to support the projected growth by incorporating UASs into the development of future urban traffic networks. The National Aeronautics and Space Administration (NASA) projects that approximately 500 million package deliveries and 750 million air metro services will be accomplished by UASs by 2030 [20]. The upcoming revolution in urban air traffic not only provides a promising solution for traditional transportation systems, but also raises many concerns about the safety of UAM services, especially in dense environments. Given the deployment of large-scale UASs, the testing, verification, and evaluation of such systems has become one of the most important steps for ensuring safety and reliability. In general, testing and evaluation is a core step in the deployment of autonomous systems, especially unmanned ones [4]. In the early development of UASs, the testing community tested these systems with methodologies developed for manned systems [22]. However, there are significant differences between autonomous UASs and manned systems, which require the development of new approaches. The fundamental differences lie in the role of autonomy in the decision-making process. It is therefore necessary to develop new methodologies that are capable of testing the entire decision process of a UAS without biasing the system toward default human solutions [25]. As a result, a data-driven approach is more suitable than human cognitive approaches for developing a UAS testing system that is adaptive and evolves over time.
The key challenges for testing and evaluation of UASs are [17, 22]: (i) establishing which characteristics to observe, i.e., the environment, characteristics of the system itself, or characteristics of different threats in the environment; (ii) developing metrics for each characteristic, such as the tilt or height of a wall, GPS coordinates, or the motion of dynamic obstacles in the environment; and (iii) providing standards for the metrics in terms of numeric specifications, such as the maximum height of a wall or the exact location of a static obstacle. Toward this direction, the contributions of the present paper are threefold. First, we develop a novel data-driven framework to potentially mitigate many of these fundamental challenges. The proposed framework incorporates an external observer, which perceives the behavior of the UAVs, and employs an evaluator to automatically assess the UAVs’ performance. We define micro-behaviors for UAVs, primarily the modes of operation, to design different scenarios for the UAVs and quantify their autonomous capabilities for a particular mission from these scenarios. A classification model is then employed to learn the UAVs’ behavior and predict their performance in executing a mission using the collected sensor measurements. Second, we develop a simulation environment to generate synthetic data and train the classification model for evaluating the behaviors of the UAVs. Third, we implement the testing framework in our indoor flight testing environment with real-time data and test multiple UAVs simultaneously. The remainder of this paper is structured as follows. The background of testing autonomous systems is described in Sect. 2. The problem at hand is described in Sect. 3. Section 4 discusses the details of the proposed testing framework. The development of the simulation environment and synthetic data
generation procedure are described in Sect. 5. The details regarding the development of the indoor flight testing facility are discussed in Sect. 6 and the performance of the proposed method is presented in Sect. 7. Finally, concluding remarks and future work are provided in Sect. 8.
2 Literature Review
An autonomous unmanned system [25] refers to any system that acquires data from a sensor, perceives information, compares the information against its previous knowledge, and makes a decision based upon this information. In [22], a UAS is defined as a system that is capable of performing tasks in the world by itself without explicit human control, as well as a system that senses, understands, and acts upon the environment in which it operates. In [25], the authors provided a comprehensive introduction to testing the intelligence of an autonomous system. To test any system effectively, a tester or testing system requires: (1) the autonomous system under test (SUT); (2) the documentation associated with operation and maintenance of the SUT; and (3) the specification against which the system will be tested. First, the platform needs to be tested for structural integrity and controllability. Second, the communication among the individual components of the system needs to be tested. Third, hardware-in-the-loop testing followed by vehicle-in-the-loop testing can be conducted, and finally field testing is required to verify and validate the previous testing steps. In our proposed testing framework, we assume the first step has been completed and aim to provide a unified data-driven solution for the final two steps. An introductory framework for simulation-based test and evaluation of autonomous vehicles has been proposed in [10,18]. Based on virtual reality (VR) and augmented reality (AR) technologies, the framework is developed to integrate the testing facility with the development process of an autonomous system. In [2], a quantitative method for assuring coordinated autonomy is proposed. The testing and evaluation could be quantified with a reliability engineering approach. The authors adopted probabilistic model checking to assure the autonomy of a coordinated multi-robot mission.
Primarily the system is suitable for multi-robot model checking, but not applicable to test autonomous systems in the physical world when data from the autonomous system is not accessible. A mission-based test and evaluation framework of UAS has been proposed to simulate innovative concepts and applications across the Department of Defense in [6]. It is limited to those UASs that are developed for military missions while UASs are simultaneously evolving across numerous civilian applications. In [21], a specialized testing and evaluation of autonomous surface vehicles is proposed. The authors provided six different scenarios for an autonomous swarm of surface vehicles and developed quantitative metrics for each scenario. Since their approach is specific to the unmanned surface vehicle, it is difficult to extend those techniques to other autonomous systems such as unmanned ground and aerial vehicles.
An experimental test and evaluation framework is proposed for autonomous underwater vehicles in [15]. It defines the capabilities of the autonomous system initially and then designs appropriate scenarios to test those capabilities in the physical environment rather than a simulation environment. Our approach follows a procedure similar to that of [15] to test and evaluate UASs. However, [15] assumes that the SUT is a white-box system, while the proposed framework considers the SUT as a black-box system. Generally, a system is considered a white-box system when human operators have full knowledge of its internal dynamics or structure. Conversely, a black-box system refers to any system with a completely unknown internal structure. A detailed test and evaluation approach for autonomous unmanned ground vehicles (UGVs) is provided in [24]. The authors presented a scientific and comprehensive design approach to test different autonomous modules separately, or to test the autonomous system as a whole. The fuzzy comprehensive evaluation method along with the analytic hierarchy process (AHP) is used to quantitatively evaluate the individual modules and the overall technical performance of autonomous ground vehicles. However, their approach also depends on the white-box assumption of the SUT. A description of the testing and evaluation of micro-air vehicles (MAVs) for both the physical realm and the behavioral realm is provided in [19]. According to the author, testing the physical capability of a MAV is relatively easier than testing the autonomous behavior of the MAV. Although the work provides a scientific description of what needs to be tested, there is no quantitative or qualitative description of how autonomous behaviors can be tested. In [23], the authors developed a data-driven framework to predict the autonomous behavior of a UAV and validated it using five behaviors.
However, in this paper, we extend it significantly by adding two complex behaviors of the autonomous agent and developing five different scenarios. More importantly, we propose a new metric for assignment to each autonomous agent after the mission completion in the testing framework. Although the decision tree classifier presents good performance for predicting the behaviors of UAVs in [23], it ignores the time dependency among the behaviors of UAVs and thus shows poor performance concerning the prediction of the two new behaviors developed in this paper. Hence, the Long Short-Term Memory (LSTM) based classifier is employed in the proposed testing framework to improve the prediction performance for the behavior of UAVs by exploring the temporal relationship. The details of the proposed framework are described in Sect. 4.
3 Problem Description
Before stating the problem, we introduce the basic terminology that will be used throughout the paper.
3.1 Terminology
Modes of Operation: A behavior pattern that can be described visually or mathematically, such as vertical takeoff (vTakeoff), vertical landing (vLand), or Search.
Scenario: A formal description of a UAV’s expected behavior from the start of a mission to its end, such as searching an area for specific targets.
External Observer: A combination of different sensors that are mounted not on the UAV agents but in the UAVs’ operational environment. This ensures that all UAV agents are in the field of view of the sensory system, for example fully synchronized high-definition, high-resolution video cameras or a radar system, which tracks the UAVs and estimates the dynamics of the UAV agents (SUTs).
3.2 Problem Statement
The problem can be stated generally as follows: how can we automatically infer whether an autonomous UAV agent is performing well or exhibiting undesirable behavior in a multi-UAV or single-UAV mission by observing the agents with an external sensory system? Suppose that a UAV is assigned to a search mission. The objectives of the UAV are to search the whole area without colliding with any obstacle and to detect all the target objects in the given search region. After conducting real-world experiments with the developed UAV, the question we focus on in this paper is how to assign a numerical value that evaluates the UAV’s performance in conducting the mission. To formulate this problem, we define $M \in \mathbb{N}$ as a discrete set of modes of operation and $S \in \mathbb{N}$ as a scenario composed of a subset of $M$. By observing the motions of the UAV agents in a physical test environment using an external observer, the objective is to predict the different modes in $S$. Let $z \in \mathbb{R}^n$ be the external observer measurement and $Z = [z_t, z_{t-1}, \cdots, z_{t-\tau}]$ be a sequence of measurements from the previous $\tau \in \mathbb{N}$ time steps, where $z_t$ is the current measurement and $t \in \mathbb{N}$ is the current time stamp. Each mode of operation in $S$ can be predicted using $Z$ such that $s = f(Z)$, where $s \in S$ and $f(\cdot)$ is a nonlinear function of $Z$ that maps the motion history of the UAV to the mode of operation. By predicting $s$ at each time stamp, we can assign a confidence score $\alpha$ to each UAV at the end of the mission, which reflects the overall performance of the UAV during the entire mission.
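To make the notation concrete, the Python sketch below builds the history $Z$ for every time stamp from a sequence of measurements and computes one possible score from predicted and expected modes. It is a sketch under stated assumptions: the function names are hypothetical, and the score shown (the fraction of time stamps whose predicted mode matches the expected one) is only an illustrative stand-in, since the definition of $\alpha$ is not given in this section.

```python
import numpy as np

def history_windows(measurements, tau):
    """Return Z_t = [z_t, z_{t-1}, ..., z_{t-tau}] for every t >= tau."""
    z = np.asarray(measurements, dtype=float)
    return np.stack([z[t - tau:t + 1][::-1] for t in range(tau, len(z))])

def agreement_score(predicted_modes, expected_modes):
    """Assumed stand-in for alpha: share of time stamps with a matching mode."""
    predicted = np.asarray(predicted_modes)
    expected = np.asarray(expected_modes)
    return float(np.mean(predicted == expected))

# Five time stamps of a made-up 2-dimensional observer measurement (n = 2).
measurements = np.arange(10).reshape(5, 2)
Z = history_windows(measurements, tau=2)

# Mode names taken from the terminology above; the sequences are made up.
predicted = ["vTakeoff", "Search", "Search", "vLand"]
expected = ["vTakeoff", "Search", "vLand", "vLand"]
score = agreement_score(predicted, expected)
```

Each window would be passed to the classifier $f(\cdot)$ to obtain a mode prediction for that time stamp.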
4 Proposed Framework
In this section, the proposed framework and its major components are described.
4.1 Overview of the Proposed Framework
As shown in Fig. 1, the proposed framework considers a physical testing environment that contains different objects such as boxes, artificial trees, blowers, and different light sources. The purpose of these objects is to create different types of real-world environments for testing. The environment is equipped with tracking devices, such as high-definition, fully synchronized, high-resolution video cameras, previously defined as the External Observer. The purpose of the tracking devices is to estimate the motion of the UAV agents (position: $x, y, z$; orientation: $\phi, \theta, \psi$; linear velocity: $\dot{x}, \dot{y}, \dot{z}$; and angular velocity: $\dot{\phi}, \dot{\theta}, \dot{\psi}$; all $\in \mathbb{R}$) while they are operating in a mission. Using the estimated motion data, the Perception Inference Engine (PIE) block predicts the current mode of operation of each UAV agent. The Evaluator block compares the predicted modes with the expected modes from the True Scenario block. At the end of the mission, the Evaluator block provides a confidence score for the performance of each UAV. The following requirements must be satisfied in the proposed framework:
1. All the UAVs are assumed to be autonomous and will execute a well-defined scenario.
2. During the execution of the scenario, all UAVs must be in the field of view of the Observer module shown in Fig. 1.
Fig. 1. In the testing framework, the observer module tracks all the UAVs individually and provides the 6D pose vector $(x, y, z, \phi, \theta, \psi) \in \mathbb{R}$. The 6D pose vector goes through a time derivative operation to generate the velocities. In the next step, the 12D vector $X = [x, y, z, \phi, \theta, \psi, \dot{x}, \dot{y}, \dot{z}, \dot{\phi}, \dot{\theta}, \dot{\psi}]$ and a history of these measurements are used as the feature representation in the Perception Inference Engine (PIE), which predicts the current mode of operation of each UAV. The predicted mode of operation is used by the evaluator along with the true scenario to generate a confidence score for each UAV in performing a defined mission. The confidence score is a measure of how well a UAV performs a mission that consists of several modes.
4.2 Modes of Operation
Hold: The UAV stays on the ground plane.
vTakeoff: The UAV flies vertically upwards until it reaches a predefined altitude.
Hover: The UAV stays at a fixed altitude and (x, y) coordinate.
vLand: The UAV flies vertically downwards until it reaches the ground plane.
Search: The UAV primarily moves in the (x, y) plane at a fixed altitude.
Loiter: The UAV follows a circular trajectory of fixed radius at a fixed altitude, centered on the detected (x, y) coordinates of the target object.
Obstacleavoid: The UAV follows a collision-free trajectory that depends on the properties of the obstacle. Two possible collision-free trajectories are shown in Fig. 2(g).
Each mode of operation is shown graphically in Fig. 2. These seven modes serve as a generalized description of a UAV’s behaviors and are not mission-specific. For example, the Loiter mode indicates the detection of any target irrespective of its geometry or other properties.
Fig. 2. Graphical representation of the proposed seven modes of operation: (a) Hold, (b) vTakeoff, (c) Hover, (d) vLand, (e) Search, (f) Loiter, (g) Obstacleavoid.
4.3 Scenarios
Scenario-1: The UAV takes off from a predefined position, climbs to a user-defined altitude, and then hovers for a specific period of time. Finally, it moves slightly in the positive x direction and lands at that location. The scenario is designed to test whether a UAV can fly autonomously.
Scenario-2: The UAV takes off from a predefined location, climbs to a predefined altitude, and then hovers for a specific period of time. Afterwards, it scans a rectangular area using a lawnmower-type search pattern. The UAV lands after finishing the scan. In this scenario, we test the UAV’s capability to search an area.
Scenario-3: This is an extended version of Scenario-2. Here, some target objects are placed on the ground plane and should be detected by the UAV while it scans the area. The objective of the scenario is to check whether the UAV’s perception module for target detection works properly. In this scenario, we expect to see the Loiter mode from the UAV, since the Loiter mode indicates that the UAV has detected the target.
Scenario-4: This is also an extended version of Scenario-2. Here, we put a static object in the nominal trajectory of the UAV so that it needs to avoid the obstacle in its path. This scenario is designed to test whether the UAV can avoid obstacles.
Scenario-5: This is a combination of Scenario-3 and Scenario-4. The test environment has both the target object and the static obstacle. This scenario is designed to test all the capabilities that are tested individually in the other four scenarios.
The mode transition diagram of each scenario is shown in Fig. 3.
Fig. 3. Mode transition diagram of the five scenarios (panels: Scenario-1 to Scenario-5).
4.4 Perception Inference Engine (PIE)
The Perception Inference Engine (PIE) is a classification model. The objective of PIE is to infer the current mode of operation of individual UAV agents using the motion data extracted from the motion capture system. In [23], the Decision Tree (DT), Support Vector Machine (SVM), and Naïve Bayes classifiers are used to predict the modes, and DT was reported to outperform the other two classifiers. However, the performance of the DT-based classifier degrades significantly when the two new modes of operation (Loiter and Obstacleavoid) are introduced. With further investigation, we discovered that the motion history of a UAV plays an important role in predicting the current mode. In [23], the motion history is not considered when predicting the mode of the UAVs; only the current measurement of the observer module is used. Therefore, the DT’s performance is poor not only for the two new modes but also during the transition between two modes. More importantly, the behavior of UAVs is time-dependent, so it is necessary to explore the temporal relationship in the motion history. Hence, we treat the problem as a time series classification or sequence classification problem. A state-of-the-art technique for time series classification is the LSTM-based classifier [7,9,11,12,14,26]. We use the bidirectional variant of the LSTM-based classifier to develop PIE. The mathematical formulation of this type of network can be found in [14]. An extensive mathematical formulation and architecture of the Bi-LSTM network can be found in [5]. The architecture of PIE is a bidirectional LSTM followed by a dense layer that classifies the mode of operation. The architecture is shown in Fig. 4. The LSTM block is described in the following paragraph.
Fig. 4. The architecture of the Perception Inference Engine for the i-th UAV. Here, τ is the time step and FC is a fully connected layer.
LSTM Block: Each LSTM block in Fig. 4 takes three inputs: (i) the sensor measurement ($X_t$); (ii) the output ($h_{t-1}$) from the previous LSTM block; and (iii) the previous cell state ($C_{t-1}$). In an LSTM network, the cell state is the mechanism that stores the time dependency of the input features in the form of memory. The memory is stored or deleted using gates. Each LSTM block is therefore composed of three gates, named forget, input, and output, as shown in Fig. 5.
– Forget gate: This gate decides what information is kept in or removed from the cell state. The output of this gate is calculated using
\[ f_t = \sigma(W_f X_t + U_f h_{t-1} + b_f), \tag{1} \]
where $f_t \in \mathbb{R}^{n_o}$ is the output vector of the forget gate, $W_f \in \mathbb{R}^{n_o \times n_i}$ is the weight matrix of the forget gate for the input features, $X_t \in \mathbb{R}^{n_i}$ is the input feature vector, $U_f \in \mathbb{R}^{n_o \times n_o}$ is the weight matrix of the forget gate for the output vector of the previous cell, $h_{t-1} \in \mathbb{R}^{n_o}$ is the output vector of the previous cell, $b_f \in \mathbb{R}^{n_o}$ is the bias vector of the forget gate, $n_o$ is the output dimension, $n_i$ is the input dimension, and $\sigma$ refers to the sigmoid function.
– Input gate: The input gate updates the old cell state using
\[ \begin{aligned} i_t &= \sigma(W_i X_t + U_i h_{t-1} + b_i), \\ \hat{C}_t &= \tanh(W_c X_t + U_c h_{t-1} + b_c), \\ C_t &= f_t \odot C_{t-1} + i_t \odot \hat{C}_t, \end{aligned} \tag{2} \]
where $i_t \in \mathbb{R}^{n_o}$ is the output vector of the input gate, $W_i \in \mathbb{R}^{n_o \times n_i}$ is the weight matrix of the input gate for the input features, $U_i \in \mathbb{R}^{n_o \times n_o}$ is the weight matrix of the input gate for the output vector of the previous cell, $b_i \in \mathbb{R}^{n_o}$ is the bias vector of the input gate, $\hat{C}_t \in \mathbb{R}^{n_o}$ is the candidate cell state vector, $W_c \in \mathbb{R}^{n_o \times n_i}$ is the weight matrix of the candidate cell for the input features, $U_c \in \mathbb{R}^{n_o \times n_o}$ is the weight matrix of the candidate cell for the output vector of the previous cell, $b_c \in \mathbb{R}^{n_o}$ is the bias vector of the candidate cell, $C_t \in \mathbb{R}^{n_o}$ is the cell state vector, $\tanh$ is the hyperbolic tangent function, and $\odot$ denotes element-wise multiplication.
– Output gate: The output gate calculates the output of the LSTM block using
\[ \begin{aligned} o_t &= \sigma(W_o X_t + U_o h_{t-1} + b_o), \\ h_t &= o_t \odot \tanh(C_t), \end{aligned} \tag{3} \]
where $o_t \in \mathbb{R}^{n_o}$ is the output vector of the output gate, $W_o \in \mathbb{R}^{n_o \times n_i}$ is the weight matrix of the output gate for the input features, $U_o \in \mathbb{R}^{n_o \times n_o}$ is the weight matrix of the output gate for the output vector of the previous cell, $b_o \in \mathbb{R}^{n_o}$ is the bias vector of the output gate, and $h_t \in \mathbb{R}^{n_o}$ is the output of the LSTM block.
Fig. 5. Graphical representation of the LSTM block.
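The gate equations (1)-(3) can be traced step by step with a small pure-Python sketch. It is illustrative only: the parameter dictionary layout, the function names, and the all-zero weights are assumptions made for the example, and a real model would use a deep-learning framework rather than lists.

```python
import math

def sigmoid(v):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices W and U."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM block: returns (h_t, C_t) from X_t, h_{t-1}, C_{t-1}."""
    f_t = sigmoid(affine(p["Wf"], x_t, p["Uf"], h_prev, p["bf"]))       # Eq. (1)
    i_t = sigmoid(affine(p["Wi"], x_t, p["Ui"], h_prev, p["bi"]))       # Eq. (2)
    c_hat = [math.tanh(a)
             for a in affine(p["Wc"], x_t, p["Uc"], h_prev, p["bc"])]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    o_t = sigmoid(affine(p["Wo"], x_t, p["Uo"], h_prev, p["bo"]))       # Eq. (3)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy sizes (assumption): n_i = 2 input features, n_o = 1 output unit.
n_i, n_o = 2, 1
params = {}
for gate in "fico":  # forget, input, candidate, output
    params["W" + gate] = [[0.0] * n_i for _ in range(n_o)]
    params["U" + gate] = [[0.0] * n_o for _ in range(n_o)]
    params["b" + gate] = [0.0] * n_o

h_t, c_t = lstm_step([0.3, -0.2], h_prev=[0.0], c_prev=[1.0], p=params)
```

With all weights set to zero, every gate outputs 0.5 and the candidate state is 0, so the new cell state is half the previous one. This makes the roles of the forget and input gates easy to check by hand.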
4.5 True Scenario
The True Scenario block contains information about the expected behavior of the UAV agent in a particular scenario. It holds a list of tuples, each containing seven variables denoted as $(m, x, y, z, w, l, h) \in \mathbb{R}$. Here, $m$ refers to one of the seven modes ($m \in M$), and $(x, y, z)$ and $(w, l, h)$ are the center and dimensions of a virtual 3D bounding box within which $m$ is the expected mode of operation. This block takes the current position of the UAV and compares it with all the list elements to find the UAV’s current expected mode of operation. Suppose $(x_c, y_c, z_c) \in \mathbb{R}$ is the current position of the UAV; then the expected behavior of the UAV is $m$ if and only if
\[ |x - x_c| \le \frac{w}{2} \;\wedge\; |y - y_c| \le \frac{l}{2} \;\wedge\; |z - z_c| \le \frac{h}{2}. \]
Based on the scenarios defined in Sect. 4.3, the whole operational environment is partitioned into a number of 3D bounding boxes of different sizes. Each bounding box is assigned a unique mode of operation. Although this design may limit the development of scenarios in which a UAV is in different modes at the same physical location at different time stamps, it facilitates the automation of the testing framework. Moreover, the purpose of the framework is to test the UAV’s capabilities, such as vTakeoff or vLand, irrespective of the physical location. For any scenario that requires the same location for different modes, we can design an alternative scenario to handle the particular location conflict among modes of operation.
4.6 Evaluator
The Evaluator block compares the predicted mode of operation from the PIE block with the expected mode of operation that comes from the True Scenario block. The True Scenario block provides the expected mode of operation based on prior knowledge of the area where each scenario is supposed to be implemented and on the current position of the UAV. For example, if the UAV is currently on top of a target object, then the expected mode of operation of the UAV is Loiter. The confidence score for each UAV is calculated using
\[ \alpha = 1 - \frac{\sum_{n=0}^{N} W_{TM_n} \times L(TM_n, PM_n)}{\sum_{n=0}^{N} W_{TM_n}}, \tag{4} \]
where $\alpha \in [0, 1]$ is the confidence score, $N = \frac{T}{\delta t} \in \mathbb{N}$ is the total number of time steps in a scenario, $T$ is the total mission execution time, $\delta t$ is the sampling period, $i = \{1, 2, ..., K\}$, $K$ is the number of modes in a particular scenario, $W_{TM}^{i} \in \mathbb{R}$ is the user-defined weight for the different modes in a scenario with $\sum_{i=1}^{K} W_{TM}^{i} = 1$, $TM$ is the true mode coming from the True Scenario block, $PM$ is the predicted mode coming from the PIE block, and $L(A, B) = 0$ if $A = B$ or $1$ if $A \neq B$. In Eq. (4), $\alpha$ is a weighted accuracy and reflects the accuracy of a UAV while performing a requested mission. Since $\alpha$ is calculated at the end of any mission, $T \le T_{max}$ varies, with $T_{max}$ the maximum allowed time for the execution of a mission. As an example, Scenario-1 has four modes of operation, so there are four user-defined weights, each associated with one of the four modes. Accordingly, the value of $K$ is 4 in Scenario-1. The user-defined weights make the testing framework more flexible so that more critical modes can be tested effectively. For example, we can assign a small weight to the Hover mode and a larger weight to the Obstacleavoid mode so that the UAV is penalized more when it makes mistakes during obstacle avoidance.
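The bounding-box lookup of the True Scenario block (Sect. 4.5) and the confidence score of Eq. (4) fit together as sketched below. This is a minimal sketch rather than the authors' code: the function names, the box layout, the positions, and the predicted modes are all hypothetical, and returning `None` for a position outside every box is an assumption the paper does not address.

```python
def expected_mode(position, boxes):
    """Return the mode m of the first box (m, x, y, z, w, l, h) containing position."""
    xc, yc, zc = position
    for m, x, y, z, w, l, h in boxes:
        if abs(x - xc) <= w / 2 and abs(y - yc) <= l / 2 and abs(z - zc) <= h / 2:
            return m
    return None  # outside every box (handling is an assumption)

def confidence_score(true_modes, predicted_modes, weights):
    """Eq. (4): alpha = 1 - sum_n W_TMn * L(TMn, PMn) / sum_n W_TMn."""
    penalty = sum(weights[tm]
                  for tm, pm in zip(true_modes, predicted_modes) if tm != pm)
    total = sum(weights[tm] for tm in true_modes)
    return 1.0 - penalty / total

# Hypothetical boxes stacked along z: ground slab, climb column, hover slab.
boxes = [
    ("Hold",     0.0, 0.0, 0.05, 2.0, 2.0, 0.1),
    ("vTakeoff", 0.0, 0.0, 1.0,  2.0, 2.0, 1.8),
    ("Hover",    0.0, 0.0, 2.0,  2.0, 2.0, 0.2),
]

# Hypothetical tracked positions (x_c, y_c, z_c), one per time step.
positions = [(0.0, 0.0, 0.02), (0.0, 0.0, 0.5), (0.0, 0.0, 1.5), (0.0, 0.0, 2.0)]
true_modes = [expected_mode(p, boxes) for p in positions]

# Hypothetical PIE output: the third step is misclassified as Hover.
predicted_modes = ["Hold", "vTakeoff", "Hover", "Hover"]

# Uniform weights 1/K with K = 3 modes.
weights = {"Hold": 1 / 3, "vTakeoff": 1 / 3, "Hover": 1 / 3}
alpha = confidence_score(true_modes, predicted_modes, weights)
```

With one of four equally weighted steps misclassified, the sketch gives a confidence score of 0.75, which is plain accuracy. Non-uniform weights change how much each mode's errors count.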
5 Synthetic Data Generation
The proposed data-driven testing framework requires a significant amount of good-quality data. Therefore, a simulation environment is developed using off-the-shelf, state-of-the-art tools and software packages that are both open source and user friendly. The Robot Operating System (ROS) is used to develop the intelligence of the UAV agent for mapping, path planning, and object recognition. We use the Pixhawk firmware to develop the low-level controller and the on-board sensor data management of the UAV. Gazebo is used as the 3D simulation environment. All five scenarios are developed in this simulation environment, and the motion data of the UAV agents are recorded to train the PIE classifier. The operational area of the UAVs is set to 50 m × 50 m × 10 m. The autonomy implementation details are out of the scope of this paper. We record twelve features: the $x, y, z$ positions; roll ($\phi$), pitch ($\theta$), and yaw ($\psi$); the $\dot{x}, \dot{y}, \dot{z}$ velocities in the three directions; and the roll rate ($\dot{\phi}$), pitch rate ($\dot{\theta}$), and yaw rate ($\dot{\psi}$). Using this feature vector, we can train a deep learning model to predict the current mode (one of the seven modes) of the UAV. The experimental results using these synthetic data are listed in Sect. 7.
6 Hardware Implementation
We use a UAV model that we built ourselves to implement the scenarios and test it using the proposed framework. We followed a development procedure for the UAV hardware similar to the one described in [8]. The UAV is designed using a quadcopter architecture. It is equipped with a Pixhawk autopilot for low-level actuator control and IMU data processing, and with an Intel Aero Compute Board for the high-level implementation of mission planning and computer vision algorithms, such as 3D map generation of the environment using an Intel RealSense camera. Moreover, the UAV can carry an external load of 250 g for a maximum flight time of 12 min. We use a software architecture similar to that of the simulation platform, namely ROS and the Pixhawk firmware. As a result, all the software developed in the simulation platform can be used directly on the UAV hardware with minor modifications. The assembled UAV is shown in Fig. 6. The dimension of the indoor flight testing environment is 12 m × 4 m × m, and it has ten high-definition, fully synchronized, high-resolution video cameras that are evenly distributed in the space. During the hardware implementation of the framework, we use the motion capture system to track each UAV and predict the current mode of operation of the UAV using the PIE block. The testing framework provides the confidence scores for these UAV agents after each implemented scenario. The experimental results are discussed in Sect. 7.
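Fig. 1 describes the velocities as the time derivative of the tracked 6D pose. One simple way to obtain the 12D feature vector from two consecutive motion-capture samples is a backward finite difference, sketched below. The paper does not state which differentiation scheme is used, so the scheme, the function names, the angle wrapping, and the 60 Hz sampling period (the rate reported for the simulation data in Sect. 7.1) are assumptions.

```python
import math

def wrap_angle(a):
    """Wrap an angle difference into [-pi, pi) so yaw jumps near +/-pi stay small."""
    return (a + math.pi) % (2 * math.pi) - math.pi

def pose_to_features(prev_pose, pose, dt):
    """Return [x, y, z, phi, theta, psi, and their six rates] from two 6D poses."""
    linear = [(c - p) / dt for p, c in zip(prev_pose[:3], pose[:3])]
    angular = [wrap_angle(c - p) / dt for p, c in zip(prev_pose[3:], pose[3:])]
    return list(pose) + linear + angular

# Hypothetical consecutive samples (x, y, z, roll, pitch, yaw), angles in radians.
dt = 1.0 / 60.0                      # assumed sampling period
prev_pose = (0.00, 0.0, 1.5, 0.0, 0.0, 0.10)
pose = (0.02, 0.0, 1.5, 0.0, 0.0, 0.11)
features = pose_to_features(prev_pose, pose, dt)
```

Raw finite differences amplify measurement noise, so a practical pipeline would probably smooth the poses or the rates first. That step is omitted here.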
7 Experiments and Discussion
In this section, the experimental results from simulation and hardware implementation are described.
Fig. 6. The developed UAV which is used to implement the scenarios and tested in the indoor flight testing environment.
7.1 Data Collection and Model Selection
In the first stage, we implement all five scenarios described in Sect. 4.3 in the simulation environment and record the data during the simulation. All the data is sampled at 60 Hz. The recorded data is presented in Table 1. We label the data with respect to the seven modes and use the labeled data to train the classification model.

Table 1. An overview of the recorded synthetic data from four UAVs.
Mode           Scenario-1  Scenario-2  Scenario-3  Scenario-4  Scenario-5
Hold                13264        3378       64310       15361       33644
vTakeoff             1813        1850        1900        1978        1974
Hover                9001        9036        9045        9039        9025
Search                  0       40868      204655      154459      156413
Loiter                  0           0       83451           0       49397
Obstacleavoid           0           0           0        7665        9119
vLand                1940        1974        1986        1993        2027
Since Scenario-5 contains all the developed modes of operation, we use only Scenario-5 to train the model. We split the data from Scenario-5 into a training set (80%) and a validation set (20%). Then, we test the model with data from the other four scenarios. To select the best Bi-LSTM model, we use three different time-steps (32, 64, 128), two different feature sets (seven features: $[z, \dot{x}, \dot{y}, \dot{z}, \psi, \theta, \phi]$ and nine features: $[x, y, z, \dot{x}, \dot{y}, \dot{z}, \psi, \theta, \phi]$), and four different neuron sizes in the LSTM cell (64, 128, 256, 512). A total of 24 (3 × 2 × 4) different models are trained for 700 epochs with the Adam [16] optimizer (learning rate = 0.001) and categorical cross-entropy as the loss function. We use Keras [3] with the TensorFlow [1] back-end as the deep-learning framework to develop, train, and validate the models. We used an Intel Xeon(R) CPU at 2.2 GHz with 88 cores, 128 GB of RAM, and an Nvidia GeForce RTX 2080 Ti GPU. Training one Bi-LSTM model took 7 h 46 min 7 s. Based on the accuracy metric, the best parameter set is 128 time-steps, the seven-feature set, and a neuron size of 64, as shown in Fig. 7. The best model has only 37,767 trainable parameters. Table 2 summarizes the testing scores under the best parameter setting in terms of Accuracy, Precision, Recall, and F1 Score. Table 2 also presents a comparison with the baseline decision tree model. The decision tree is trained with the same number of input features and the same number of time steps. Consequently, the input feature length of the decision tree is 896 (128 × 7). From Table 2, it is clear that Bi-LSTM outperforms the baseline decision tree (DT) model, and thus we use the Bi-LSTM model in the PIE block. With the best parameter setting, the trained model is used as the PIE block in the framework, and experiments are conducted using four UAVs. The details of the experiments are discussed in the following subsections.
Fig. 7. Performance of different trained models for different combination of parameters.
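The size of the model-selection grid and the reported parameter count of the best model can be cross-checked with a few lines of arithmetic. The sketch below assumes a standard LSTM parameterization (four gate blocks per direction, as in Eqs. (1)-(3)), concatenated forward and backward outputs, and a dense layer with seven outputs, one per mode. The paper does not spell out these layout details, so they are assumptions, as is the function name.

```python
from itertools import product

def bilstm_classifier_params(n_features, n_units, n_classes):
    """Trainable parameters of one Bi-LSTM layer followed by one dense layer."""
    # Each direction has 4 blocks (forget, input, candidate, output), each with
    # W (n_units x n_features), U (n_units x n_units), and a bias (n_units).
    per_direction = 4 * (n_units * n_features + n_units * n_units + n_units)
    # The dense layer sees the concatenated forward and backward outputs.
    dense = (2 * n_units) * n_classes + n_classes
    return 2 * per_direction + dense

# Hyper-parameter grid from the text: time-steps x feature sets x neuron sizes.
time_steps = (32, 64, 128)
feature_set_sizes = (7, 9)
neuron_sizes = (64, 128, 256, 512)
grid = list(product(time_steps, feature_set_sizes, neuron_sizes))

best = bilstm_classifier_params(n_features=7, n_units=64, n_classes=7)
dt_input_length = 128 * 7   # flattened window fed to the decision tree
```

Under these assumptions the count comes out to 37,767, matching the figure reported for the best model, and the grid has the stated 24 configurations. The match is consistent with the assumed layout but does not by itself confirm it.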
7.2 Deployment of PIE
In the simulation interface, three different types of experiments are conducted to evaluate the efficacy of the proposed testing framework. In each experiment, all the UAVs operate concurrently.

Table 2. Best testing scores of the best trained models (Bi-LSTM and DT) using data recorded from four UAVs during the data-collection phase.
            Accuracy        Precision       Recall          F1 Score
            Bi-LSTM  DT     Bi-LSTM  DT     Bi-LSTM  DT     Bi-LSTM  DT
Scenario-1  0.96     0.85   0.96     0.92   0.96     0.85   0.96     0.86
Scenario-2  0.96     0.92   0.98     0.95   0.96     0.92   0.97     0.94
Scenario-3  0.97     0.93   0.97     0.93   0.97     0.93   0.97     0.93
Scenario-4  0.97     0.90   0.98     0.98   0.97     0.90   0.98     0.94
Scenario-5  0.95     0.90   0.95     0.91   0.95     0.90   0.95     0.90

Experiment-1: UAVs with healthy sensors, controllers, and planning algorithms are used in all five scenarios. Table 3 summarizes the testing results of UAVs with healthy functionalities. From Table 3, we can infer that each UAV gets a confidence score close to or above 0.9 for performing well in these five scenarios. In each trial, the user-defined weights are set to 1/K for each mode of operation.

Table 3. Average confidence score (α) of 10 trials for the four simulated UAVs in five different scenarios while the UAVs have a healthy implementation of autonomy.
            UAV1  UAV2  UAV3  UAV4
Scenario-1  0.86  0.87  0.90  0.87
Scenario-2  0.94  0.93  0.96  0.92
Scenario-3  0.93  0.93  0.92  0.94
Scenario-4  0.93  0.91  0.93  0.92
Scenario-5  0.88  0.90  0.90  0.91
Experiment-2: In this experiment, we disable different functionalities of the UAVs to test whether the proposed testing framework can capture the deficiencies of the UAVs in four scenarios. In Scenario-2, we inject delays in the path planning algorithms of UAV1 and UAV3. The target detection modules of UAV2 and UAV3 are disabled during the execution of Scenario-3. For Scenario-4 and Scenario-5, the obstacle detection sensors are removed from UAV1 and UAV4. The testing results are shown in Table 5. From Table 5, we observe that the UAVs show a poor confidence score in each scenario in which such technical issues occur. The user-defined weights for this experiment are tabulated in Table 4.

Table 4. User-defined weights for different scenarios for Experiment-2. Here, “-” indicates that the respective scenario does not have the respective mode of operation.
            W_Hold  W_vTakeoff  W_Hover  W_Search  W_Loiter  W_Obstacleavoid  W_vLand
Scenario-2  0.1     0.1         0.1      0.6       -         -                0.1
Scenario-3  0.1     0.1         0.1      0.1       0.5       -                0.1
Scenario-4  0.1     0.1         0.1      0.1       -         0.5              0.1
Scenario-5  0.1     0.1         0.1      0.1       0.1       0.4              0.1

Table 5. Average confidence score (α) of 10 trials for the four simulated UAVs in four different scenarios while different functionalities of the UAVs are disabled.
            UAV1  UAV2  UAV3  UAV4
Scenario-2  0.15  0.89  0.17  0.88
Scenario-3  0.95  0.09  0.12  0.93
Scenario-4  0.05  0.94  0.93  0.02
Scenario-5  0.04  0.92  0.91  0.02

Experiment-3: In this experiment, we implement a new behavior in Scenario-3 for which the PIE model is not directly trained. In the modified Scenario-3, UAVs are commanded to move gradually upward when they find a target object in the search space. Therefore, the trajectory of the Loiter mode changes from a circle into a spiral in this experiment, as shown in Fig. 8. The maximum change of altitude in this mode is set to 25% of the vertical height of the indoor flight testing space, which is a 67% change of altitude from the nominal behavior. We introduce this type of behavior to represent atmospheric disturbance in the physical environment. The testing result is shown in Table 6. Table 6 indicates that the proposed framework still provides a high confidence score to each UAV when the UAV’s behavior shifts significantly from the nominal behavior but in an acceptable direction. In each trial, the user-defined weights are set to 1/K for each mode of operation.
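The effect of the Experiment-2 weights in Table 4 can be illustrated on a small example. The sketch below transcribes the Table 4 weights, checks that each scenario's weights sum to 1 as Eq. (4) requires, and compares uniform 1/K weights against the Table 4 weights on a hypothetical 20-step Scenario-4 trace. The trace, the function name `alpha`, and the misclassification pattern are invented for illustration and are not results from the paper.

```python
# User-defined weights transcribed from Table 4 (absent modes are omitted).
TABLE4_WEIGHTS = {
    "Scenario-2": {"Hold": 0.1, "vTakeoff": 0.1, "Hover": 0.1,
                   "Search": 0.6, "vLand": 0.1},
    "Scenario-3": {"Hold": 0.1, "vTakeoff": 0.1, "Hover": 0.1,
                   "Search": 0.1, "Loiter": 0.5, "vLand": 0.1},
    "Scenario-4": {"Hold": 0.1, "vTakeoff": 0.1, "Hover": 0.1,
                   "Search": 0.1, "Obstacleavoid": 0.5, "vLand": 0.1},
    "Scenario-5": {"Hold": 0.1, "vTakeoff": 0.1, "Hover": 0.1, "Search": 0.1,
                   "Loiter": 0.1, "Obstacleavoid": 0.4, "vLand": 0.1},
}

def alpha(true_modes, predicted_modes, w):
    """Weighted accuracy of Eq. (4)."""
    missed = sum(w[t] for t, p in zip(true_modes, predicted_modes) if t != p)
    return 1.0 - missed / sum(w[t] for t in true_modes)

# Hypothetical trace: every Obstacleavoid step is classified as Search,
# all other steps are classified correctly.
true_trace = (["Hold"] * 2 + ["vTakeoff"] * 2 + ["Hover"] * 2 + ["Search"] * 8
              + ["Obstacleavoid"] * 4 + ["vLand"] * 2)
pred_trace = ["Search" if m == "Obstacleavoid" else m for m in true_trace]

w4 = TABLE4_WEIGHTS["Scenario-4"]
uniform = {m: 1.0 / len(w4) for m in w4}      # 1/K weights, as in Experiment-1
alpha_uniform = alpha(true_trace, pred_trace, uniform)
alpha_weighted = alpha(true_trace, pred_trace, w4)
```

For this made-up trace the score drops from 0.8 with uniform weights to roughly 0.44 with the Table 4 weights, which shows how a larger weight on Obstacleavoid makes errors in that mode dominate the score.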
Fig. 8. The expected behavior of a UAV during Loiter mode in the modified Scenario-3.
Table 6. Average confidence score (α) of 10 trials for the four simulated UAVs in the modified Scenario-3.
            UAV1  UAV2  UAV3  UAV4
Scenario-3  0.89  0.86  0.86  0.85
In summary, Table 3 shows that the testing framework provides a high confidence score when all the UAVs are performing well in the simulation environment. Conversely, Table 5 shows very poor confidence scores for UAVs in the simulation when there are technical issues in the UAV’s autonomy implementation. Moreover, Table 6 demonstrates that the testing framework can handle a behavior that was not seen during training.
7.3 Deployment of PIE in Hardware
After obtaining satisfactory results from the simulation, we implement Scenario-1 and Scenario-2 in the indoor flight testing facility for two UAVs. The objective of the real-world implementation and experimentation is to further evaluate the performance of the proposed testing framework. Due to the limited space of the physical environment, we are unable to implement Scenario-3, Scenario-4, and Scenario-5 in the indoor flight testing facility. We do not implement any faulty behaviors in real UAVs because doing so may damage the hardware components of the UAVs. Moreover, we plan to develop a control mechanism that will take control of the UAVs in case an unexpected behavior occurs during physical testing. The UAVs are tracked using high-definition, fully synchronized, high-resolution video cameras (External Observer), and the obtained features are fed to the trained PIE model. The Evaluator uses the predicted mode from the PIE module and the true mode from the True Scenario block to calculate the confidence score of each UAV in the indoor flight testing facility. The mean of the confidence scores over ten trials for the UAVs in the hardware implementation is presented in Table 7. Considering the dynamic nature of the physical world, we also provide the standard deviation of the confidence scores in Table 7.

Table 7. Average confidence score (α) of 10 trials for the two real UAVs in two different scenarios while the UAVs have a healthy implementation of autonomy.
            UAV1           UAV2
            μ     σ        μ     σ
Scenario-1  0.95  0.015    0.96  0.011
Scenario-2  0.92  0.026    0.93  0.014
From Tables 3 and 7, we can infer that the proposed testing framework can be utilized to test the operational success of autonomous UAV missions not only in the simulation platform but also in the physical testing environment. More importantly, Table 5 indicates that the framework can evaluate the undesirable behavior of a UAV agent by assigning a relatively low confidence score.
8 Conclusion and Future Work
In this paper, a data-driven testing framework is proposed to evaluate the performance of multiple UAVs while performing missions in real-world scenarios. We define seven modes of operation to describe the behaviors of UAVs and use the defined modes of operation to assess the autonomous capabilities of UAVs in executing missions. An LSTM-based classifier is used to provide an accurate estimation of the behavior of UAVs by exploring the temporal relationship in the data, and a comparison with the previously best-known classifier (decision tree) has been presented. To train and validate the proposed framework, we developed a simulation interface using ROS, Gazebo, and Pixhawk. Five different scenarios were designed and implemented in the developed simulation interface. Furthermore, we implemented two scenarios in the indoor flight testing facility and tested the proposed framework with real-world data. The experimental results support the efficacy of the proposed testing framework for use in both simulation and real-world scenarios. A potential application of the proposed framework is the development of safe and reliable urban air mobility networks. Additionally, the proposed framework offers a promising solution for the recurrent qualification assurance of UASs after deployment. In the future, we will implement the other three scenarios in a larger indoor testing facility with a control mechanism that will reduce the chance of damaging the UAV during the testing procedure. Moreover, collaboration among UAVs will be considered for designing more comprehensive scenarios for the performance evaluation of the proposed testing framework. Furthermore, effort will be devoted to interpreting the confidence score in terms of upper or lower bounds in order to consider real-world standards for the operational success of UASs.
Acknowledgment. The authors would like to thank the Office of the Secretary of Defense (OSD) for the financial support under agreement number FA8750-15-2-0116. This work is also partially funded through the National Institute of Aerospace’s Langley Distinguished Professor Program under grant number C16-2B00-NCAT, and by the NASA University Leadership Initiative (ULI) under grant number 80NSSC20M0161. This work is also supported in part by NSF under grant Nos. CAREER CPS-1851588, S&AS 1849198, and SATC-1801611.
References
1. Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2016), pp. 265–283 (2016)
2. Chaki, S., Dolan, J.M., Giampapa, J.A.: Toward a quantitative method for assuring coordinated autonomy. In: Proceedings of ARMS Workshop (2013)
3. Chollet, F., et al.: Keras (2015)
4. Cowart, K., Valerdi, R., Kenley, C.R.: Development, validation and implementation considerations of a decision support system for unmanned & autonomous system of systems test & evaluation (2010)
5. Cui, Z., Ke, R., Pu, Z., Wang, Y.: Deep bidirectional and unidirectional LSTM recurrent neural network for network-wide traffic speed prediction. arXiv preprint arXiv:1801.02143 (2018)
6. Djang, P.A., Lopez, F.: Unmanned and autonomous systems mission based test and evaluation. In: Proceedings of the 9th Workshop on Performance Metrics for Intelligent Systems, pp. 81–85 (2009)
7. Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM (1999)
8. Girma, A., et al.: IoT-enabled autonomous system collaboration for disaster-area management. IEEE/CAA J. Automatica Sin. 7(5), 1249–1262 (2020)
9. Girma, A., Yan, X., Homaifar, A.: Driver identification based on vehicle telematics data using LSTM-recurrent neural network. In: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), pp. 894–902. IEEE (2019)
10. Gonda, N.D.: A framework for test & evaluation of autonomous systems along the virtuality-reality spectrum. Master’s thesis, Old Dominion University, Norfolk, VA, USA (2019)
11. Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 18(5–6), 602–610 (2005)
12. Greff, K., Srivastava, R.K., Koutník, J., Steunebrink, B.R., Schmidhuber, J.: LSTM: a search space odyssey. IEEE Trans. Neural Netw. Learn. Syst. 28(10), 2222–2232 (2016)
13. Holden, J., Goel, N.: Fast-forwarding to a future of on-demand urban air transportation, San Francisco, CA (2016)
14. Huang, Z., Xu, W., Yu, K.: Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
15. Keane, J., Joiner, K.: Experimental test and evaluation of autonomous underwater vehicles. Aust. J. Multi-Discip. Eng. 16(1), 67–79 (2020)
16. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
17. Koopman, P., Wagner, M.: Challenges in autonomous vehicle testing and validation. SAE Int. J. Transp. Saf. 4(1), 15–24 (2016)
18. Leathrum, J.F., Mielke, R.R., Shen, Y., Johnson, H.: Academic/industry educational lab for simulation-based test & evaluation of autonomous vehicles. In: 2018 Winter Simulation Conference (WSC), pp. 4026–4037. IEEE (2018)
19. Michelson, R.C.: Test and evaluation for fully autonomous micro air vehicles. ITEA J. 29(4), 367–374 (2008)
20. NASA: Advanced air mobility studies/reports/presentations (2019)
21. Reitz, B.C., Wilkerson, J.L.: Test and evaluation of autonomous surface vehicles: a case study. In: 2020 IEEE/ION Position, Location and Navigation Symposium (PLANS), pp. 839–850. IEEE (2020)
22. Roske, V.P., Kohlberg, I., Wagner, R.: Autonomous systems challenges to test and evaluation. In: 28th Conference of National Defense Industrial Association (2012)
23. Sarkar, M., Homaifar, A., Erol, B.A., Behniapoor, M., Tunstel, E.: PIE: a tool for data-driven autonomous UAV flight testing. J. Intell. Robot. Syst. 98(2), 421–438 (2019). https://doi.org/10.1007/s10846-019-01078-y
24. Sun, Y., Xiong, G., Song, W., Gong, J., Chen, H.: Test and evaluation of autonomous ground vehicles. Adv. Mech. Eng. 6 (2014). https://doi.org/10.1155/2014/681326
25. Thompson, M.: Testing the intelligence of unmanned autonomous systems. Technical report, Trideum Corp., Aberdeen, MD (2008)
26. Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., Woo, W.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, pp. 802–810 (2015)
Addressing Consumer Demands: A Manufacturing Collaboration Process Using Blockchain for Knowledge Representation
Ricardo Barbosa1,2(B), Ricardo Santos1, and Paulo Novais2
1 CIICESI, Escola Superior de Tecnologia e Gestão, Politécnico do Porto, Porto, Portugal {rmb,rjs}@estg.ipp.pt
2 Department of Informatics, ALGORITMI Center, University of Minho, Braga, Portugal [email protected]
Abstract. Under I4.0, the evolution of manufacturing processes is supported by an increase in the data that organisations produce and make available, the digitalisation of manufacturing pipelines, and a paradigm shift in production (from mass production to mass personalisation). Additionally, organisations need to gather the necessary conditions to adapt quickly to a changing environment and to replace reactiveness with proactivity. Collaboration can act as the foundation of an answer to the increased demand for customised products, with an open and transparent environment where information is shared and actors can work together to solve a common problem. In this work we propose a model definition for an industrial collaboration network composed of a network of entities, with reasoning and interaction, that uses a blockchain for knowledge representation. Current definitions of multi-agent systems (MAS) already include a representation of equipment, transportation, products, and organisations; our contribution proposes the inclusion of the consumer, represented by an agent, directly in the manufacturing process. This agent represents the preferences and needs of the consumer in product customisation scenarios and, together with the other agents, negotiates criteria and cooperates with them. The network is composed of distinct types of agents, across multiple organisations, that share common objectives. We use Hyperledger Fabric to represent knowledge, ensuring that the data is stored and shared with all entities, while keeping the information secure and ensuring that it cannot be tampered with.
Keywords: Collaboration · Negotiation · Industry 4.0 · Blockchain · Multi-agent system
This work has been supported by FCT – Fundação para a Ciência e Tecnologia within the Project Scope: UIDB/04728/2020.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 375–390, 2022. https://doi.org/10.1007/978-3-030-82193-7_25
1 Introduction
Recent definitions of Industry 4.0 (I4.0) commonly name collaboration scenarios, and the urge to collaborate, as an essential characteristic for the success of the fourth industrial revolution [3]. Although this revolution is destined to impact the overall performance, quality, and control of the manufacturing process, it still faces some challenges [24]. To answer its demands, organisations need to collaborate more efficiently, make faster and more reliable decisions, and establish transactions with the right partners. Organisations establish among themselves a collaborative principle, which typically operates in the supply chain, in order to bring benefits to their activities and to gain the ability to respond to the needs of the most demanding consumers. With the introduction of I4.0, the manufacturing process must be able to meet consumers' needs, which increases the demands placed on the supply chain and creates a need to improve communication and supplier integration [13]. In a set of entities where all (or part) of the services and/or products depend heavily on the availability of services and/or products from other entities, it is difficult to make quick and easy decisions. There will always be a set of dependencies between two or more entities that share a position in the supply chain [27], and the problem is that, in an environment where organisations need to make decisions assertively and quickly, there is no way of knowing which entity to depend on. Such developments in manufacturing processes are supported by governmental agendas and are part of the strategic plans for the future of industry in many countries and in the European Union [11]. Collaboration is based on trust, but in the absence of social capabilities or characteristics it can be based on agreed contracts that bind two entities [30].
We identify four necessities: (1) decentralising the decision-making process; (2) supporting collaboration; (3) including the consumer in the manufacturing processes; and (4) representing the generated knowledge. As a result, this work proposes the inclusion of the consumer, represented by an agent, directly in the manufacturing process. To achieve that, it defines a model for an industrial collaboration network that includes the consumer in the manufacturing process (social manufacturing) and is composed of a collaborative network of entities (based on a Multi-Agent System), a reasoning and interaction layer, and a blockchain used to represent knowledge. Our proposal is based on the definition of a Multi-Agent System (MAS) to represent industrial entities in an environment where there is a necessity for collaboration, while maintaining their competitive nature. This model can support decision-making processes regarding which entity one should rely on to resolve existing dependencies, and is initially focused on the manufacturing of customised products.
This work is structured as follows. In Sect. 2 we present a background on the technologies and concepts included in this work, with special reference to blockchain technology. That section also includes related work that combines MAS and blockchain technologies. Sect. 3 describes our proposed solution, including the process that originated the proposed model and a description of its main components, namely: the network of entities, the reasoning and interaction layer, and knowledge representation. The work concludes (Sect. 4) with a discussion of the proposed solution, its strengths, limitations, and future work paths.
2 Background
As a concept, Ambient Intelligence defines a vision of the Information Society with emphasis on greater ease of use, more efficient support services, and support for human interactions, referring to a digital environment that proactively but sensibly supports people in their daily activities. The focus of this concept has been adjusted over time according to the needs of each period [6], and a quantitative analysis of scientific publications in the field suggests that the term has been replaced by more popular terminologies appropriate to the area of application, including the I4.0 terminology that is typically associated with Ambient Intelligence in an industrial context (or an intelligent industrial environment). First introduced by the German industry during the Hannover Fair event in 2013 [15], the I4.0 concept is driven by emerging technologies that are being adopted in manufacturing environments, such as the Internet of Things, wireless sensor networks, big data, cloud computing, and embedded systems. One of its main objectives is the creation of new value for the industry, through the creation of new business models and the resolution of numerous social problems [14]. Cyber-Physical Systems (CPS) are defined as a transforming technology that provides innovative services to enable interconnected operations between physical assets, computing, and communication [16]. Shafiq et al. [26] define CPS as being “the convergence of the physical and digital worlds by establishing global networks for business that incorporate their machinery, warehousing systems and production facilities”. Monostori et al. [20] state that CPS “are systems of collaborating computational entities which are in intensive connection with the surrounding physical world and its ongoing processes, providing and using, at the same time, data-accessing and data-processing services available on the Internet”.
With the growing usage of sensors and network-connected machines, data will be generated continuously; CPS manage this data and leverage the connectivity between machines, giving rise to smart machines. Applying the concept of CPS to production, logistics, and services in current industrial practice would also transform today's factories into smart factories with significant economic potential [16]. With the increasing usage of online social networks and the adoption of new technologies, there is a demand to include consumers' opinions in product manufacturing, customisation, and delivery, requiring factories to become self-aware and self-maintaining, and capable of making market predictions and acting accordingly [17]. Social Cyber-Physical Systems (SCPS) are an evolution of the CPS
model that combines production services with the consumer, understanding consumer demands and offering personalised products and services in a timely manner [33].
Agent-based technology is recognised as an important approach for twenty-first century manufacturing systems. The suitability of agent technology is an important factor to consider in real-world applications, particularly in I4.0, since it can bring a major improvement in decision-making processes and in the collaboration of different systems [2]. An agent is an entity that senses the environment and acts on it, performing a task continuously and with strong autonomy in a shifting environment, while coexisting with other entities and processes. Multi-Agent Systems (MAS) aim to provide both principles for the construction of complex systems involving multiple agents and mechanisms for the coordination of independent agent behaviour [28]. While an agent is any individual entity that makes decisions independently, a MAS is a network of agents that work together to solve a specific problem, implying a certain level of cooperation among the agents involved, which can be explicit by design or adapted. MAS are a particular type of intelligent system, where autonomous agents dwell in a world with no control or persistent knowledge. This infrastructure has been studied as a solution for managing widely distributed systems, particularly industrial applications. MAS, which consist of multiple autonomous agents with distinct goals, are especially suitable for the development of complex and dynamic systems. Agents communicate with each other and with the environment, with a focus on understanding the latter and reasoning upon intelligent models, coordinating their efforts to achieve their own goals and those of the ecosystem in which they are inserted.
2.1 Blockchain
Since the publication of “Bitcoin: A Peer-to-Peer Electronic Cash System” by Nakamoto [21], and the subsequent announcement of the first public version of the bitcoin client, blockchain has become one of the most popular topics today. Since then, blockchain has been commonly associated with cryptocurrency and has accompanied its success, which triggered the curiosity of researchers from different academic backgrounds to pursue the many application scenarios for blockchain technology [4]. Despite the current success in digital currencies and financial assets, its potential application reach is still a work in progress [1]. Blockchain is the generic designation given to transaction persistence protocols, which are based on different algorithms and cryptographic principles that ensure the integrity and traceability of all transactions within the system without the need to place trust in a central entity, thereby keeping it decentralised and distributed. The successor of the initial blockchain protocols (Blockchain 1.0), whose implementation is restricted to ensuring that a predefined set of validations is respected, is Blockchain 2.0. This new designation is associated with the new generation of blockchain protocol implementations, designed from their inception to support the definition
of business rules and custom validations through smart contracts. This generation emerged as a direct response to increasing demands from the industry, which was anxiously awaiting a framework that allowed the full exploration of this technology for a wide variety of purposes. Smart contracts were introduced as a concept by Szabo in the 90s [29], who defined them as a computerised transaction protocol that executes the terms of a contract [8]. This definition was based on the necessity to translate contractual clauses into code and embed them into hardware or software capable of self-enforcing them, reducing the need for a trusted intermediary between transacting parties. In blockchain, smart contracts are self-enforcing scripts that represent a digital contract [18]. They work as a software protocol that performs an action when certain conditions are met, reducing the amount of human involvement required to create, execute, and enforce a contract. Since there is no necessity for the contract partners to fully trust each other, blockchain, as a distributed system, is suitable for this type of application: it removes the intermediary and simplifies trustless protocols between multiple parties [32].
2.2 Related Work
The combination of these technologies (namely blockchain, MAS, and smart contracts) is not a novel concept. There are existing proposals based on the combination of some or all of the previously described technologies in the described domain (intelligent industrial environment), namely:
– The work of Casado-Vara et al. [7] presents a model that uses a combination of blockchain, smart contracts, and a MAS to coordinate the tracking of food in the agriculture supply chain. The proposed model uses blockchain to store a record of all transactions, a decision that the authors justify by security and decentralisation needs. The coordination of all the members of the supply chain is performed using a MAS, where agents verify the fulfilment of smart contracts for each transaction between entities.
– The main objective of Abeyratne and Monfared [1] was to define a blockchain-based system to facilitate the handling of the vast amount of data that is required about products and their consumers in a manufacturing domain. Their approach consists of a decentralised distributed system that uses blockchain to collect, store, and manage the data related to the product life cycle; the authors claim that this solution allows consumers to access information related to a specific product at any given time, resulting (theoretically) in better buying decisions.
– The work of Ghadimi et al. [12] proposes a MAS approach as a solution for the automation, and process facilitation, of sustainable supplier selection and order allocation, which results in a more cooperative partnership. Their proposed model is composed of two sub-models: a supplier evaluation model and an order allocation model. The first sub-model uses three types of agents: a database agent, a supplier agent, and a decision maker agent. The second sub-model uses an order allocator agent, a database agent, and a supplier agent. According to the authors, their model can improve the order fulfilment rate, decrease demand uncertainty, and eventually lead to improvements in the performance of a supply chain.
– The work of Wang et al. [31] proposes the definition of a MAS to represent an Industrial Network, with the following agents: Machine Agent (MA), which represents all the equipment that performs any production or test activity; Conveying Agent (CA), which represents all entities that move a product, like robots, conveyor belts, and others; and Product Agent (PA), which represents the products that are or will be processed by MA and transported by CA. In addition, they propose an intelligent negotiation mechanism that lets agents cooperate with each other and prevents deadlocks by improving their decision-making and coordination behaviour.
3 Proposed Solution
The challenges that industries face today include the need for collaboration between different organisations and/or partners, including suppliers, service providers, shipping providers, and even competing organisations. While this collaboration might prove to be an improvement on the previous unidirectional communication channels between organisations, there is still a necessity to create communication channels to consumers in order to answer the shift in the production paradigm: from mass production towards mass customisation. With reports [9] affirming that consumers are asking for more complex or personalised products with increasing frequency, only through collaboration will organisations be capable of answering such exclusive demands. Collaboration occurs when organisations work jointly on the development of products, where the distributed returns are sufficient for all the collaborating parties [23], with a free flow of information between collaborating organisations, which in turn provides faster decision-making and can enhance the effectiveness of internal processes. With an increase in productive efficiency under I4.0, manufacturing flexibility and the integration of different processes and activities are guaranteed by the intelligent manufacturing environment. The problem is how, besides handling manufacturing and process flexibility, industries will be able to fulfil the personalised demands of their consumers and respond better to their needs and preferences. I4.0 assumes operation in a computerised and intelligent manufacturing environment, assuring flexibility and high production efficiency, which allows faster communication between customer and producer, with consumers being much more demanding and requesting more personalised products. As a result, it is even more important to include the consumer in the manufacturing process (social manufacturing).
Therefore, this work proposes a model definition for an industrial collaboration network that includes the consumer in the manufacturing process. A visual representation of this model is illustrated in
Fig. 1, which is divided into three main components: (1) a collaborative network of entities based on a MAS; (2) a reasoning and interaction layer; and (3) knowledge representation using blockchain. The proposal is based on the definition of a MAS in an industrial context, in which agents represent the different entities of an industrial environment where collaboration is necessary, decentralising decision-making processes and aiding the manufacturing process. It is designed to be capable of representing and supporting the complex structure of dependencies created between entities, improving decision-making processes, and facilitating relationships through collaboration. Organisations need to look at the individualisation of customers' requirements, where the goal is to deliver varied goods that satisfy small customer groups with specific needs while reducing production costs and focusing on customisation, flexibility, and responsiveness.
Fig. 1. Industrial collaboration model that includes the consumer. On the left: network of entities from different organisations (A, B, C) based on a MAS, in which each agent type (MA, CA, PA, and CsA) is drawn with its own symbol (CA with ♦). On the right: the reasoning and interaction layer and a consortium blockchain node representation, used for knowledge representation.
The presence of an organisation in the collaboration network does not mean that it is associated with, or part of, another organisation. Instead, that organisation is allowed to use the resources that are available in the network. As a result, organisations A, B, or C might not have any form of business relationship between them, or even any past interaction/transaction. Although the initial definition of this model is oriented towards the increase in demand for customised products, the model can work as a collaborative network for any manufacturing process.
3.1 Collaborative Network of Entities
This initial part of the model is achieved through the creation of a network, as suggested by the work of Schuh et al. [25], where entities can collaborate towards a stronger cooperation and each can achieve its targets. We build this network of entities with an approach similar to the model presented by Wang et al. [31], who define a MAS that includes:
– Machine Agent (MA): represents all the equipment that performs any production or test activity;
– Conveying Agent (CA): represents all entities that move a product, like robots, conveyor belts, and others;
– Product Agent (PA): represents the products that are or will be processed by MA and transported by CA.
Our model proposal, represented in Fig. 1, includes the previous agents in its network of entities, each drawn with its own symbol (for example, CA are represented by ♦). The objective of this network is not to make the entities belonging to it appear and operate like a single larger entity. Instead, the point of the network is to encapsulate the different entities and their relationships in the same environment, so that the other components of the proposed model can be applied in an organised setup. Each entity has knowledge of all other entities present in the network, including their function, inputs, outputs, and credibility (discussed in Sect. 3.3).
A new type of product is created through smart manufacturing: the intelligent product. These products contain embedded sensors, identifiable components, and processors that carry information and knowledge to convey functional guidelines to the production system, including information about their production requirements and the equipment required. In this way, each PA knows, at any given moment, all the steps it has already taken, all the MA it has passed through, the remaining steps, and which MA are needed for its completion.
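The self-knowledge attributed to a PA above can be illustrated with a minimal Python sketch. The class name, its fields, and the representation of a production plan as (step, MA) pairs are assumptions made for illustration; the paper does not prescribe a data structure.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProductAgent:
    """Hypothetical Product Agent (PA) that tracks its own production route."""

    name: str
    # Ordered production plan as (step name, required Machine Agent id) pairs.
    plan: list[tuple[str, str]]
    # Steps already executed, in order, with the MA that performed them.
    completed: list[tuple[str, str]] = field(default_factory=list)

    def record_step(self, machine_id: str) -> str:
        """Mark the next planned step as done by the given Machine Agent."""
        if len(self.completed) == len(self.plan):
            raise ValueError("product is already complete")
        step, required = self.plan[len(self.completed)]
        if machine_id != required:
            raise ValueError(f"step {step!r} requires {required}, not {machine_id}")
        self.completed.append((step, machine_id))
        return step

    def visited_machines(self) -> list[str]:
        """All MA the product has passed through so far."""
        return [machine for _, machine in self.completed]

    def remaining_steps(self) -> list[str]:
        """Steps that have not been executed yet."""
        return [step for step, _ in self.plan[len(self.completed):]]

    def required_machines(self) -> list[str]:
        """MA still needed for completion, in plan order, without duplicates."""
        needed: list[str] = []
        for _, machine in self.plan[len(self.completed):]:
            if machine not in needed:
                needed.append(machine)
        return needed


pa = ProductAgent("PA1", [("cut", "MA1"), ("drill", "MA2"), ("test", "MA1")])
pa.record_step("MA1")
print(pa.visited_machines(), pa.remaining_steps(), pa.required_machines())
```

Keeping the plan and the history in the agent itself mirrors the idea of an intelligent product that carries its own production requirements, instead of relying on a central scheduler.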
The main contribution of this proposal is the introduction of a new actor (the consumer), represented by an agent that holds criteria representing the consumer's preferences and needs. This Consumer Agent (CsA), drawn in Fig. 1 with its own symbol, represents a consumer (or a group of consumers). In customisation scenarios, the needs and preferences represented by the CsA have to be negotiated with the other agents. This cooperation is essential to understand the feasibility of the product, taking into account the existing raw material and the current processes performed by the MA and other entities present in the network. MAS already contemplate negotiation processes between agents, and the inclusion of the CsA, and its integration into the system, creates a need to redefine or adjust existing negotiation processes.
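The paper does not specify a negotiation protocol, so the following Python sketch is only one possible reading of the feasibility check described above. It assumes that a CsA ranks its preferred options per criterion and that each MA advertises the options it can produce; all names and the first-feasible-option rule are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConsumerAgent:
    """Hypothetical CsA: maps each criterion to preferred options, best first."""

    name: str
    preferences: dict


@dataclass
class MachineAgent:
    """Hypothetical MA: maps each criterion to the set of options it can produce."""

    name: str
    capabilities: dict


def negotiate(csa, machines):
    """Pick, per criterion, the most preferred option that some MA can deliver.

    Returns (agreed, unmet): agreed maps a criterion to (option, provider names);
    unmet lists the criteria for which no advertised capability matches.
    """
    agreed, unmet = {}, []
    for criterion, options in csa.preferences.items():
        for option in options:
            providers = [
                m.name for m in machines
                if option in m.capabilities.get(criterion, set())
            ]
            if providers:
                agreed[criterion] = (option, providers)
                break
        else:
            # No preferred option is feasible: the CsA would have to relax it.
            unmet.append(criterion)
    return agreed, unmet


csa = ConsumerAgent("CsA1", {"colour": ["teal", "blue"], "engraving": ["laser"]})
machines = [
    MachineAgent("MA1", {"colour": {"blue", "red"}}),
    MachineAgent("MA2", {"colour": {"blue"}}),
]
agreed, unmet = negotiate(csa, machines)
print(agreed, unmet)
```

Raw material availability could be handled in the same way, by letting a supplier entity advertise materials as capabilities; a real negotiation would also iterate over counter-proposals instead of settling in a single pass.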
The goal that each CsA intends to achieve is directly tied to the consumer or group of consumers it represents and, more specifically, to their needs and preferences. Capturing these criteria is not the main focus of this proposal, but in future work we will address the possibility of including external sources that can help identify consumer needs. For now, we assume that these needs and preferences are known and correctly represented by the CsA. Additionally, the inclusion of the consumer is a direct response to the needs of social manufacturing and to the consumer's inclusion in the entire product life cycle. This model is initially focused on the inclusion of the consumer in the manufacturing process (design, manufacturing, disposal), but can be extended to other processes, including decision-making regarding the selection of materials and suppliers.
3.2 Reasoning and Interaction
The second part of this model is based on the MAS and is intended to handle the reasoning and the interactions between entities, deciding which entities in the network are the best to interact with in each situation. As a direct response to the diverse consumer demands (represented by the CsA), each entity needs to connect and work effectively and efficiently with others, making the entity-to-entity relationship critical for the success of this model. The selection of the right entities for the manufacturing of a product (whose characteristics are represented by the PA as a result of a negotiation process with the CsA) is the main purpose of this layer.
The MAS proposed in the reasoning and interaction layer is based on the methodology presented by Nikraz et al. [22] and the work of Ghadimi et al. [12]. These works focus on the key issues in the analysis and design of a MAS, with special attention to the analysis and design phases, which are based on the Foundation for Intelligent Physical Agents (FIPA) standards. To design the system, agent types are identified, categorised, and refined during the analysis phase. The analysis starts with an initial identification of agent types based on two rules: (1) add one type of agent per user/device; (2) add one type of agent per resource. This step is followed by the identification of responsibilities, where an initial list of the main responsibilities of each agent is created. This proposal defines the following agents: Blockchain Agent (BA), Entity Agent (EA), and Decision Maker Agent (DMA). The following responsibilities were defined for each of these agents:
– Blockchain Agent (BA):
  1. Receives the entity data from the EA;
  2. Saves the data from the EA in a blockchain transaction;
  3. Informs the EA that the data was saved;
  4. Receives a data request from the EA;
  5. Returns data results to the EA;
  6. Receives data requests from the DMA;
  7. Returns results to the DMA;
– Entity Agent (EA):
  1. Requests data from the BA;
  2. Sends data to the BA to add to its public profile;
  3. Sends data to the BA to add to its private profile;
  4. Receives data from the BA;
  5. Requests results from the DMA;
  6. Receives results from the DMA;
– Decision Maker Agent (DMA):
  1. Starts the decision-making process;
  2. Requests data from the BA;
  3. Receives data;
  4. Evaluates the entities involved;
  5. Sends data to the BA;
  6. Informs all EA involved.
The process then focuses on the identification of acquaintances, where all the possible interactions must be identified. The analysis ends with agent refinement, where a set of considerations is applied:
– Support: what supporting information agents need to accomplish their responsibilities, and how, when, and where this information is generated/stored;
– Discovery: how agents linked by an acquaintance relation discover each other;
– Management and monitoring: whether the system is required to keep track of existing agents, or whether there is a need to create or demand other agents.
How each agent relates to another is defined in the form of communications and interactions, with messages being sent between sender and receiver. To specify the system interactions, Nikraz et al. [22] advise creating an interaction table that considers each agent's responsibilities, including:
– A description of the interaction;
– The responsibility (identified by a corresponding number);
– An interaction protocol to implement the interaction;
– The role played by the agent (Initiator or Responder);
– The agent name of the complementary role;
– A description of the trigger condition that initiates the interaction.
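An interaction table with the six fields listed above can be sketched in Python as follows. The responsibility numbers come from the BA, EA, and DMA lists in this section; the choice of the FIPA Request protocol for every row, the trigger wording, and the consistency check are assumptions for illustration and do not appear in the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    INITIATOR = "I"
    RESPONDER = "R"


@dataclass(frozen=True)
class Interaction:
    agent: str           # agent that owns this row of the table
    description: str     # what the interaction is about
    responsibility: int  # number in that agent's responsibility list
    protocol: str        # interaction protocol (assumed here)
    role: Role           # role played by the owning agent
    partner: str         # agent playing the complementary role
    trigger: str         # condition that starts the interaction


TABLE = [
    Interaction("EA", "Send public profile data", 2, "FIPA Request",
                Role.INITIATOR, "BA", "entity data changed"),
    Interaction("BA", "Receive entity data", 1, "FIPA Request",
                Role.RESPONDER, "EA", "request received from EA"),
    Interaction("DMA", "Request data from BA", 2, "FIPA Request",
                Role.INITIATOR, "BA", "decision-making process started"),
    Interaction("BA", "Receive data requests from DMA", 6, "FIPA Request",
                Role.RESPONDER, "DMA", "request received from DMA"),
]


def unmatched_initiators(table):
    """Return Initiator rows whose partner has no matching Responder row."""
    responders = {
        (row.agent, row.partner, row.protocol)
        for row in table
        if row.role is Role.RESPONDER
    }
    return [
        row for row in table
        if row.role is Role.INITIATOR
        and (row.partner, row.agent, row.protocol) not in responders
    ]


print(unmatched_initiators(TABLE))  # an empty list means every request has a responder
```

A check of this kind is a cheap way to confirm that the acquaintance identification step left no interaction without a counterpart before any agent code is written.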
3.3 Knowledge Representation
The final part of this model is responsible for handling the knowledge representation that supports the model as a whole. The model uses a blockchain to store entity and transaction data, providing a shared, immutable, and transparent append-only register of all actions that have happened to all the participants in the
network. This is achieved through the adoption of a consortium blockchain (a middle ground between the low trust provided by the public blockchain and the ‘single entity that rules everything’ of the private blockchain) [19], since it provides many of the benefits found in a private blockchain (like efficiency, transactions, and data access privacy) without consolidating power in one entity, while maintaining the decentralisation of the decision-making process. This strategy is highly beneficial for collaboration between entities, since a consortium blockchain operates under the leadership of a group instead of a single entity. Transaction and general data on the blockchain are also controlled using permissions, managed by the network. These overall system rules are easier to manage and are capable of achieving better protection against external disturbances (when compared to other solutions).
Entity data is created and registered for each entity that is inserted in the network, and is used to identify entities and ease the collaboration process. As a result, each entity is represented by a public and a private profile. The public profile contains data that is accessed by the network participants, and is used to validate and evaluate which entities should be approached to collaborate in a specific manufacturing process. This profile aids the identification of an entity in the network and stores the following values:
– Inputs: represent the needs of the entity, namely what it needs from the network to fulfil its processes. These inputs can be raw materials, maintenance needs, transportation services, among others. This value can be read by every participant of the network, but each entity can only update its own input values.
– Outputs: represent what an entity offers to the network. Each entity has a set of needs that it wants fulfilled (inputs) and can have a set of outputs (what it can offer/produce) that can be used as inputs by other entities. Ultimately, an output of one entity might represent the input of another.
– Credibility: a value attached to the public profile of an entity that represents how the entity is perceived by the other entities in the network. Defined as the quality of being trusted and believed in, this variable takes a value from zero to one (where zero is no trust and one is absolute trust) that, based on previous interactions, represents how much the network trusts a specific entity. Despite its presence in the public profile, this value cannot be adulterated.
In the specific case of the CsA, the inputs define the needs and preferences of the consumer or group of consumers that it represents. Initially these needs can be related to a product they want manufactured, but in further expansions of this model they can also be related to specific preferences such as processes, suppliers, or even raw materials.
The private profile stores data regarding the level of confidence that a single entity has in every other entity of the network. In our proposal, one entity can have a certain level of confidence in another entity regardless of the level of confidence that the other entities have in it. This confidence is represented by a range of values (from zero to one, where zero is no trust and one is absolute trust)
and is only accessible by its entity. This value is updated each time a transaction is performed between entities. The combination of confidence and credibility values is critical for the success of this model. Credibility can be described along four axes: trustworthiness, expertise, reliability, and quality. The first two axes relate to the credibility of the entity itself, while the latter two relate to the credibility of the transaction performed. In this model, credibility provides a means for an entity to be individually classified by others, while the simpler and more direct notion of confidence gives an entity a way of storing its own evaluation of each other entity, based on their previous interactions. For example: MA1 has a low level of credibility, but due to previous successful collaborations with MA1, PA1 has a high value of confidence in MA1, which allows PA1 to trust MA1 and establish more transactions.
As for the blockchain that supports the knowledge representation layer of this model, it requires transparency and privacy features, and therefore an infrastructure that can provide such characteristics. As a result, this work relies on Hyperledger Fabric (HF) for knowledge representation. Similar to other blockchain technologies, HF has a ledger, uses smart contracts, and is a system where participants manage their transactions. HF differs from other blockchains in that it is not an open system that allows unknown entities to participate in the network; instead, its members need special authorisation and validation to be part of the network [10]. It is an implementation of a distributed ledger platform for running smart contracts, leveraging familiar and proven technologies, with a modular architecture that allows pluggable implementations of various functions [5].
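The public and private profiles, and the MA1/PA1 example, can be illustrated with a short Python sketch. The paper states only that both values lie between zero and one and that confidence is updated after each transaction; the exponential moving average used for the update and the rule that falls back to credibility when no history exists are assumptions, as are all class and function names.

```python
from dataclasses import dataclass, field


def check_unit(value):
    """Trust values lie in [0, 1]: 0 is no trust, 1 is absolute trust."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("trust values must lie in [0, 1]")
    return value


@dataclass
class PublicProfile:
    inputs: set         # what the entity needs from the network
    outputs: set        # what the entity offers to the network
    credibility: float  # how the network as a whole perceives the entity

    def __post_init__(self):
        check_unit(self.credibility)


@dataclass
class PrivateProfile:
    # Confidence this entity has in each other entity, keyed by entity name.
    confidence: dict = field(default_factory=dict)

    def record_transaction(self, partner, outcome, alpha=0.5):
        """Update confidence after a transaction (assumed moving-average rule)."""
        check_unit(outcome)
        previous = self.confidence.get(partner, outcome)
        self.confidence[partner] = (1 - alpha) * previous + alpha * outcome
        return self.confidence[partner]


def trust(own, network, partner):
    """Prefer first-hand confidence; fall back to public credibility."""
    return own.confidence.get(partner, network[partner].credibility)


network = {"MA1": PublicProfile({"steel"}, {"milling"}, credibility=0.2)}

pa1 = PrivateProfile()               # PA1 has worked with MA1 before
pa1.record_transaction("MA1", 1.0)
pa1.record_transaction("MA1", 0.8)
pa2 = PrivateProfile()               # PA2 has no history with MA1

print(trust(pa1, network, "MA1") > trust(pa2, network, "MA1"))  # True
```

In the actual model these values would live on the ledger, with write access to credibility restricted by the network so that an entity cannot adjust its own score.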
This particular blockchain architecture introduced by HF is called “execute-order-validate”, and a distributed application for Fabric consists of two parts:
1. Smart contract (chaincode): the central part of a distributed application in Fabric, with special chaincodes existing to manage the blockchain system and maintain its parameters. Chaincode is invoked by an application external to the blockchain when there is a need to interact with the ledger.
2. An endorsement policy that is evaluated in the validation phase. This policy acts as a static library for the validation of transactions, which can only be parameterised by the chaincode. A typical endorsement policy allows the chaincode to specify the endorsers for a transaction in the form of a set of peers. This set of peers is defined as the smallest set of entities required to endorse a transaction for it to be valid. To endorse, an endorsing peer needs to run the smart contract associated with the transaction and sign its outcome.
In HF, a ledger consists of two distinct parts, a world state and a blockchain. The world state is a database that holds the current values of the ledger state, making them easy to access, while the blockchain works as a transaction log that registers every change that led to the current world state. The world state is implemented as a database, providing a rich set of operations for the efficient storage and retrieval of states. When a transaction that implies changes to the
world state is submitted by invoking a smart contract, it ends up being committed to the blockchain, and a notification about the validity of the transaction is later sent to its committer. In addition to representing and registering every transaction performed in the network (and its participants), this knowledge representation layer is also capable of representing a product life-cycle by analysing each transaction performed by a PA.
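The split between world state and transaction log described above can be illustrated with a minimal sketch. This is not Hyperledger Fabric code: the class `Ledger`, its methods, and the key `PA1.status` are hypothetical names chosen for illustration, and the assumption that invalid transactions are logged but never applied to the world state is ours.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy ledger: an append-only transaction log plus a key-value world state."""
    world_state: dict = field(default_factory=dict)   # current values only
    blockchain: list = field(default_factory=list)    # every recorded transaction

    def commit(self, key, value, valid=True):
        # Assumption of this sketch: every transaction is logged, but only
        # valid ones change the world state.
        self.blockchain.append({"key": key, "value": value, "valid": valid})
        if valid:
            self.world_state[key] = value
        return valid  # stands in for the validity notification

    def replay(self):
        # Rebuild the world state from the log alone.
        state = {}
        for tx in self.blockchain:
            if tx["valid"]:
                state[tx["key"]] = tx["value"]
        return state

    def history(self, key):
        # All valid values a key has taken, oldest first.
        return [tx["value"] for tx in self.blockchain
                if tx["key"] == key and tx["valid"]]


ledger = Ledger()
ledger.commit("PA1.status", "ordered")
ledger.commit("PA1.status", "machined")
ledger.commit("PA1.status", "tampered", valid=False)
assert ledger.world_state == ledger.replay()
```

Here `history("PA1.status")` plays the role of the product life-cycle view: it is derived purely from the transactions logged for one PA, while `world_state` only answers what the current value is.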
4 Conclusion and Future Work
While manufacturing processes are evolving under I4.0, taking advantage of the amount of data produced and the digitalisation of manufacturing pipelines, organisations still face a variety of challenges. One of those challenges is the increasing consumer demand for customised products, which is shifting the manufacturing paradigm towards mass customisation. This specific challenge requires organisations to adapt their manufacturing process to produce multiple products (or the same product with different variations) without having to make significant changes to their production line, while minimising downtime. Besides the need to manufacture customised products, organisations will need to gather the necessary conditions to ensure their quick adaptation to a changing environment (motivated by trends and social influence), and to ensure that they have the required materials and services to answer the manufacturing needs. One solution to this problem can be found in collaboration, which, besides providing a solution to the increase in demand for customised products, can also act as a solution for many other challenges in I4.0. Collaboration is an open and transparent environment where information is shared and the actors can work together to solve a common problem. The model presented in this work is our solution to the need for collaboration between organisations and the satisfaction of consumers’ customised demands. We proposed a model definition for an industrial collaboration network composed of a network of entities, a reasoning and interaction layer, and knowledge representation using blockchain. Although the combination of MAS and blockchain is not novel, and existing works have proposed a similar base infrastructure, the novelty of this proposal lies in the inclusion of the consumer.
The initial portion of our model is a network of entities, based on a MAS, where each agent represents an entity that is directly related to the manufacturing process of a product, namely: Machine Agent (MA), Conveying Agent (CA), Product Agent (PA), and Consumer Agent (CsA). This network of entities is composed of different types of agents, belonging to different organisations, that share a common objective: to collaborate to solve an existing problem, which in this scenario is the manufacturing of a product. The knowledge representation uses Hyperledger Fabric and is the entry point for all the information in the network. Creating a solid way of structuring and saving the data makes it possible that, for each entity and its interactions,
the data is stored and shared with all the entities, while keeping the information secure and making sure that stored information cannot be tampered with. Entity information contains data that helps create each organisation’s profile and supports the decision-making process, creating a way for network participants to evaluate and classify each other’s performance when collaborating. The decision-making portion relies on a multi-agent system that interacts with the Hyperledger Fabric blockchain to gather the data needed to choose the right entity to collaborate with. This is crucial to help stakeholders and decision makers streamline their decision-making process, which can be the difference between acting in time and solving a problem, or failing. As for the limitations of this work, the first that should be addressed is the usage of blockchain. Is it the right solution for this model? Despite the current success of cryptocurrencies, and the combination of MAS and blockchain being well documented in the literature, the application of this technology in the real world is still limited and often associated with a certain level of distrust. However, blockchain aligns with our proposal, and a consortium blockchain provides a way to create interactions among a group of entities that exchange funds, goods, or information, while none are willing to agree on a trusted third party. Also, the usage of smart contracts can simplify trust-less protocols between multiple parties, while the details of the contract remain hidden from other network entities, and can decentralise the decision-making process. However, there are some limitations. The MAS developed still lacks maturity in some areas, namely when it comes to the actions of the agents.
An entity that can potentially affect the operation of the model is the Decision Making Agent (DMA), through its behaviour and actions. It is important to consider which decision-making framework or algorithm, such as a Markov decision process or a fuzzy inference system, should be used and how it could affect the model. This would enable the model to be developed even further. It is also noted that, since different organisations will be sharing their resources and sensitive data may be available, there is a concern for security and privacy. At the moment, this work relies on the underlying concepts of privacy that come attached to blockchain technology.
References
1. Abeyratne, S.A., Monfared, R.P.: Blockchain ready manufacturing supply chain using distributed ledger. Int. J. Res. Eng. Technol. 5(9), 1–10 (2016)
2. Adeyeri, M.K., Mpofu, K., Adenuga, O.T.: Integration of agent technology into manufacturing enterprise: a review and platform for industry 4.0. In: 2015 International Conference on Industrial Engineering and Operations Management (IEOM), pp. 1–10. IEEE (2015)
3. Agostini, L., Filippini, R.: Organizational and managerial challenges in the path toward industry 4.0. Eur. J. Innov. Manag. 22, 406–421 (2019)
4. Aste, T., Tasca, P., Di Matteo, T.: Blockchain technologies: the foreseeable impact on society and industry. Computer 50(9), 18–28 (2017)
5. Cachin, C., et al.: Architecture of the hyperledger blockchain fabric. In: Workshop on Distributed Cryptocurrencies and Consensus Ledgers, vol. 310 (2016)
6. Carneiro, D., Novais, P.: New applications of ambient intelligence. In: Ramos, C., Novais, P., Nihan, C.E., Corchado Rodríguez, J.M. (eds.) Ambient Intelligence Software and Applications. AISC, vol. 291, pp. 225–232. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-07596-9_25
7. Casado-Vara, R., Prieto, J., De la Prieta, F., Corchado, J.M.: How blockchain improves the supply chain: case study alimentary supply chain. Procedia Comput. Sci. 134, 393–398 (2018). The 15th International Conference on Mobile Systems and Pervasive Computing (MobiSPC 2018)/The 13th International Conference on Future Networks and Communications (FNC-2018)/Affiliated Workshops
8. Christidis, K., Devetsikiotis, M.: Blockchains and smart contracts for the Internet of Things. IEEE Access 4, 2292–2303 (2016)
9. Deloitte: Industry 4.0. Challenges and solutions for the digital transformation and use of exponential technologies, pp. 1–30. Deloitte (2015)
10. Dib, O., Brousmiche, K.-L., Durand, A., Thea, E., Hamida, E.B.: Consortium blockchains: overview, applications and challenges. Int. J. Adv. Telecommun. 11(1 & 2), 51–64 (2018)
11. Digital Transformation of Industrial Ecosystems (Unit A.4): Smart Manufacturing - Shaping Europe’s digital future, October 2020
12. Ghadimi, P., Toosi, F.G., Heavey, C.: A multi-agent systems approach for sustainable supplier selection and order allocation in a partnership supply chain. Eur. J. Oper. Res. 269(1), 286–301 (2018)
13. Ghadimi, P., Wang, C., Lim, M., Heavey, C.: Intelligent sustainable supplier selection using multi-agent technology: theory and application for industry 4.0 supply chains. Comput. Ind. Eng. 127, 588–600 (2019)
14. Kang, H.S., et al.: Smart manufacturing: past research, present findings, and future directions. Int. J. Precis. Eng. Manuf.-Green Technol. 3(1), 111–128 (2016). https://doi.org/10.1007/s40684-016-0015-5
15. Lee, J.: Industry 4.0 in big data environment. Ger. Harting Mag. 1(1), 8–10 (2013)
16. Lee, J., Bagheri, B., Kao, H.-A.: A cyber-physical systems architecture for industry 4.0-based manufacturing systems. Manuf. Lett. 3, 18–23 (2015)
17. Lee, J., Kao, H.-A., Yang, S.: Service innovation and smart analytics for industry 4.0 and big data environment. Procedia CIRP 16, 3–8 (2014)
18. Lin, I.-C., Liao, T.-C.: A survey of blockchain security issues and challenges. IJ Netw. Secur. 19(5), 653–659 (2017)
19. Du, M., Ma, X., Zhe, Z., Wang, X., Chen, Q.: A review on consensus algorithm of blockchain. In: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 2567–2572 (2017)
20. Monostori, L., et al.: Cyber-physical systems in manufacturing. CIRP Ann. 65(2), 621–641 (2016)
21. Nakamoto, S.: Bitcoin: a peer-to-peer electronic cash system. Bitcoin, April 2008
22. Nikraz, M., Caire, G., Bahri, P.A.: A methodology for the analysis and design of multi-agent systems using JADE. Technical report (2006)
23. Oliver, A.L.: On the duality of competition and collaboration: network-based knowledge relations in the biotechnology industry. Scand. J. Manag. 20(1–2), 151–171 (2004)
24. Olsen, T., Tomlin, B.: Industry 4.0: Opportunities and challenges for operations management. Manuf. Serv. Oper. Manag. 22, 113–122 (2020)
25. Schuh, G., Potente, T., Wesch-Potente, C., Weber, A.R., Prote, J.-P.: Collaboration mechanisms to increase productivity in the context of industrie 4.0. Procedia CIRP 19, 51–56 (2014)
26. Shafiq, S.I., Sanin, C., Szczerbicki, E., Toro, C.: Virtual engineering object/virtual engineering process: a specialized form of cyber physical system for industrie 4.0. Procedia Comput. Sci. 60, 1146–1155 (2015). Proceedings of the 19th Annual Conference on Knowledge-Based and Intelligent Information & Engineering Systems, KES-2015, Singapore, September 2015
27. Stevens, G.C.: Integrating the supply chain. Int. J. Phys. Distrib. Mater. Manag. 19(8), 3–8 (1989)
28. Stone, P., Veloso, M.: Multiagent systems: a survey from a machine learning perspective. Auton. Robots 8, 345–383 (2000). https://doi.org/10.1023/A:1008942012299
29. Szabo, N.: Smart contracts: building blocks for digital markets. EXTROPY: J. Transhumanist Thought (16), 18(2) (1996)
30. Tschannen-Moran, M.: Collaboration and the need for trust. J. Educ. Adm. 39(4), 308–331 (2001)
31. Wang, S., Wan, J., Zhang, D., Li, D., Zhang, C.: Towards smart factory for industry 4.0: a self-organized multi-agent system with big data based feedback and coordination. Comput. Netw. 101, 158–168 (2016)
32. Wüst, K., Gervais, A.: Do you need a blockchain? In: 2018 Crypto Valley Conference on Blockchain Technology (CVCBT), pp. 45–54 (2018)
33. Zhang, F., Liu, M., Shen, W.: Operation modes of smart factory for high-end equipment manufacturing in the Internet and Big Data era. In: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Banff, AB, pp. 152–157. IEEE, October 2017
Cellular Formation Maintenance and Collision Avoidance Using Centroid-Based Point Set Registration in a Swarm of Drones

Jawad N. Yasin1(B), Huma Mahboob2, Mohammad-Hashem Haghbayan1, Muhammad Mehboob Yasin3, and Juha Plosila1

1 Autonomous Systems Laboratory, Department of Future Technologies, University of Turku, Vesilinnantie 5, 20500 Turku, Finland {janaya,mohhag,juplos}@utu.fi
2 Connected Shopping Ltd., Thetford, UK [email protected]
3 Department of Computer Networks, College of Computer Sciences and Information Technology, King Faisal University, Hofuf, Saudi Arabia [email protected]

Abstract. This work focuses on low-energy collision avoidance and formation maintenance in autonomous swarms of drones. Here, the two main problems are: 1) how to avoid collisions by temporarily breaking the formation, i.e., collision avoidance reformation, and 2) how to perform such reformation while minimizing the deviation, and thereby the overall time and energy consumption of the drones. To address the first question, we use a cellular automata based technique to find an efficient formation that avoids the obstacle while minimizing time and energy. Concerning the second question, a near-optimal reformation of the swarm after successful collision avoidance is achieved by applying a temperature function reduction technique, originally used in the point set registration process. The goal of the reformation process is to remove the disturbance while minimizing the overall time it takes for the swarm to reach the destination, and consequently reducing the energy consumption required by this operation. To measure the degree of formation disturbance due to collision avoidance, the deviation of the centroid of the swarm formation is used, inspired by the concept of the center of mass in classical mechanics. Experimental results show the efficiency of the proposed technique in terms of performance and energy.

Keywords: Multi-agent system · Formation maintenance · Swarm intelligence · Collision avoidance · Point set registration

1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 391–408, 2022. https://doi.org/10.1007/978-3-030-82193-7_26

A swarm is a concept that seems to have no precise definition in literature as such; instead, we find a lot of definitions and discussion addressing swarming i.e. swarm behaviour [11]. Swarm robotics can be classified as the study
of how a system consisting of a large number of relatively simple agents can be designed to attain a desired cumulative behaviour based on the interactions between the agents themselves and between the agents and the environment [7,23]. Due to their ability to work collaboratively, swarms of autonomous agents offer significant advantages over the use of a single agent, and they are therefore in high demand in diverse fields such as search and rescue, surveying and mapping, inspection, and delivery, in both military and civilian/commercial contexts [22]. Consequently, the research community is increasingly interested in optimizing various autonomy characteristics of drone swarms, for instance collision avoidance, formation maintenance, resource allocation, and navigation [5,13]. In swarm navigation and formation flight, collision avoidance and maintenance of the formation are the most important problems [20,25]. Formation control methodologies can be categorized into three main approaches: 1) the leader-follower based approach, where every node/drone works autonomously and individually, maintaining a given formation as closely as possible by adjusting its position with respect to its neighbours and the leader drone [17,22,24]; 2) the behaviour based approach, where, based on a pre-determined strategy, one behaviour is chosen out of the available ones [3,14]; and 3) the virtual structure based approach, where the swarm is considered a single entity, i.e. effectively a single large drone, and is navigated along a trajectory accordingly [6,15]. Cellular automata based modelling provides an environment in which each cell can decide its movement by looking only at its neighbours and the environment, based on rules that are defined for each individual cell dynamically at run-time [1].
The cellular automata model in our collision avoidance algorithm gives us the opportunity to reform the system so that it passes the obstacle solely by defining distributed rules for each individual drone. In other words, to avoid the collision we do not need a central processing element that defines the path of each drone. Instead, each node individually passes the obstacle through dynamic and flexible movement, so that the overall time and energy consumption of the swarm is optimized. To define these rules for individual drones, we make use of genetic algorithm techniques, which are highly compatible with the cellular automata model. A genetic algorithm (GA) is one of the simplest random-based classical evolutionary computing methods, where random changes are applied to the current solutions to generate new solutions for finding an optimal or near-optimal solution [8]. GAs work by generating potential random solutions, selecting the best solution by calculating the distance of each solution to the destination, generating new solutions based on the good solutions found so far, and repeating these steps in order to reach the desired result [2,18]. The ability of GAs to converge close to the global optimum and their relatively simple implementation make them quite popular among available optimization heuristics [4]. Point set registration is a commonly used method, playing an important role in various applications such as image retrieval, 3D reconstruction, shape and object recognition, and SLAM [9]. In point set registration, the correlation
between two point sets is determined in order to retrieve the required transformation that maps one point set to the other [16]. Our algorithm has two phases:
– The first phase is a cellular automata based collision avoidance scheme that disturbs the original formation to pass the obstacle.
– The second phase is a re-formation scheme, inspired by point set registration, that returns the swarm from the most disturbed formation to the original formation.
In this paper, the leader-follower based approach is utilized for drone swarm control due to its reliability, ease of implementation and analysis, and scalability [12,19]. However, in our proposed solution, there is no unique global leader, as the leader is changed dynamically based on certain constraints. The cellular formation and collision avoidance algorithms are integrated with a simple GA-inspired approach and a point set registration method in order to optimize the collision avoidance and re-formation phases. The goal is to calculate the escape routes and select a near-optimal path upon detection of an obstacle, with minimal deviation from the original route. Once a defined danger zone has been passed, centroid-based point set registration (CPSR) is used in the formation maintenance algorithm to reconstruct the formation by optimally bringing back each drone that lost its position when avoiding a collision with the detected moving obstacle. Using a GA-inspired approach for collision avoidance in a swarm of robots/UAVs is beneficial, as it allows the algorithm to check all possible maneuvers and select the best solution depending on pre-defined constraints such as the minimal movement requirement and power consumption limitations. Furthermore, with the help of CPSR, once a collision has been avoided, the formation can be re-established swiftly and optimally by bringing the UAVs back to the desired formation shape.
CPSR facilitates this dynamic recovery process and autonomous switching of the swarm leader according to the requirements posed by the scenario at hand. This can be very helpful especially in cases where the time to complete a mission is critical. The rest of the paper is structured as follows. In Sect. 2, the proposed algorithm is described. Section 3 provides the simulation results. Finally, Sect. 4 concludes the paper with some discussion and comments on future work.
2 Proposed Approach
The general pseudo code of the proposed approach is given in Algorithm 1. We start with the assumption that the leader-follower connection has been established and that the formation is already maintained before a mission starts. By utilizing the on-board processing units, this top-level algorithm is executed by every individual node locally. Algorithm 1 starts by initializing the Boolean variable/flag FLAGobs (Line 2) whose role is to indicate absence (False) or presence (True) of an obstacle. Then the target shape (TShape) of the swarm
is initialized based on the current state or position of each node with respect to the others (Line 3). TShape is the next targeted formation shape, calculated at every interval for the next time interval, and it determines the next target position for each node to move to. After these initial steps, the main loop (Lines 4–13) is entered. First, the procedure ObstacleDetection is called, and the variables/signals it calculates provide information on the presence and characteristics of a potential obstacle (Line 5). In case an obstacle is detected, i.e. FLAGobs == True, the procedure CollisionAvoidance is called, and TShape is set up according to the feedback received (Lines 6–7). After this, once a collision has been successfully avoided, FLAGobs is reset to False (Line 8). On the other hand, if no obstacle is detected, i.e. if FLAGobs == False holds after the execution of ObstacleDetection, then TShape is updated without involving the collision avoidance procedure (Line 10). Finally, the point set registration based re-formation algorithm PS-ReFormation is called to re-establish the desired formation (Line 12). This has an effect only if the formation is distorted due to a collision avoidance event. It is important to note that, in the point set registration of the reformation process, it is crucial to calculate the mapping between the current and desired shapes of the swarm optimally and rapidly. This is especially the case when complicated movements are involved that drastically change the TShape of the swarm, for instance when the angle of the leader’s movement changes strongly due to the presence of an obstacle on the path.
Algorithm 1. Global Routine
 1: procedure Obstacle Detection & Navigation
 2:     FLAGobs ← False;
 3:     TShape ← Initialization based on current state;
 4:     while True do
 5:         FLAGobs, Dobs, Aobs, Vobs, Dimpact ← Obstacle Detection();
 6:         if FLAGobs then
 7:             TShape ← CollisionAvoidance(Dobs, Aobs, Vobs, Dimpact);
 8:             FLAGobs ← False;
 9:         else
10:             Update TShape;
11:         end if
12:         PS-Reformation(TShape);
13:     end while
14: end procedure
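The control flow of Algorithm 1 can be sketched in Python as follows. This is only a structural illustration under stated assumptions: the four callbacks (`detect`, `avoid`, `update_shape`, `reform`) and the `max_steps` bound are hypothetical stand-ins for the procedures named in the pseudocode, and `detect` is assumed to always return a five-element tuple, even when nothing is in range.

```python
def global_routine(detect, avoid, update_shape, reform, t_shape, max_steps):
    """One node's control loop, following the structure of Algorithm 1.

    detect() -> (flag, d_obs, a_obs, v_obs, d_impact)
    avoid(d_obs, a_obs, v_obs, d_impact) -> new target shape
    update_shape(t_shape) -> next target shape when the path is clear
    reform(t_shape) -> moves the node towards its slot in t_shape
    """
    flag_obs = False
    for _ in range(max_steps):      # the pseudocode loops forever; bounded here
        flag_obs, d_obs, a_obs, v_obs, d_impact = detect()
        if flag_obs:
            t_shape = avoid(d_obs, a_obs, v_obs, d_impact)
            flag_obs = False        # reset once avoidance has been handled
        else:
            t_shape = update_shape(t_shape)
        reform(t_shape)             # re-formation step runs on every iteration
    return t_shape


# Stub callbacks: an obstacle appears only on the second reading.
calls = []
readings = iter([
    (False, None, None, None, None),
    (True, 5.0, (-0.2, 0.3), -1.0, 4.0),
    (False, None, None, None, None),
])
final = global_routine(
    detect=lambda: next(readings),
    avoid=lambda d, a, v, di: "avoid",
    update_shape=lambda shape: "cruise",
    reform=calls.append,
    t_shape="initial",
    max_steps=3,
)
```

With these stubs, `calls` records which target shape was handed to the re-formation step at each iteration, which makes the branch taken on each pass easy to inspect.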
Algorithm 2. Obstacle Detection
1: procedure Obstacle Detection()
2:     if obstacle in Detection Range then
3:         FLAGobs ← True;
4:         Dobs, Aobs ← Calculate obstacle distance and angles at which the edges lie;
5:         Vobs ← Calculate obstacle Velocity;
6:         Dimpact ← Calculate distance to impact;
7:         return(FLAGobs, Dobs, Aobs, Vobs, Dimpact)
8:     end if
9: end procedure
2.1 Obstacle Detection
In this procedure (specified in Algorithm 2), the node continuously scans for obstacles, and as soon as there is an obstacle in the detection range of the on-board sensor system, the flag FLAGobs is set to True (Lines 2–3). After this, the parameters of the obstacle are calculated, i.e. the distance to the obstacle (Dobs) and the angle at which the detected obstacle lies (Aobs) (Line 4), as shown in Fig. 1 (the variables are explained in Table 1). Then it is determined whether the obstacle is moving or stationary (Line 5), using Eqs. (1)–(3) as illustrated in Fig. 2. The velocity of the obstacle (Vobs) is computed, and based on the value of Vobs there are three possible scenarios: (1) if Vobs == 0, then the environment is static or the obstacle under observation is stationary; (2) if Vobs is negative, then the obstacle is coming towards the UAV; or (3) if Vobs is positive, then the obstacle is going away from the UAV. Based on the computed velocity of the obstacle, the distance to the potential impact (Dimpact) is calculated (Lines 5–6). These calculations are elaborated in the equations below.
Fig. 1. Distance and direction calculation
Table 1. Description of variables from Fig. 1

Variables       Description
DRi, DLi        Distance of the right and left edges of the obstacle from the leader, follower 1 and follower 2, as shown in Fig. 1
dF1L, dF2L      Distance of the leader from follower 1 and follower 2, respectively
θLOR, θLOL      Angle at which the right and left edges are detected from the leader, respectively
θF1L, θF2L      Angle of the leader from follower 1 and follower 2, respectively
We know that after $t_1$ seconds, the distance travelled by the UAV can be calculated by:

$$d_{UAV} = v \cdot (t_1 - t_0) \qquad (1)$$

Fig. 2. Moving obstacle calculation

Then the distance travelled in the meantime by the obstacle and the velocity of the obstacle are calculated by (2) and (3), respectively:

$$d_{obs} = d_0 - d_{UAV} - d_1 \qquad (2)$$

where $d_0$, $d_1$ are the distances between the UAV and the obstacle at times $t_0$ and $t_1$, respectively.

$$v_{obs} = d_{obs} / \Delta t \qquad (3)$$

In Eq. (2), if $d_{obs} = 0$, the obstacle is stationary. Otherwise the obstacle is moving; in that case, if the distance between the obstacle and the UAV at time $t_1$, i.e., $d_1$, is less than the distance detected at time $t_0$, i.e., $d_0$, then the obstacle is moving towards the UAV (Fig. 2). If $d_1$ is greater than $d_0$, the obstacle is moving away from the UAV. Figure 3(a) shows the point when the obstacle has entered the detection range of the UAV. The obstacle’s
trace of movement is shown as a red dotted line; similarly, the UAVs’ traces of movement are shown as correspondingly coloured lines behind the smaller coloured circles representing three UAVs. Based on the movement of the obstacle, the computational point of impact and the dimensions of the obstacle are calculated, and based on these the Danger Zone (the red circle, the point of impact is the red dot inside it) is defined as shown in Fig. 3(a).
Fig. 3. (a) Point of impact and danger zone as they appear dynamically. (b) Highest level of disturbance illustration
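Equations (1)–(3) translate directly into a few lines of code. The sketch below is illustrative only: the function name `obstacle_motion` and its argument names are hypothetical, and it assumes that Δt in Eq. (3) equals t1 − t0 and that the UAV flies at a constant speed v between the two range readings.

```python
def obstacle_motion(v_uav, t0, t1, d0, d1):
    """Return (d_uav, d_obs, v_obs) from two range readings, per Eqs. (1)-(3).

    v_uav : UAV speed, assumed constant over [t0, t1]
    d0, d1: measured UAV-obstacle distances at times t0 and t1
    """
    dt = t1 - t0
    if dt <= 0:
        raise ValueError("t1 must be later than t0")
    d_uav = v_uav * dt          # Eq. (1): distance covered by the UAV
    d_obs = d0 - d_uav - d1     # Eq. (2): distance attributed to the obstacle
    v_obs = d_obs / dt          # Eq. (3), assuming delta-t = t1 - t0
    return d_uav, d_obs, v_obs


# The range shrinks exactly by the UAV's own travel: stationary obstacle.
stationary = obstacle_motion(v_uav=2.0, t0=0.0, t1=1.0, d0=10.0, d1=8.0)

# The range shrinks by more than the UAV's own travel: the obstacle moved too.
moving = obstacle_motion(v_uav=2.0, t0=0.0, t1=1.0, d0=10.0, d1=7.0)
```

In the first call the second component is zero, which is the stationary case named in the text; in the second it is non-zero, so the obstacle is classified as moving.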
2.2 Collision Avoidance
Collision avoidance in our proposed algorithm consists of continuously defining the next-step formation for the swarm such that the resulting sequence of formations can pass the obstacle. This re-forming of the swarm continues until the swarm reaches the highest formation disturbance, at which it is guaranteed that the swarm can pass the obstacle. The highest formation disturbance is defined as the state in which all the drones have passed the line that is perpendicular to the velocity vector of the swarm (the swarm movement in Fig. 3(b)) and passes through the mass point of the obstacle, see Fig. 3(b). After that, the swarm returns to its original formation via a TPS-based algorithm [21], which is discussed in the re-formation section.¹ In this section we only cover the collision avoidance algorithm up to the highest level of disturbance. The goal here is to determine rules for each drone by which the drones pass the obstacle in a way that minimizes time and energy. To do this, we apply a GA to a cellular automata (CA) model of the swarm. This model is based on dividing the space, 2D or 3D, into identical grid zones, where the size of a grid zone is chosen to encompass one drone together with its safe distance from the borders of the zone. Using this modelling method, a 2D/3D environment can be divided into identical grid zones (cells), where the presence or absence of a drone in a cell sets the state of that cell to one or zero, respectively. Figure 4(a) shows such a model for two drones and one obstacle in a 2D environment. Each cell in this model can only see its neighbours, as in a standard CA [1], and decides its next state based on them. If the cell is occupied by a drone, the next state of the cell is determined by the movement of the drone, which is limited to the cardinal and inter-cardinal directions (Fig. 4(b)). Our GA-based algorithm tries to find, for each cell occupied by a drone (i.e. each black cell), the rule that minimizes the time. The time in this model is the number of steps the swarm needs to reach the highest level of disturbance. The energy in each step, i.e. for each CA rule, can only take three values: 1) ‘0’, when the drone stays in its position; 2) ‘1’, when there is a movement in one cardinal direction; and 3) ‘2’, when there is a movement in one inter-cardinal direction. The overall energy for a drone is the sum of the energy consumed in each iteration from the start of reformation to the highest formation disturbance state.

¹ Even though the highest formation disturbance state might not fully guarantee the absence of collisions in the TPS-based reformation phase, we make this assumption to simplify finding the moment of switching from the GA-based collision avoidance phase to the TPS-based phase that resumes the original formation.
Fig. 4. (a) 2D model of 2 drones and one obstacle. (b) Example of movement according to a rule.
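The per-step energy values and the definition of the highest formation disturbance can be written down compactly. The following is a minimal 2D sketch: the function names are hypothetical, grid moves are given as (dx, dy) offsets of at most one cell, and "passed the line" is assumed to mean a strictly positive projection onto the swarm's velocity vector.

```python
def step_energy(dx, dy):
    """Energy of one CA move: 0 for staying, 1 for a cardinal step,
    2 for an inter-cardinal (diagonal) step."""
    if abs(dx) > 1 or abs(dy) > 1:
        raise ValueError("a CA rule moves a drone by at most one cell")
    return abs(dx) + abs(dy)


def path_energy(moves):
    """Overall energy of one drone: sum of its per-step energies."""
    return sum(step_energy(dx, dy) for dx, dy in moves)


def highest_disturbance_reached(drones, obstacle_centre, swarm_velocity):
    """True when every drone lies beyond the line through the obstacle's
    mass point that is perpendicular to the swarm's velocity vector."""
    ox, oy = obstacle_centre
    vx, vy = swarm_velocity
    # The sign of the dot product tells on which side of the line a drone is.
    return all((x - ox) * vx + (y - oy) * vy > 0 for x, y in drones)


# Swarm flying in +x; obstacle mass point at (4, 0).
energy = path_energy([(0, 0), (1, 0), (1, 1), (-1, -1)])
all_past = highest_disturbance_reached([(5, 1), (6, -1)], (4, 0), (1, 0))
one_behind = highest_disturbance_reached([(5, 1), (3, 0)], (4, 0), (1, 0))
```

The dot-product test avoids constructing the perpendicular line explicitly, so the same check works for any heading of the swarm.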
Algorithm 3. CollisionAvoidance
1: procedure CollisionAvoidance(Dobs, Aobs, Vobs, Dimpact)
2:     DangerZone ← Calculate based on Dimpact and obstacle dimensions;
3:     while DangerZone do
4:         Calculate Escape routes;
5:         Update the TShape using GA;
6:     end while
7:     return(TShape)
8: end procedure
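The GA update in Algorithm 3 can be illustrated with a deliberately simplified, single-drone sketch. It is not the authors' implementation: a candidate solution here is a fixed-length sequence of grid moves rather than a neighbourhood-dependent CA rule, new candidates are produced by mutation only, and all names (`evaluate`, `mutate`, `ga_search`, `line_x`) and parameter values are hypothetical. Candidates are ranked by steps first and energy second.

```python
import random

# Stay in place plus the eight cardinal and inter-cardinal moves.
MOVES = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]


def evaluate(rule, start, obstacle_cells, line_x):
    """Return (steps, energy) for one move sequence; lower is better.

    The drone counts as past the obstacle once its x exceeds line_x.
    Entering an obstacle cell, or never passing, gets an infinite cost.
    """
    x, y = start
    energy = 0
    for step, (dx, dy) in enumerate(rule, start=1):
        x, y = x + dx, y + dy
        energy += abs(dx) + abs(dy)   # 0 stay, 1 cardinal, 2 inter-cardinal
        if (x, y) in obstacle_cells:
            return (float("inf"), float("inf"))
        if x > line_x:
            return (step, energy)
    return (float("inf"), float("inf"))


def mutate(rule, rng):
    """Copy a rule and replace one randomly chosen move."""
    child = list(rule)
    child[rng.randrange(len(child))] = rng.choice(MOVES)
    return child


def ga_search(start, obstacle_cells, line_x, rule_len=8, pop_size=30,
              generations=60, keep=10, seed=0):
    rng = random.Random(seed)

    def score(rule):
        return evaluate(rule, start, obstacle_cells, line_x)

    population = [[rng.choice(MOVES) for _ in range(rule_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score)
        parents = population[:keep]                   # keep the best rules
        children = [mutate(rng.choice(parents), rng)  # mutate good rules
                    for _ in range(pop_size - keep)]
        population = parents + children
    best = min(population, key=score)
    return best, score(best)


# A three-cell wall at x = 2 blocks the straight path of a drone at (0, 0).
obstacle = {(2, -1), (2, 0), (2, 1)}
best_rule, (steps, energy) = ga_search((0, 0), obstacle, line_x=2)
```

Because the best rules are carried over unchanged into each new generation, the best score never gets worse; the search is heuristic, however, so it is not guaranteed to return the shortest or cheapest route.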
The population of potential solutions is generated by applying the following principles:
– generate potential random solutions by defining different rules for each drone;
– calculate the time and energy of each solution, where the target is the highest disturbance formation, i.e., the state when all the drones have passed the obstacle;
– generate new solutions by mutating the good rules obtained for each drone.
These routines are integrated with the collision avoidance developed and presented in [24], in such a way that the translated TShape destination of each node of the swarm is checked at each time interval. If there is an obstacle in the path of a node, the TShape destination may not be the same as the original destination. Therefore, the TShape destination is calculated by using a fixed grid around the danger zone in order to restrict the GA from generating an unbounded number of candidate solutions. Afterwards, at each iteration, the way to reach the TShape destination is optimized by point set registration.

2.3 Re-formation
Observing and avoiding an obstacle in most cases disturbs the formation of the swarm until the highest disturbance formation is reached, after which the swarm must be restored to its initial formation state. This raises a formation construction problem that is widely covered in the literature. In our case, however, the re-formation algorithm, or in other words the disturbance rejection of the swarm, must be compatible with our obstacle detection and collision avoidance algorithm, whose main target is to reduce the overall settling time and energy of the system, i.e., to bring the disturbed centroid back to its intended state in the TShape. It is worth mentioning that, when resuming the formation, the initial neighbouring relations among the drones need not be kept, since all the drones in the formation are considered identical nodes. Furthermore, since there is no dedicated leader and the leader is selected only according to the situation at hand, the re-formation process is smoother, with no pauses or unnecessary waiting times for the nodes. For example, if drone 1 has drones 2 and 3 as its two neighbors in the original state, this need not hold after the formation has been reconstructed and resumed. Likewise, as shown in the next section (Fig. 6), the leader before the swarm gets disturbed and the leader after the reformation process are not the same, as the leader was re-elected dynamically. In the process of returning from the disturbed state of the swarm formation, referred to as the scene in this section, to the initial formation state, i.e., the TShape model, two main questions arise: 1) what is the optimal alignment or mapping of the identical nodes in the disturbed formation of the swarm, i.e., the scene, to the nodes in the initial formation, i.e., the TShape model; and 2) what is the optimal trajectory for each node in the scene to reach its corresponding node in the TShape model?
For the first problem, we adopt a well-known idea in point set registration [9,16] that is based on the thin-plate spline (TPS) technique used in data interpolation and smoothing [10]. After determining the mapping strategy, for the second problem, we use the shortest path scheme when applying the proposed collision avoidance approach. In the following, we first explain the concept of thin-plate splines and then propose an algorithm based on it. A piecewise function defined by polynomials is known as a spline. Because of their simple construction, splines make it easy to approximate complicated shapes via curve fitting [10]. For simplicity, we discuss the algorithm in its 2-dimensional formulation and assume two corresponding point sets, $X = \{x_i\}$ and $V = \{v_i\}$, $i = 1, 2, \ldots, n$, where $x_i = (1, x_{ix}, x_{iy})$ and $v_i = (1, v_{ix}, v_{iy})$ are the coordinate representations of the location of a point in the scene and in the model, respectively. Taking the shape of the disturbed swarm into account, a mapping function $f(v_i)$ that fits the corresponding point sets X and V can be obtained by minimizing the following:

\[
E_{TPS}(f) = \sum_{i=1}^{n} \lVert x_i - f(v_i) \rVert^2 + \lambda \iint \left[ \left( \frac{\partial^2 f}{\partial x^2} \right)^2 + 2 \left( \frac{\partial^2 f}{\partial x\, \partial y} \right)^2 + \left( \frac{\partial^2 f}{\partial y^2} \right)^2 \right] dx\, dy \tag{4}
\]
where $E_{TPS}$ is the energy function, which is taken as the measure of the amount of formation disturbance. The integral part of the equation represents how one point set is mapped to the corresponding point set while keeping the intended formation under consideration, and the factor $\lambda$ scales this term. If the intention is to map one point set onto the other without considering the shape of the disturbed swarm, $\lambda$ should be set to zero; the closest points are then mapped to each other without keeping the shape under consideration. In this situation, the disturbance, i.e., $E_{TPS}$, is simply

\[
E_{TPS}(f) = \sum_{i=1}^{n} \lVert x_i - f(v_i) \rVert^2 \tag{5}
\]
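For the $\lambda = 0$ case, the minimization reduces to choosing which scene node goes to which model slot so that the sum of squared distances is smallest. The sketch below does this by brute force over all permutations, which is only practical for a handful of drones. The function names and the coordinates are hypothetical, and treating $f$ as a pure one-to-one assignment of identical nodes is an assumption made for illustration.

```python
from itertools import permutations


def sq_dist(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def best_assignment(scene, model):
    """Return (perm, cost) where scene[i] is assigned to model[perm[i]]
    and cost is the sum of squared distances for that assignment."""
    n = len(scene)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(sq_dist(scene[i], model[perm[i]]) for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# Hypothetical V-shaped model: leader slot, left slot, right slot.
model = [(0.0, 2.0), (-1.0, 0.0), (1.0, 0.0)]
# Hypothetical disturbed scene: UAV1 fell back to the left, UAV2 got ahead.
scene = [(-1.5, 0.5), (0.2, 2.1), (1.1, 0.3)]

perm, cost = best_assignment(scene, model)
print(perm, round(cost, 2))
```

In this made-up scene the cheapest assignment sends the second drone to the leader slot and the first drone to the left slot, which mirrors the leader change discussed in the text. For larger swarms the factorial search would have to be replaced by a polynomial-time assignment method.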
Minimization of such a temperature function determines the mapping from the disturbed swarm in the highest disturbance formation, i.e., the scene, to the original shape after the obstacle. This mapping also determines the new leader. After the mapping function has been calculated, each drone in the scene follows the shortest path to its hypothetical location in the model. Since the model in Eq. 4 is hypothetical and uncertain events might affect the reformation process, a relative run-time measurement is needed to continuously ensure that the swarm is approaching its formation. This metric is based on the hypothetical center of the swarm, which can be calculated from the instantaneous locations of the drones in the swarm. The continuous error in the formation, which should be dynamically observed and reduced, is calculated by summation of the deviations of the distance of each drone from the center w.r.t. the golden formation model, and is denoted by $d_{rms}$. As an example, Fig. 5 shows the centroid point for three drones in the golden formation model (left side) and in the disturbed model (right side). Based on this, the deviation of the distance of each drone from the center w.r.t. the golden formation model is as follows:

\[
\Delta d_1 = d_c - d_1, \qquad \Delta d_2 = d_c - d_2, \qquad \Delta d_3 = d_c - d_3 \tag{6}
\]

and

\[
d_{rms} = \sqrt{\Delta d_1^2 + \Delta d_2^2 + \Delta d_3^2} \tag{7}
\]
Fig. 5. Centroid of the swarm
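A minimal sketch of this run-time metric is given below. It assumes that the reference distance of each drone is its own distance to the centroid in the golden formation, and that $d_{rms}$ is the square root of the sum of squared deviations; the helper names and the coordinates are hypothetical.

```python
import math


def centroid(points):
    """Arithmetic mean of a list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def d_rms(golden, current):
    """Formation error between the golden and the current formation.

    golden[i] and current[i] must refer to the same formation slot.
    Each deviation is (distance to centroid in the golden formation)
    minus (distance to centroid in the current formation).
    """
    cg, cc = centroid(golden), centroid(current)
    deviations = [math.dist(g, cg) - math.dist(c, cc)
                  for g, c in zip(golden, current)]
    return math.sqrt(sum(d * d for d in deviations))


golden = [(0.0, 2.0), (-1.0, 0.0), (1.0, 0.0)]
disturbed = [(0.0, 2.0), (-3.0, 0.0), (1.0, 0.0)]
print(d_rms(golden, golden), d_rms(golden, disturbed))
```

Because only distances to the centroid enter the metric, translating the whole swarm leaves the value unchanged, so the metric tracks the shape of the formation rather than its position.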
The measurement of $d_{rms}$ is a figure of merit: it determines how much the current formation has been distorted from its original/predefined formation. Driving $d_{rms}$ to zero, i.e., $d_{rms} \to 0$, therefore restores the formation optimally. The reformation process starts by calculating the centroid of the swarm, as shown in Fig. 5. These values are then fed to the point set registration in order to calculate the optimal solution for bringing the UAVs back to the desired formation while bringing the centroid to the final destination as quickly as possible, as can be seen in the next section, i.e., Simulation and Results. During the reformation the leader of the swarm changes dynamically, and the previous leader (UAV1/Blue UAV) goes to the position of the current leader (UAV2/Green UAV). The reason is that UAV1 slows down while it deviates from its current trajectory to avoid colliding with the obstacle. In the meantime, UAV2, which continues on its path, becomes a more likely candidate for taking the position of UAV1 than for slowing down to wait for it. Therefore, UAV2 moves to the location of UAV1 and, as soon as UAV1 has successfully avoided the collision, UAV1 moves to the previous location of UAV2.
3 Simulation and Results
Fig. 6. Different time intervals from spawning to when the obstacle comes in detection range. (a) When the UAVs are spawned and obstacle is moving towards the swarm. (b) Obstacle is in detection range of the UAV, and point of impact is calculated and shown, UAV1 is deviating from its original path in order to avoid the danger zone. (c) Bypassing the obstacle. (d) bypassing the obstacle 2. (e) Leader changed while bypassing the obstacle.
The initial conditions/assumptions for our work are defined as follows:
1. there is no explicit unique leader; the leader of the swarm changes dynamically according to the situation, i.e., the leadership is a temporary role.
2. UAVs accelerate or decelerate as needed.
3. UAVs obtain their own position vectors using on-board localization techniques.
4. the communication channel is ideal, i.e., lossless.
5. an obstacle can be stationary or moving towards or away from the swarm with unknown velocity.
6. for visualization purposes and to avoid overlapping, only the detection range circle of the leader is shown.

Fig. 7. Simulation snapshots at equal time intervals, i.e., 0%, 25%, 50%, 75%, 100%. (a) When the UAVs are spawned and obstacle is moving towards the swarm. (b) Obstacle is in detection range of the UAV, and point of impact is calculated and shown, UAV1 is deviating from its original path in order to avoid the danger zone. (c) Bypassing the obstacle. (d) Notice the change of leader while bypassing the obstacle. (e) Leader changed while bypassing the obstacle.

The UAVs are spawned near the defined V-shaped formation (Fig. 6). The current leader, UAV1 (blue), starts moving towards the destination and the other UAVs start moving towards their positions in the formation. An obstacle is also moving towards the swarm, but at this instant it is outside the detection range of the on-board sensor system of the UAVs, as shown in Fig. 6(a). In Fig. 6(b), the obstacle is already in the detection range and the Point of Impact and the Danger Zone have been computed, as explained in Algorithm 2.
Fig. 8. Swarm movement from start to destination. Navigational traces using: (a) Proposed approach (b) Dedicated leader
Figure 6(c) shows the escape route chosen by the collision avoidance module: UAV1 deviates to its right and UAV3 slows down to allow UAV1 to stay on the chosen route, as explained in Algorithm 3. Figure 6(d) and (e) show the reformation process using CPSR. Since UAV1 had to slow down and deviate to avoid the Danger Zone, UAV2, which continued on the same path with the same velocity, gets ahead of the rest and is dynamically declared the leader instead of waiting for UAV1 to get back into its formation position. This is done to minimize the time of arrival of the centroid at the destination. Similarly, the optimal reformation requires UAV1 to go to UAV2’s place, while UAV3 just speeds up to catch up with its position in the formation, as explained in Sect. 2.3. Figure 8(a) shows the movement of the swarm from the starting point until it reaches the destination. For comparison, Fig. 8(b) shows the behaviour of the swarm if point set registration is used with an explicit unique leader and without balancing the centroid of the swarm. The graph in Fig. 9(a) shows the overall trend of the distances maintained by the UAVs throughout the course, where D31 is the distance between UAV1 and UAV3, D21 is the distance between UAV1 and UAV2, and D32 is the distance between UAV2 and UAV3. The obstacle is detected at t = 30 s, and collision avoidance is enforced, which distorts the formation (Fig. 6). In order to test the scalability of the proposed algorithm, the number of nodes in the swarm was increased to make a two-layered V-shaped formation, as shown in Fig. 7. The small overshoots in the movements of the drones in the figure can be reduced by integrating a more stable speed controller into the algorithm. Figure 9(b) shows the change in the temperature of the system, i.e., the swarm, from the start until the destination is reached. At t = 30 s, the disturbance/change in the temperature of the system shows the obstacle detection. The other significant disturbance, at t = 75 s, is due to the leader change; from there on the swarm gradually reshapes itself into the target shape defined by TShape.
Fig. 9. (a) Distance maintained by drones with each other. (b) Change in temperature of the system as a whole
Figure 10 shows the time taken for the swarm to reach its destination in four different scenarios: “No obstacle”, i.e., the swarm is not disturbed at all and there are no deviations from the path, with t0 = 166 s; “Unique”, i.e., there is an obstacle in the path and the swarm has a unique dedicated leader, with t1 = 213 s; and finally two cases using our proposed approach, “3-CPSR” (a swarm with 3 nodes, as in the previous cases) and “8-CPSR” (with 8 nodes), where the time to finish was t2 = 181 s and t3 = 198 s, respectively. This shows that dynamically re-forming the swarm and changing the leader at run-time whenever the situation requires reduces the mission time considerably. The reason is that in the CPSR approach the swarm does not stop at any moment but keeps progressing towards the destination, with each UAV deviating to avoid a collision when needed and accelerating afterwards to reach its position, defined by TShape, to maintain the formation. In the fixed leader case (“Unique”), on the other hand, the swarm slows down and waits for the leader to resume its position at the front before continuing towards the destination. For the three-drone formation (Fig. 6), the swarm needed 56 s from the obstacle detection to return to the initial formation, whereas in the eight-drone case (Fig. 7) this took 85 s. It is evident from the experiments that in the latter case, due to a much bigger obstacle, the drones had to deviate more than in the former case. However, this did not affect the overall mission time very much, as UAV2, which became the new leader, did not have to deviate much from its path or reduce its speed significantly.
[Figure: bar chart of mission completion time, vertical axis Time (s): No obstacle 166, 3-CPSR 181, 8-CPSR 198, Unique 213]
Fig. 10. Time for mission completion for different approaches
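The mission times reported for Fig. 10 can be put on a common scale by expressing each one as an overhead relative to the undisturbed run. The short sketch below uses only the four times given in the text; the choice of the "No obstacle" run as the baseline is an assumption made for this illustration.

```python
# Mission completion times in seconds, as reported in the text.
times = {"No obstacle": 166, "3-CPSR": 181, "8-CPSR": 198, "Unique": 213}

baseline = times["No obstacle"]

# Absolute and relative overhead with respect to the undisturbed run.
overhead = {
    name: (t - baseline, 100.0 * (t - baseline) / baseline)
    for name, t in times.items()
}

for name, (extra_s, extra_pct) in overhead.items():
    print(f"{name:12s} +{extra_s:3d} s  (+{extra_pct:.1f} %)")
```

With these numbers the fixed-leader case costs 47 s more than the undisturbed run (about 28.3 %), whereas the three-node CPSR case costs 15 s more (about 9.0 %) and the eight-node CPSR case 32 s more (about 19.3 %).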
4 Conclusion
In this paper, we developed a novel approach for collision avoidance and formation maintenance in a swarm of drones in dynamic environments. The proposed method utilizes a genetic algorithm inspired scheme in its collision avoidance part and point set registration in its formation maintenance part. In this approach, the swarm does not have a uniquely determined leader, and formation maintenance is accomplished by stabilizing the centroid of the swarm. The behaviour of the algorithm was theoretically analysed and tested in a simulation environment. The simulation results provide evidence that the method works in a near-optimal manner in a dynamic environment, where an obstacle continues moving along its detected trajectory. We tested the efficiency of the proposed algorithm by comparing it with corresponding algorithms that assume the existence of an explicit/unique leader. It was demonstrated that the ability to re-elect the leader dynamically, if required, gets the mission completed more quickly, i.e., it saves time and consequently energy by sparing the swarm from waiting for the leader to get back into its defined position in the formation. In our future work, we plan to extend the proposed approach by examining other environmental effects, such as air drag, on layers of drones, e.g., in a two-layered or multi-layered V-shaped formation. This can help to optimize resource management in the swarm by dynamically swapping the outer layer with the inner layer in order to minimize the effect of air drag and maximize the flight time on a single charge. We will also consider more complex scenarios with several simultaneous obstacles and more versatile movement of obstacles.

Acknowledgment. This work has been supported in part by the Academy of Finland-funded research project 314048 and Nokia Foundation (Award No. 20200147).
References
1. Wolfram, S.: A New Kind of Science: Online Table of Contents. www.wolframscience.com
2. Alander, J.T.: On finding the optimal genetic algorithms for robot control problems. In: Proceedings IROS 1991: IEEE/RSJ International Workshop on Intelligent Robots and Systems 1991, vol. 3, pp. 1313–1318, November 1991
3. Balch, T., Arkin, R.C.: Behavior-based formation control for multirobot teams. IEEE Trans. Robot. Autom. 14(6), 926–939 (1998)
4. Bansal, J.C., Singh, P.K., Pal, N.R.: Evolutionary and Swarm Intelligence Algorithms. Springer, Cham (2019). https://doi.org/10.1007/978-3-319-91341-4
5. Campion, M., Ranganathan, P., Faruque, S.: A review and future directions of UAV swarm communication architectures. In: 2018 IEEE International Conference on Electro/Information Technology (EIT), pp. 0903–0908, May 2018
6. Dong, L., Chen, Y., Qu, X.: Formation control strategy for nonholonomic intelligent vehicles based on virtual structure and consensus approach. Proc. Eng. 137, 415–424 (2016). Green Intelligent Transportation System and Safety
7. Dorigo, M., Roosevelt, A.F.D.: Swarm robotics. In: Special Issue, Autonomous Robots. Citeseer (2004)
8. Gad, A.: Introduction to Optimization with Genetic Algorithm, July 2018
9. Guo, P., Hu, W., Ren, H., Zhang, Y.: PCAOT: a Manhattan point cloud registration method towards large rotation and small overlap. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7912–7917, October 2018
10. Chui, H., Rangarajan, A.: A new algorithm for non-rigid point matching. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2000 (Cat. No. PR00662), vol. 2, pp. 44–51, June 2000
11. Hamann, H.: Introduction to Swarm Robotics, pp. 1–32. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-74528-2_1
12. Han, Q., Li, T., Sun, S., Villarrubia, G., de la Prieta, F.: “1-n” leader-follower formation control of multiple agents based on bearing-only observation. In: Demazeau, Y., Decker, K.S., Bajo Pérez, J., de la Prieta, F. (eds.) Advances in Practical Applications of Agents, Multi-Agent Systems, and Sustainability: The PAAMS Collection, pp. 120–130. Springer, Cham (2015)
13. He, L., Bai, P., Liang, X., Zhang, J., Wang, W.: Feedback formation control of UAV swarm with multiple implicit leaders. Aerosp. Sci. Technol. 72, 327–334 (2018)
14. Lawton, J.R.T., Beard, R.W., Young, B.J.: A decentralized approach to formation maneuvers. IEEE Trans. Robot. Autom. 19(6), 933–941 (2003)
15. Li, N.H.M., Liu, H.H.T.: Formation UAV flight control using virtual structure and motion synchronization. In: 2008 American Control Conference, pp. 1782–1787, June 2008
16. Myronenko, A., Song, X.B.: Point-set registration: coherent point drift. CoRR, abs/0905.2635 (2009)
17. Oh, K.-K., Park, M.-C., Ahn, H.-S.: A survey of multi-agent formation control. Automatica 53, 424–440 (2015)
18. Vicencio, K., Davis, B., Gentilini, I.: Multi-goal path planning based on the generalized traveling salesman problem with neighborhoods. In: 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2985–2990, September 2014
19. Yasin, J.N., et al.: Night vision obstacle detection and avoidance based on bioinspired vision sensors. In: 2020 IEEE Sensors, pp. 1–4 (2020)
20. Yasin, J.N., Mohamed, S.A.S., Haghbayan, M.-H., Heikkonen, J., Tenhunen, H., Plosila, J.: Unmanned aerial vehicles (UAVs): collision avoidance systems and approaches. IEEE Access 8, 105139–105155 (2020)
21. Yasin, J.N., et al.: Energy-efficient formation morphing for collision avoidance in a swarm of drones. IEEE Access 8, 170681–170695 (2020)
22. Yasin, J.N., Mohamed, S.A.S., Haghbayan, M.-H., Heikkonen, J., Tenhunen, H., Plosila, J.: Navigation of autonomous swarm of drones using translational coordinates. In: Demazeau, Y., Holvoet, T., Corchado, J.M., Costantini, S. (eds.) Advances in Practical Applications of Agents, Multi-Agent Systems, and Trustworthiness. The PAAMS Collection, pp. 353–362. Springer, Cham (2020)
23. Yasin, J.N., et al.: Dynamic formation reshaping based on point set registration in a swarm of drones (2020)
24. Yasin, J.N., Haghbayan, M.-H., Heikkonen, J., Tenhunen, H., Plosila, J.: Formation maintenance and collision avoidance in a swarm of drones. In: Proceedings of the 2019 3rd International Symposium on Computer Science and Intelligent Control, ISCSIC 2019. Association for Computing Machinery, New York (2019)
25. Zhuge, C., Cai, Y., Tang, Z.: A novel dynamic obstacle avoidance algorithm based on collision time histogram. Chin. J. Electron. 26(3), 522–529 (2017)
The Simulation with New Opinion Dynamics Using Five Adopter Categories
Makoto Fujii and Akira Ishii
Tottori University, Koyama, Tottori, Japan
[email protected], [email protected]
Abstract. The purpose of this paper is to interpret the diffusion of innovation (the transfer of opinions to the adoption category) through a simulation of opinion dynamics with the five adopter categories set as agents, and to provide a computational social science method useful for marketing and mass media research. In the simulation, we observed how the spread of innovation is affected by manipulating variables such as the Initial Distribution of Opinions, the Confidence Coefficient between agents, the Mass Media Effects, and the Network Connection Probabilities of the random network. Simulation results show that when the media has a uniform impact on the market, the distribution of people’s opinions is skewed in the direction in which the media leads. We also observed that manipulating the initial values of the opinions of the initial adopters, the reliability coefficient, and the connection probability between the nodes of the random network affects the market and, in turn, the spread of innovation.

Keywords: Opinion dynamics · Diffusion of Innovations · The five adopter categories · Simulation
1 Introduction
The dissemination of information has traditionally been the role of the mass media, but with the development of SNS, consumers have also come to share that role. Elucidating the mechanism of opinion transition (information diffusion and consensus building) in consumers' adoption of innovations on the basis of opinion dynamics theory is expected to offer many suggestions for today's complicated marketing communications. Opinion dynamics is a theory that analyzes how the opinions of many people converge as they exchange opinions with one another. It is one of the research themes of social physics and has been studied from various aspects as a theory for analyzing the process of social consensus building [1–6]. In social science, on the other hand, research represented by Rogers’s “Diffusion of Innovation” [7] addresses the process by which innovations that people perceive as new are disseminated. Opinion dynamics and diffusion-of-innovation research are considered to have a high affinity because both involve the elements of objects, exchange of opinions, communication channels, and the passage of time.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 409–424, 2022. https://doi.org/10.1007/978-3-030-82193-7_27
Therefore, the purpose of this research is to interpret the spread of innovation through a simulation of opinion dynamics with the five adopter categories set as agents, and to provide a computational social science method useful for marketing and mass media research. The rest of this article is organized as follows. In the next section, we discuss typical models of opinion dynamics and the new model of opinion dynamics used in this paper, which also handles negative confidence coefficients. We then consider the diffusion of innovation and the five adopter categories used as agents. In Sect. 4, simulations using the proposed model are performed, in which we manipulate the Initial Distribution of Opinions, the Confidence Coefficient, the Mass Media Effects, and the Network Connection Probabilities that influence the diffusion of innovation. Section 5 concludes the paper after considering, on the basis of the simulation results, the conditions that are effective for the spread of innovation.
2 Theory
2.1 Opinion Dynamics
There are two main types of opinion dynamics models. One is a mathematical model in which people’s opinions take one of two discrete values (1 and 0, or 1 and −1), and the other is a model in which people’s opinions are quantified and distributed continuously. Discrete binary theory can be applied to cases such as the French and American presidential elections [8] and referendums (as seen in Brexit) [9]. Typical mathematical models of binary theory include the Voter model [10–12] proposed by Granovetter, Galam’s theory of magnetic physics [13], and the local majority decision model [8, 9]. Many extensions have been proposed for the Voter model, but its basic purpose is to describe a choice between two options (adopted/not adopted), as represented by voting behavior in elections. In the standard Voter model, an opinion takes the value $s = \pm 1$. The opinion of i is $S_i$ and the opinion of j is $S_j$. Let the total number of people be N and the total set of opinions be $S = \{S_i\}$. Also, let $S^k$ be the set of opinions that differs from S only in the value of $S_k$. Then, if the probability that the overall opinion distribution is S at time t is P(S, t), its time evolution is given by the following Eq. (1).

\[
\frac{d}{dt} P(S, t) = \sum_{k} \left[ W_k(S^k)\, P(S^k, t) - W_k(S)\, P(S, t) \right] \tag{1}
\]

On the other hand, the mathematical model of opinion dynamics theory that expresses opinions with continuous values from 0 to 1 is called the Bounded Confidence Model [14–16], and it is premised on people finding a compromise or consensus by exchanging opinions. The Deffuant-Weisbuch Model [17, 18] and the Hegselmann-Krause Model [19] are known as the major models of the Bounded Confidence Model. The basic idea of the Bounded Confidence Model [14–16] is that the individual i is influenced by the opinions of those around him or her, so that his or her own opinion changes. The Hegselmann-Krause Model is defined below (2).

\[
x_i(t + 1) = \sum_{j=1}^{N} D_{ij}\, x_j(t) \tag{2}
\]
The opinion $x_i$ is bounded by $0 \le x_i(t) \le 1$. Here, assuming N persons, the opinion of individual i at time t is written as $x_i(t)$, where $1 \le i \le N$. The coefficient $D_{ij}$ takes various positive real values for the combinations i, j among the N persons, and $D_{ij} = 0$ means that individual i ignores the opinion of j. Since the opinion is bounded by $0 \le x_i(t) \le 1$, this model does not assume that the coefficient $D_{ij}$ takes a negative value. Opinions take continuous values from 0 to 1 (1 = agree, 0 = disagree), and there is no assumption of disagreement. In other words, consensus building is implicitly assumed. In the real world, however, we know that consensus building can sometimes be difficult. Therefore, Ishii-Kawahata [20, 21] extended the Hegselmann-Krause Model [19] to deal with problems for which consensus is difficult to form, and developed a model that includes lack of trust among people in opinion dynamics theory. Ishii et al. reinterpret the coefficient $D_{ij}$ as a confidence coefficient and assume that $D_{ij} > 0$ if there is a trust relationship between i and j, and $D_{ij} < 0$ if there is a distrust relationship. Moreover, they assume that people ignore opinions that are far from their own, neither agreeing with nor repelling them, and that people are not particularly affected by opinions that are remarkably close to their own. To include these two processes, Ishii uses the following functions (3) and (4) instead of the $D_{ij} x_j(t)$ term used in the Hegselmann-Krause Model [21].

\[
D_{ij}\, \Phi(I_i, I_j) \left( I_j(t) - I_i(t) \right) \tag{3}
\]

where

\[
\Phi(I_i, I_j) = \frac{1}{1 + \exp\left( \beta \left( \lvert I_i - I_j \rvert - b \right) \right)} \tag{4}
\]

Furthermore, the models proposed by Ishii [21] and Ishii-Okano [22] consider the influence of the mass media and the forgetting of topics. In addition to human contact (information exchange), information provided by the mass media influences the formation of public opinion and the opinions of individuals, and is therefore an important term in opinion dynamics theory. Moreover, as time passes, the matter in question becomes old and people's interest in it decreases, and it is important for opinion dynamics theory to take this effect into account. Regarding the mass media effect, A(t) is the external pressure at time t, and the difference in the reaction of each agent is represented by the coefficient $c_i$. The coefficient $c_i$ can have a different value for each agent and can be positive or negative. If $c_i$ is positive, person i moves his or her opinion toward the direction of the mass media. Conversely, when $c_i$ is negative, the opinion of the person changes against the direction of the media. Forgetting is handled by introducing an exponential decay term. Including these two effects, the change in the opinion of an agent can be expressed as follows (5).

\[
\Delta I_i(t) = -\alpha I_i(t)\, \Delta t + c_i A(t)\, \Delta t + \sum_{j=1}^{N} D_{ij}\, \Phi\left( I_i(t), I_j(t) \right) \left( I_j(t) - I_i(t) \right) \Delta t \tag{5}
\]
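One explicit time step of Eq. (5) can be sketched as follows. This is a minimal illustration rather than the authors' simulation code: the function names, the default values of `beta` and `b`, and the use of a dense confidence matrix without a network adjacency are assumptions.

```python
import math


def phi(Ii, Ij, beta=1.0, b=5.0):
    """Sigmoid cut-off of Eq. (4): close to 1 for nearby opinions and
    close to 0 once |Ii - Ij| exceeds b. beta and b are assumed values."""
    return 1.0 / (1.0 + math.exp(beta * (abs(Ii - Ij) - b)))


def step(I, D, c, A, alpha, dt, beta=1.0, b=5.0):
    """Advance all opinions by one time step dt according to Eq. (5).

    I     : list of current opinions I_i(t)
    D     : confidence matrix, D[i][j] > 0 trust, < 0 distrust
    c     : per-agent sensitivity to the mass media
    A     : mass media pressure A(t) at this step
    alpha : forgetting (decay) rate
    """
    n = len(I)
    new_I = []
    for i in range(n):
        interaction = sum(
            D[i][j] * phi(I[i], I[j], beta, b) * (I[j] - I[i])
            for j in range(n)
        )
        dI = (-alpha * I[i] + c[i] * A + interaction) * dt
        new_I.append(I[i] + dI)
    return new_I


opinions = [1.0, -1.0]
trusting = step(opinions, [[0, 1], [1, 0]], [0, 0], A=0.0, alpha=0.0, dt=0.1)
distrusting = step(opinions, [[0, -1], [-1, 0]], [0, 0], A=0.0, alpha=0.0, dt=0.1)
print(trusting, distrusting)
```

In this two-agent example, positive confidence coefficients pull the two opinions together, while negative ones push them further apart, which is the behaviour that the sign of $D_{ij}$ is meant to capture.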
Based on this model, Fujii-Ishii [23] replaced consent with adoption and distrust with rejection, and added agents in the middle of the simulation to observe the transition of opinions. With 300 agents in the entire simulation, the consumers who exist in the market from the beginning are denoted $N_{ini}$ and the consumers who enter the simulation later are denoted $N_{user}$, that is, $N = 300 = N_{ini} + N_{user}$. The later entry of consumers into the market (simulation) is defined as follows (6).

\[
N_{user} = \frac{(N - N_{ini}) (\tanh x + 1)}{2} \tag{6}
\]
This study showed that when the mass media had a uniform impact on the market, the distribution of people’s opinions was biased in the media-led direction. It was confirmed that the mass media first affects $N_{ini}$ and then $N_{user}$. It was also shown that the stronger the connection between people, the easier it is to be influenced by others. In other words, for consumers who enter the market later to adopt (i.e., to hold positive opinions), there need to be pre-existing positive consumers, mass media that encourage adoption, and a dense network of people.

2.2 Diffusion of Innovations
According to Rogers, an innovation is an idea, practice, or object that is perceived as new by an individual or other unit of adoption. Its diffusion is the process by which innovations are transmitted between members of the social system over time through a communication channel [7]. In the marketing context, smartphones (iPhone) in the mobile phone market and the Greek yogurt Chobani in the yogurt market are good examples of innovations. The essence of the diffusion process is information exchange, through which one person conveys new ideas to others. The mass media are the fastest and most efficient way to inform potential adopters that an innovation exists, while face-to-face information exchange is effective in persuading people to accept new ideas. In the choice between adoption and rejection of innovations, the superiority or inferiority of the innovation and the demand for it have a great influence on its spread, but the superiority or inferiority of the innovation itself is not discussed in this paper. We focus on the diffusion of innovations utilizing opinion dynamics. In the previous research, the agents were divided into $N_{ini}$ and $N_{user}$ for the simulation, whereas in the Diffusion of Innovations, Rogers classifies consumers into five categories based on innovativeness.
Therefore, in this study, agents are classified into five categories, and the transition of innovation propagation (adoption and rejection) is considered from a marketing perspective. Adopter categorization distinguishes between members of the social system on the basis of their innovativeness. Innovativeness is the degree to which an individual adopts new ideas and products relatively earlier than other members of the social system [7]. The five adopter categories are Innovators, Early Adopters, Early Majority, Late Majority, and Laggards (Fig. 1).
Fig. 1. Adopter categorization. Modified from “Diffusion of Innovations” (2003).
Innovators are adventurous and act as gatekeepers of social systems. In other words, they are the group that adopts innovations earliest among the five categories. Early Adopters are opinion leaders who provide information about innovations to potential users. The Early Majority are cautious when adopting innovations, but they are an important category that acts as a bridge in the dissemination process and as a mutual liaison in interpersonal networks. The Late Majority are said to be skeptical of innovations and adopt them only after the adoption rate exceeds 50%. For the Late Majority, pressure within the group is needed to motivate the adoption of innovations. Laggards are the last category of the social system to adopt innovations, i.e., the category that takes the longest time to adopt an innovation.
3 Modeling
In this study, tanh is also used to drop agents into simulations, but the timing difference when each adopter category adopts innovations is considered as follows (Fig. 2).
Innovators: $N_{inn} = \frac{\tanh(x) + 1}{2}$ (7)
Early Adopters: $N_{ea} = \frac{\tanh(x - 3) + 1}{2}$ (8)
Early Majority: $N_{em} = \frac{\tanh(x - 6) + 1}{2}$ (9)
M. Fujii and A. Ishii
Late Majority: $N_{lm} = \frac{\tanh(x - 9) + 1}{2}$ (10)
Laggards: $N_{lg} = \frac{\tanh(x - 12) + 1}{2}$ (11)
[Plot for Fig. 2: curves Ninn, Nea, Nem, Nlm, and Nlg; horizontal axis x from −3.00 to 14.94, vertical axis from 0.00 to 1.00.]
Fig. 2. Innovation adoption timing for each adopter category
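Equations (7) to (11) can be evaluated directly to see when each category enters. The following is a minimal sketch rather than the authors' code: the function names are hypothetical, and scaling the tanh fraction by the category sizes quoted later in the text is an assumption made only for illustration.

```python
import math

# Shift of each tanh curve along x, read off Eqs. (7)-(11).
SHIFTS = {"Ninn": 0.0, "Nea": 3.0, "Nem": 6.0, "Nlm": 9.0, "Nlg": 12.0}

# Category sizes for N = 1,000 as given in the text.
SIZES = {"Ninn": 25, "Nea": 135, "Nem": 340, "Nlm": 340, "Nlg": 160}


def entered_fraction(x, shift):
    """Fraction of a category present at time x: (tanh(x - shift) + 1) / 2."""
    return (math.tanh(x - shift) + 1.0) / 2.0


def entered_counts(x, sizes=SIZES):
    """Assumed scaling: agents per category at time x, rounded down."""
    return {
        name: int(size * entered_fraction(x, SHIFTS[name]))
        for name, size in sizes.items()
    }


if __name__ == "__main__":
    # Each category is half-entered exactly at its own shift.
    for x in (0.0, 3.0, 6.0, 9.0, 12.0):
        print(x, entered_counts(x))
```

Each curve crosses 0.5 at its own shift, so the categories enter in the order Innovators, Early Adopters, Early Majority, Late Majority, Laggards, with neighbouring curves three units of x apart.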
Based on the Ishii (2019) model, agents are classified into five adopter categories, and the initial distribution of opinions, the confidence coefficient, the media coefficient, and the connection probability of the random network are manipulated to calculate the transition of innovation propagation. As the first calculation example, Fig. 3 shows the result for 1,000 people (Ninn = 25, Nea = 135, Nem = 340, Nlm = 340, Nlg = 160). The initial opinion distribution of the Innovators is ±30, and the network of people is a random network with a link connection probability of 50%. The Dij values among the 1,000 people are drawn from a uniform random distribution between −1 and 1. As the figure shows, opinions are both positive and negative and are widely scattered. Because the mass media effect A(t) is set to 0 in this calculation, the opinion distribution is spread over positive and negative values without bias, and across the 1,000 people the opinions are roughly balanced as a whole. Although the calculation starts from a uniform distribution of opinions, and small differences spread out and balance to some extent, the final distribution of opinions is not uniform. We can find several opinion groups from the
calculated distribution. It can also be confirmed that both the adoption curve and the rejection curve draw the S-shaped curve. In this paper, a simulation is executed based on this calculation result.
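The setup described above (category sizes, random network, random confidence coefficients) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the network is assumed undirected, `D[i][j]` is assumed to mean the trust of agent i toward agent j, and the opinion update rule of the Ishii model is deliberately left out because it is not restated in this section.

```python
import random

# Category shares as stated in the text (2.5%, 13.5%, 34%, 34%, 16%).
SHARES = [("Ninn", 0.025), ("Nea", 0.135), ("Nem", 0.34),
          ("Nlm", 0.34), ("Nlg", 0.16)]


def build_population(n_agents, shares=SHARES):
    """Return one category label per agent, in category order."""
    labels = []
    for name, share in shares:
        labels.extend([name] * round(n_agents * share))
    return labels


def build_network(n_agents, link_prob, rng):
    """Assumed undirected random network: each pair linked with link_prob."""
    linked = [[False] * n_agents for _ in range(n_agents)]
    for i in range(n_agents):
        for j in range(i + 1, n_agents):
            if rng.random() < link_prob:
                linked[i][j] = linked[j][i] = True
    return linked


def build_trust(linked, rng, low=-1.0, high=1.0):
    """D[i][j] uniform in [low, high] for linked pairs, 0.0 otherwise."""
    n = len(linked)
    return [
        [rng.uniform(low, high) if linked[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    labels = build_population(1000)
    network = build_network(len(labels), 0.5, rng)
    trust = build_trust(network, rng)
    print(len(labels), labels.count("Nea"))
```

Drawing `D[i][j]` and `D[j][i]` independently allows one-sided trust, which is one plausible reading of "everyone trusts or distrusts everyone"; a symmetric matrix would be an equally simple alternative.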
Fig. 3. Calculation result for N = 1,000. The human network is a random network with a link connection probability of 0.5. The figure on the left shows the time evolution of the opinion trajectory of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The central figure shows the distribution of opinions at the final point of the calculation. The figure on the right shows the time-series change in the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). Dij is randomly set from −1 to 1, so everyone trusts or distrusts everyone. The media effect is zero (A(t) = 0). (Color figure online)
4 Simulations
4.1 Manipulating the Initial Distribution of Opinions
In Fig. 3, the initial opinion distribution of the Innovators was set to ±30. Here we instead bias it to 10 to 30 or to −10 to −30 and observe how adoption propagates. In Fig. 4, the left panel starts the Innovators’ opinions on the plus side and the right panel on the minus side, yet in both cases the shares of adoption (plus) and rejection (minus) remain in competition. This is because the confidence coefficient Dij is random (−1 to 1) and nothing other than the initial opinions biases the system toward plus or minus, so opinions are thought to be leveled out as people exchange them. In addition, among the adopter categories, the Early Adopters are said to hold the key to diffusion, so Fig. 5 shows the results of the calculation with a biased opinion distribution for the Early Adopters. In the first half of the simulation, the positive ratio was high when the Early Adopters’ opinions started positive and the negative ratio was high when they started negative, but in the latter half the positive and negative ratios were about the same. Again, because the confidence coefficient Dij is random (−1 to 1) and nothing other than the initial opinions introduces a bias, opinions are leveled out as people exchange them.
4.2 Manipulation of Confidence Coefficient Dij
What if the Early Adopters are trusted by other adopter categories? Here, the basic conditions are the same as in Fig. 5, but the confidence coefficient from Early Adopters,
Fig. 4. Calculation result for N = 1,000. Left figure: Innovators’ initial opinion distribution = 10 to 30; right figure: Innovators’ initial opinion distribution = −10 to −30. The human network is a random network with a link connection probability of 0.5. The top figures show the time evolution of the opinion trajectory of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom figures show the time-series change in the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). Dij is randomly set from −1 to 1, so everyone trusts or distrusts everyone. The media effect is zero (A(t) = 0). (Color figure online)
Early Majority, Late Majority, and Laggards to the Early Adopters is set to 1 to 2 (non-negative), and the simulation is executed (Fig. 6). It can be confirmed that when the initial opinion distribution of the Early Adopters is positive, the ratio of positive is higher, and when it is negative, the ratio of negative is higher. This is because Early Adopters are trusted by every category other than the Innovators, so their opinions are considered to influence the other three categories. However, this trust can be both a poison and a medicine for those who want to spread an innovation: they need measures (variables that bias the direction of adoption) to keep Early Adopters on their side.
4.3 Manipulating Mass Media Effects
We apply a constant mass media effect to the theory of opinion dynamics and consider how it affects each adopter category and the ratio of adoption (+) and rejection (−). That is, it is assumed that A(t) = Aconst, where Aconst is a constant. Figure 7 is calculated under the same conditions as Fig. 3, except that the media effect A(t) is not zero: it is set to 2.5, 5, and 10. Comparing A(t) = 2.5 and 5, the opinion distribution for A(t) = 5 appears to be more biased towards positive opinions.
Fig. 5. Calculation result for N = 1,000. Left figure: Early Adopters’ initial opinion distribution = 10 to 20; right figure: Early Adopters’ initial opinion distribution = −10 to −20. The human network is a random network with a link connection probability of 0.5. The top figures show the time evolution of the opinion trajectory of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom figures show the time-series change in the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). Dij is randomly set from −1 to 1, so everyone trusts or distrusts everyone. The media effect is zero (A(t) = 0). (Color figure online)
In the case of A(t) = 10, the calculated opinion distribution is clearly shifted in the positive direction. This is consistent with Ishii et al.’s theory of opinion dynamics, which qualitatively explains the phenomenon in which the media effect biases market opinions toward the media-led direction [22]. The effect appears not only for the Innovators but also for the adopter categories that enter later. It can also be seen that people who enter the market first, and are therefore exposed to the mass media for longer, are more strongly affected by the media. As a result, consumers who enter the market later are in contact with, and influenced by, more adopt (+) opinions. This suggests that, with the other simulation conditions unchanged, a larger mass media (advertising) variable shifts the opinions of the adopter categories further toward the media-led direction. Comparing the ratios for A(t) = 2.5, 5, and 10 (Fig. 7, bottom line graphs), the difference between 5 and 10 is larger than the difference between 2.5 and 5. This suggests that the mass media (advertising) variable may have a threshold above which it becomes effective in the dissemination of innovation (between 5 and 10 in this case). Next, the simulation is performed by manipulating the mass media variable with the confidence coefficient toward the Early Adopters set to non-negative. The simulation conditions are as follows. When Early Adopters, which are the key to the diffusion of
Fig. 6. Calculation Result of N = 1,000. Left figure: early adopters initial opinion distribution = 10 to 20, right figure: early adopters initial opinion distribution =−10 to −20. The human network is a random network with a link connection probability of 0.5. The time evolution of the opinion trajectory of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom figures show the time-series change in the ratio of adopt (+) and Reject (−), and red indicates adopt (+) and blue indicates reject (−). Dij from early adopters, early majority, late majority, and laggard to early adopters is randomly set from 1 to 2, and other Dij is randomly set from −1 to 1. The media effect is zero (A (t) = 0). (Color figure online)
innovations, are positive toward adopting the innovation, the adopter categories other than the Innovators trust them, and a mass media variable is set in the adoption (+) direction, the result is as shown in Fig. 8. Since the mass media variable is added, it can be confirmed that the right figure is more biased toward positive than the left figure. In the time-series graph of the ratio of adoption (+) and rejection (−), the movements in the first half of the simulation are similar on the left and right, but in the latter half of the right graph the ratio of reject (−) decreases and the ratio of adopt (+) increases. If Early Adopters, who are considered to be opinion leaders, are positive about adopting innovations, are trusted by other adopter categories, and are supported by media that convey positive opinions, they can be expected to attract the majority.
4.4 Manipulating Network Connection Probabilities
Finally, we vary the connection probability of the nodes in the random network. We again set N = 1,000 in this calculation (Ninn = 25, Nea = 135, Nem = 340, Nlm = 340, Nlg = 160). Dij is randomly set from −1 to 1, so everyone trusts or distrusts
Fig. 7. Calculation result of N = 1,000. The human network is a random network with a link connection probability of 0.5. The above three figures show time evolution of the opinion trajectory of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom line graphs show the time-series change in the ratio of adopt (+) and Reject (−), and red indicates adopt (+) and blue indicates reject (−). Dij is randomly set from −1 to 1, so everyone trusts or distrusts everyone. The mass media effect is A (t) = 2.5, 5, 10 from the left. (Color figure online)
everyone. Here, we set the mass media parameter A to 5 and set the probability of connecting to other nodes in the random network to 0% and 1%. The calculation results are shown in Fig. 9. When the connection probability is 0%, consumers are not influenced by others and form their opinions under the influence of the mass media alone. Consumers who were initially negative shift their opinions from negative to positive over time, so the cumulative share of negative (rejection) falls to zero and the cumulative share of positive (adoption) reaches 100%. The adoption curve confirms that each adopter category joins the plus (adoption) side step by step. When the connection probability is 1%, the distribution is still shifted to the right, but it now extends to the minus side. This means that the distribution of opinions widens as a result of being influenced not only by the mass media but also by the opinions of others. Through long-term contact with media information and increasing contact with people, the share of positive (adopt) increases and the ratio of negative (reject) decreases. In addition, we change the connection probability to 25%, 50%, and 75%. The calculation results are shown in Fig. 10. This calculation shows that the higher the probability of connection between nodes, the more people’s opinions are influenced by the opinions of others. Comparing the spread of opinions between 25% and 75%, the gap between adoption and rejection does not differ much. When people are not sufficiently connected to each other, opinions are formed under the influence of the mass media.
If the connection probability is high, the results suggest that people’s opinions change under the influence
Fig. 8. Calculation result for N = 1,000. In both the left and right figures, the initial opinion distribution of the Innovators is ±30 and the initial opinion distribution of the Early Adopters is 10 to 20. The above two figures show the time evolution of the opinion trajectory of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom line graphs show the time-series change in the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). The human network is a random network with a link connection probability of 0.5. Dij from Early Adopters, Early Majority, Late Majority, and Laggards to Early Adopters is randomly set from 1 to 2, and other Dij is randomly set from −1 to 1. Media effect: left figure A(t) = 0, right figure A(t) = 5. (Color figure online)
of the opinions of others. This suggests that face-to-face communication works effectively in the adoption of innovations. In other words, social media can be expected to be useful for the diffusion of innovations.
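The 0% connection case can be illustrated with a very small sketch. It assumes, purely for illustration, that an isolated agent's opinion obeys dI/dt = A with unit media sensitivity and is integrated with a fixed Euler step; the actual update rule of the Ishii model is not restated in this section, and the function names are hypothetical.

```python
def adopt_share(opinions):
    """Fraction of agents whose opinion is positive, i.e. adopt (+)."""
    return sum(1 for o in opinions if o > 0) / len(opinions)


def media_only_run(opinions, media_strength, dt, n_steps):
    """Assumed media-only dynamics: every opinion drifts by A * dt per step.

    Returns the adopt share before the first step and after each step.
    """
    current = list(opinions)
    shares = [adopt_share(current)]
    for _ in range(n_steps):
        # No neighbour term: with 0% connection only the media acts.
        current = [o + media_strength * dt for o in current]
        shares.append(adopt_share(current))
    return shares


if __name__ == "__main__":
    start = [-30.0, -10.0, 5.0, 20.0]
    shares = media_only_run(start, media_strength=5.0, dt=0.1, n_steps=100)
    print(shares[0], shares[-1])
```

Under this assumption the adopt share can only rise, and an agent starting at opinion I0 < 0 crosses zero after a time of about −I0 / A, which matches the qualitative observation that the rejecting share eventually vanishes when nobody is connected.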
5 Discussion
Based on Ishii et al.’s new opinion dynamics theory, which incorporates both trust and distrust in human relationships, we conducted a simulation incorporating adopter categories (Innovators, Early Adopters, Early Majority, Late Majority, and Laggards). In this simulation, the number of agents is set to 1,000, and the confidence coefficient Dij that connects people is set in two patterns. In the first, Dij is a random number from −1 to 1, so the probability that Dij is positive and the probability that it is negative are each 50%. In the second, the confidence coefficient from Early Adopters, Early Majority, Late Majority, and Laggards toward Early Adopters is a random number from 1 to 2. The person-to-person connection is a random network, a subgraph of the complete graph, with the probability of a link between nodes set to 50%, except in Figs. 9 and 10.
Fig. 9. Calculation result for N = 1,000. The human network is a random network with a link connection probability of 0% on the right and 1% on the left. The above two figures show the time evolution of the opinion trajectory of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom line graphs show the time-series change in the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). Dij is randomly set from −1 to 1, so everyone trusts or distrusts everyone. The media effect is A(t) = 5. (Color figure online)
The composition of the adopter categories is set to 2.5% for Innovators, 13.5% for Early Adopters, 34% for Early Majority, 34% for Late Majority, and 16% for Laggards. Regarding the innovativeness of each adopter category, the order of adoption is clear but the shape of each category’s adoption curve is not, so the categories are added to the simulation sequentially using tanh. Since the detailed adoption curve is expected to differ greatly from category to category, it is necessary to set a reference axis and examine it in detail. Figures 4 and 5 show simulations that manipulate the initial distribution of opinions. The initial opinion distribution of the Innovators (Fig. 4) and of the Early Adopters (Fig. 5) is set to adoption (+) and to rejection (−), respectively, and the simulation is executed. Regardless of whether the initial opinions of the Innovators and Early Adopters were adopting (+) or rejecting (−), there was no significant difference in the adopt/reject ratio. Unless the influence of the mass media and the confidence coefficient are manipulated, it is inferred that the opinions of the people in the market will be leveled. This suggests that an overheated market will cool down, and that an online backlash (flaming), although it requires appropriate action, may eventually settle down without excessive intervention.
Fig. 10. Calculation result for N = 1,000. The human network is a random network with a link connection probability of 25% on the right, 50% in the center, and 75% on the left. The above three figures show the time evolution of the opinion trajectory of each adopter category (Red: Ninn, Light Blue: Nea, Blue: Nem, Green: Nlm, Pink: Nlg). The bottom line graphs show the time-series change in the ratio of adopt (+) and reject (−); red indicates adopt (+) and blue indicates reject (−). Dij is randomly set from −1 to 1, so everyone trusts or distrusts everyone. The media effect is A(t) = 5. (Color figure online)
Figure 6 shows a simulation in which the confidence coefficient Dij is manipulated. Early Adopters are the adopter category expected to play the role of opinion leaders in the diffusion of innovations. Therefore, the adopter categories other than the Innovators are set to trust the Early Adopters (non-negative Dij). As a result, it was confirmed that when the initial opinion distribution of the Early Adopters is adopting (+), the ratio of adopt (+) tends to be high, and when it is rejecting (−), the opposite tendency is observed. Since there is no guarantee that Early Adopters will always side with those promoting the innovation, this suggests that targeted measures, such as campaigns aimed at the Early Adopters, are necessary. Figures 7 and 8 show simulations that manipulate media effects. They show that the media effect shifts people’s opinions toward the media-led direction. In particular, a large difference was confirmed between the media effects A(t) = 5 and 10, which suggests the existence of a threshold above which the media exerts its effect. However, we have not yet executed simulations that assume multiple media, such as mass media and digital media, and we leave this for future study. Figure 8 shows a simulation that combines the initial opinion distribution of the Early Adopters, the confidence coefficient, and the media effect. It can be regarded as an ideal environment for the spread of innovation, because the trusted Early Adopters are positive about the innovation and the influence of the media is biased toward the positive. It is therefore important to build a positive initial opinion distribution among the Early Adopters, trust from the other adopter categories toward the Early Adopters, and media bias in the positive direction. In Figs. 9 and 10, we calculated while changing the connection probability of the nodes of the random network. The mass media effect is set to 5 (A(t) = 5). When the
connection probability is 0%, even consumers who are in the market first and hold rejecting (−) opinions change to adopt (+) under the influence of the mass media. This is because they are affected not by others but only by the mass media. However, the share of people with reject (−) opinions increases as the probability of connection between people increases. This result shows that, to raise the adopt (+) opinions of those who enter the market later, it is important first to lead the opinions of people already in the market toward adopt (+). It also shows that the higher the probability of connection between people, the stronger the influence on the change of opinion.
6 Conclusion
In this paper, we examined how the opinions of the adopter categories change, using the new opinion dynamics theory proposed by Ishii et al., which includes both trust and distrust in human relations. When the influence of the mass media acts uniformly on the market, the distribution of people’s opinions was shown to be biased toward the media-led direction. The media effect was observed to act first on the people already in the market and then on the categories that adopted innovations later. It was confirmed that the presence in the market of Early Adopters with positive opinions, a certain amount of mass media that encourages adoption, and trust in the Early Adopters all promote diffusion. Elucidating the mechanism of opinion transition of new entrants on the basis of opinion dynamics theory is useful for marketing and mass media research. In this work we provided a computational social science method. However, the media that influence human decision-making are wide-ranging, including mass advertising, digital advertising, and SNS, and their influence is not uniform. In addition, the timing of adoption depends largely on the level of involvement of individual consumers, but in this paper it is fixed for convenience. The remaining issues are the construction of a model that considers the heterogeneity of media effects, and the study of other types of human networks and their connection probabilities.
References
1. French, J.R.P.: A formal theory of social power. Psychol. Rev. 63, 181–194 (1956)
2. Harary, F.: A criterion for unanimity in French’s theory of social power. In: Cartwright, D. (ed.) Studies in Social Power. Institute for Social Research, Research Center for Group Dynamics, Ann Arbor (1959)
3. Abelson, R.P.: Mathematical models of the distribution of attitudes under controversy. In: Frederiksen, N., Gulliksen, H. (eds.) Contributions to Mathematical Psychology, pp. 142–160. Holt, Rinehart and Winston, New York (1964)
4. De Groot, M.H.: Reaching a consensus. J. Am. Statist. Assoc. 69, 118–121 (1974)
5. Lehrer, K.: Social consensus and rational agnoiology. Synthese 31, 141–160 (1975)
6. Chatterjee, S.: Reaching a consensus: some limit theorems. In: Proc. Int. Statist. Inst., pp. 159–164 (1975)
7. Rogers, E.M.: Diffusion of Innovations, 5th edn. Free Press, New York (2003)
8. Galam, S.: The Trump phenomenon: an explanation from sociophysics. Int. J. Mod. Phys. B 31, 1742015 (2017)
9. Galam, S.: Are referendums a mechanism to turn our prejudices into rational choices? An unfortunate answer from sociophysics. In: Morel, L., Qvortrup, M. (eds.) The Routledge Handbook to Referendums and Direct Democracy, Chapter 19. Taylor & Francis, London (2017)
10. Granovetter, M.: Threshold models of collective behavior. Am. J. Sociol. 83(6), 1420–1443 (1978)
11. Clifford, P., Sudbury, A.: A model for spatial conflict. Biometrika 60, 581–588 (1973)
12. Holley, R., Liggett, T.: Ergodic theorems for weakly interacting infinite systems and the voter model. Ann. Probab. 3(4), 643–663 (1975)
13. Galam, S.: Rational group decision making: a random field Ising model at T = 0. Phys. A 238, 66 (1997)
14. Jager, W., Amblard, F.: Uniformity, bipolarization and pluriformity captured as generic stylized behavior with an agent-based simulation model of attitude change. Comput. Math. Organ. Theor. 10, 295–303 (2004)
15. Jager, W., Amblard, F.: Multiple attitude dynamics in large populations. In: Presented in the Agent 2005 Conference on Generative Social Progress, Models and Mechanisms, October 13–15. The University of Chicago (2005)
16. Kurmyshev, E., Juarez, H.A., Gonzalez-Silva, R.A.: Dynamics of bounded confidence opinion in heterogeneous social networks: concord against partial antagonism. Phys. A 390, 2945–2955 (2011)
17. Deffuant, G., Neau, D., Amblard, F., Weisbuch, G.: Mixing beliefs among interacting agents. Adv. Complex Syst. 3(15), 87–98 (2000)
18. Weisbuch, G., Deffuant, G., Amblard, F., Nadal, J.-P.: Meet, discuss and segregate! Complexity 7(3), 55–63 (2002)
19. Hegselmann, R., Krause, U.: Opinion dynamics and bounded confidence: models, analysis, and simulation. J. Artif. Soc. Soc. Simul. 5(3), 1–33 (2002)
20. Ishii, A., Kawahata, Y.: Opinion dynamics theory for analysis of consensus formation and division of opinion on the Internet. In: Proceedings of the 22nd Asia Pacific Symposium on Intelligent and Evolutionary Systems (IES2018), pp. 71–76. arXiv:1812.11845 [physics.soc-ph] (2018)
21. Ishii, A.: Opinion dynamics theory considering trust and suspicion in human relations. In: Morais, D.C., Carreras, A., de Almeida, A.T., Vetschera, R. (eds.) GDN 2019. LNBIP, vol. 351, pp. 193–204. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-21711-2_15
22. Ishii, A., Okano, N.: Sociophysics approach of simulation of mass media effects in society using new opinion. In: Arai, K., et al. (eds.) Advances in Intelligent Systems and Computing as the Proceedings of IntelliSys2020 (IntelliSys 2020), AISC 1252, pp. 13–28 (2021)
23. Fujii, M., Ishii, A.: The simulation of diffusion of innovations using new opinion dynamics. In: The 5th International Workshop on Application of Big Data for Computational Social Science (ABCSS2020 @ WI-IAT 2020) (2020)
Intrinsic Rewards for Reinforcement Learning Within Complex 2D Environments
Nathaniel Grabaskas (1, corresponding author) and Zhizhen Wang (2)
(1) PengFei Research, Cupertino, CA 95014, USA
(2) Microsoft Australia, Sydney, Australia
Abstract. In this paper, we propose an approach to train an intelligent agent using reinforcement learning in order to draw on a two-dimensional grid. Painting is a creative art, and it takes human beings years to learn how to draw. In the training process, we build grid environments with obstacles and challenges resembling abstract art and then place the agent in different environments to reach the goal. In phase I, we propose using intrinsic rewards based on the state of the model to stimulate the agent’s desire to explore and to increase its adaptability in complex environments. In phase II, we prototype a rendering pipeline to translate the agent’s movement during the training process into a painting. Our results show that the intrinsic reward method increased the agent’s ability to learn in environments of moderate complexity. The rendering pipeline prototype was evaluated in a single round of crowd-sourced evaluation, and steps for further improvement are outlined.
Keywords: Reinforcement learning · Intrinsic rewards · Deep learning · Grid environments · Generative art · Intelligent agent
1 Introduction
Drawing and creative art are a critical part of human civilisation and culture. Learning how to draw takes humans years of study and practice. Hence we want to explore the idea of training an artificial intelligence agent and visualizing the training to produce interesting and abstract pieces of art. However, the scope of work in this paper is primarily focused on using a reward to stimulate agent curiosity. Phase I of this project focuses on agents learning to explore an environment, avoid obstacles, and reach a goal. Phase II investigates the abstract artwork an agent can generate during the training process. Accordingly, the agent is evaluated quantitatively in phase I and qualitatively in phase II. In phase I, in which the agent learns to explore, we define a 2D environment (canvas) to simulate the painting environment, and each training canvas is initialised with a positive reward grid (goal) and hazard zones (which immediately end the episode) [5].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 425–437, 2022. https://doi.org/10.1007/978-3-030-82193-7_28
The initial parameters for generating the environment are based on abstract art paintings. The next step is to allow the agent to explore the environment while avoiding danger zones and minimizing the number of frames taken to reach the goal. We propose a method of rewarding the agent simply for learning, in an attempt to give it an intrinsic desire to explore the environment [1,10]. The intrinsic reward method proves successful in environments of moderate complexity. Phase II of the experiment is to generate visualizations from the agent’s training and compare which methods produce more interesting art. This was inspired by Luo’s dissertation on artistic applications of reinforcement learning [7]. In order to turn the agent training into artwork, a separate pipeline is set up. This pipeline renders images from the agent training, applies open source painting effects, and sends the results to human raters for qualitative evaluation. Phase II is explored as a prototype design in this paper, and further experiments are outlined. The paper is organized into the following sections: related works, data used, methods, metrics, results and discussion, and finally conclusion and future steps.
2 Related Work
Exploration in sparse reward environments remains a key challenge of reinforcement learning [10]. The authors of [10] propose Rewarding Impact-Driven Exploration (RIDE), a novel type of intrinsic reward which encourages the agent to take actions leading to significant changes in its learned state representation. They evaluate their method on multiple challenging procedurally-generated tasks in MiniGrid. They report that this approach is more sample efficient than existing exploration methods, that the intrinsic reward does not diminish during the course of training, and that it rewards the agent substantially more for interacting with objects it can control. RIDE is computed as the L2-norm of the difference in the learned state representation between consecutive states. However, to ensure the agent does not go back and forth, they discount RIDE by the number of times the state is visited. The parameters used to learn the intrinsic reward signal are used only to determine the exploration bonus and are never part of the agent’s policy. While our intrinsic reward was inspired by theirs, we use the model weights, not the embedding, to calculate the reward. We believe this better represents rewarding the agent’s learning. Similar work was done by Zhewei, Wen, and Shuchang [18], who trained an AI agent to paint on a canvas to generate a painting similar to the target image. Apart from training the agent for a reward policy, they also proposed an approach to decompose the target painting into hundreds of strokes in sequence within a grid. By the end of the project, the agent is trained to be general enough to handle multiple types of images (including digits, handwriting, street views, human portraits, etc.). Another work related to artwork generation is by Ning and his team [17], who focus on a particular type of painting, stroke drawing. They applied inverse reinforcement learning to learn the reward function from the training data.
Our approach differs as we convert the agent movements and interaction with the environment into artwork.
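The RIDE computation summarized above can be illustrated with a short sketch. It is not the implementation from [10]: the two-dimensional embeddings, the state names, and the helper names `l2_distance` and `ride_style_reward` are hypothetical, and dividing by the square root of the visit count is an assumption of this sketch, since the text above only says the reward is discounted by the number of visits.

```python
import math
from collections import Counter


def l2_distance(u, v):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ride_style_reward(phi_t, phi_next, visits_next):
    """Change in the embedding, discounted by how often the next state was seen.

    Dividing by sqrt(visits) is an assumption made for this sketch.
    """
    return l2_distance(phi_t, phi_next) / math.sqrt(visits_next)


# Hypothetical embeddings for an agent stepping back and forth: s0 -> s1 -> s0 -> s1
embeddings = {"s0": [0.0, 0.0], "s1": [3.0, 4.0]}
trajectory = ["s0", "s1", "s0", "s1"]

visits = Counter([trajectory[0]])  # the start state counts as visited once
rewards = []
for prev, nxt in zip(trajectory, trajectory[1:]):
    visits[nxt] += 1
    rewards.append(ride_style_reward(embeddings[prev], embeddings[nxt], visits[nxt]))
```

In this toy trajectory the first move earns the full embedding change, while the repeated moves earn less, which is the back-and-forth penalty described above.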
Intrinsic Rewards for Reinforcement Learning
3
Data
The dataset used is the agent’s training environment. In [18], the authors model the agent’s painting process as a sequential decision-making task. During training, their agent was rewarded based on comparing the current state of the canvas to the target painting. In this project, we use a different strategy. Instead of focusing on decomposing the painting into a sequential environment, we pre-build the training environment based on chosen abstract art from Surma’s Kaggle Open dataset [11]. An example can be seen in Fig. 1.
Fig. 1. This image shows a piece of abstract art chosen from Surma’s Kaggle Open Dataset.
The Kaggle dataset consists of 8,145 abstract art paintings with a resolution of 512 × 512. To get the most out of the experiment, we chose a few artworks with precise edges and repeatable patterns. Then a method was built to transform the raw artwork into a 2D grid representation. The team uses a preprocessing script for resizing, aligning, setting a threshold for detecting edges, blurring, and defining the hazard zones. The interim images are then used as examples to create MiniGrid environments. The Minimalistic Gridworld (MiniGrid) environment seeks to minimize software dependencies, be easy to extend, and deliver good performance for faster training [5]. The environment comes with the existing object types wall, floor, lava, door, key, ball, box and goal. This gives the agent diverse, generated environments with tasks such as getting the key to unlock the door, hazards like lava to be
N. Grabaskas and Z. Wang
avoided, and the objective of getting to the goal. This provides a framework in which to execute our agent training. The team uses customized objects to echo the artworks’ characteristics (for example, different color schemes, shapes, patterns, and transparency). Later, the agent will explore procedurally generated environments and iterate on various policies to achieve maximum rewards. An example MiniGrid environment can be found in Fig. 2.
Fig. 2. This image shows a piece of abstract art converted to a grid environment for the agent to explore, named MiniGrid-PaintingS11Env-v1. The orange tiles represent lava and end the episode if the agent touches them, and the green tile represents the goal the agent is striving to reach. An extrinsic reward is given only if the goal is reached.
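The conversion from artwork to grid described in this section can be pictured with a toy example. The preprocessing script itself is not shown in the paper, so everything here is an assumption made for illustration: the function names `block_mean` and `image_to_grid`, the block-averaging step standing in for resizing, and the rule that dark regions become lava.

```python
def block_mean(image, top, left, size):
    """Average brightness of a size x size block of a 2D grayscale image."""
    total = sum(image[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size))
    return total / (size * size)


def image_to_grid(image, cell, threshold):
    """Downsample a grayscale image and mark dark cells as hazards.

    Treating dark regions as lava is an illustrative choice, not the
    authors' documented rule.
    """
    rows, cols = len(image) // cell, len(image[0]) // cell
    return [["lava" if block_mean(image, r * cell, c * cell, cell) < threshold
             else "floor"
             for c in range(cols)]
            for r in range(rows)]


# Toy 4x4 "painting" with a dark stripe down the left half.
painting = [[10, 20, 200, 210],
            [15, 25, 205, 215],
            [12, 22, 202, 212],
            [18, 28, 208, 218]]
grid = image_to_grid(painting, cell=2, threshold=128)
```

Each 2 × 2 block of pixels becomes one tile, so the dark stripe turns into a column of lava tiles next to a column of floor tiles.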
4
Methods
In this section we discuss some background on how reinforcement learning works, existing learning algorithms, our evaluated intrinsic reward method, and the model architecture used.
4.1
Reinforcement Learning Background
We use a common reinforcement learning setup where an agent interacts with an environment over frames, or discrete time steps. At each frame $t$, the agent receives a representation of the current state $s_t$ and selects an action $a_t$ from a set of possible actions $A$. The policy $\pi$ is a mapping from the given state $s_t$ to an action $a_t$. Afterwards, the agent receives the next state $s_{t+1}$ and possibly a reward $r_t$. This process continues until a terminal state is reached (reaching a goal or the frame limit). The return is the sum of the rewards over the time steps, with a discount factor $\lambda$ that decays later rewards: $R_t = \sum_{k \ge 0} \lambda^k r_{t+k}$. The goal of the agent is to maximize the expected return from each current state [8].
4.2
Model Policies
Advantage Actor-Critic (A2C) - In this algorithm the advantage function captures how much better an action is compared to the others at a given state, while the value function captures how good it is to be in this state. This way the evaluation of an action is based not only on how good the action is, but also on how much better it can be. The benefit of the advantage actor-critic approach is that it reduces the high variance of policy networks and stabilizes the model [6,12]. The Actor-Critic algorithm is a hybrid mechanism combining value optimization and policy optimization. More specifically, Actor-Critic combines the Q-learning and Policy Gradient (PG) algorithms [9]. At a high level, the resulting algorithm involves a loop alternating between:
• Actor: a Policy Gradient algorithm deciding on an action to take.
• Critic: an off-policy reinforcement learning algorithm critiquing the action the actor selected and providing feedback on how to adjust. It can take advantage of efficiency tricks in Q-learning, such as memory replay.
A2C maintains a policy $\pi(s_t; \theta)$ and an estimate of the value function $V(s_t; \theta_v)$. A2C operates in the forward view and uses the same mix of time-step returns to update both the policy $\pi$ and the value function $V^{\pi}$. The policy and the value function are updated after every $t_{max}$ actions. The update performed by the algorithm is
$$\text{update}_{loss} = \pi_{loss} - \xi_H H + \xi_V V_{loss} \tag{1}$$
where $\xi_H$ and $\xi_V$ are the coefficients of the entropy $H$ and the value loss $V_{loss}$ (we use 0.01 and 0.5 respectively) [14]. We use two fully-connected neural networks for the actor and the critic. The actor component outputs the agent action, and the critic outputs the value function estimate. This value function estimate replaces the reward function in policy gradient, which calculates the rewards only at the end of the episode.
A2C Intrinsic Rewards - After backward propagation of the model using the update loss, there is a second round of backward propagation using the intrinsic reward:
$$\text{intrinsic}_{reward} = \lVert \phi_{s_{t+1}} - \phi_{s_t} \rVert_2 \tag{2}$$
where $\phi$ represents the weights for all layers in the CNN [10]. This intrinsic reward encourages the agent to take actions leading to significant changes in its learned state representation, attempting to mimic the sensation of learning or exploring.
4.3
Model Inputs
Input to the model (see Table 1) is the agent’s view of the grid environment (see the environment example in Fig. 2). For all experiments the view distance is set to 11, therefore the input is an array of size (3, 11, 11). Each tile is encoded as a 3-dimensional tuple: (OBJECT, COLOR, STATE) [5].
Table 1. Reinforcement model input by dimension and possible values for each dimension.
Dimension    Represented values
STATE        Open, closed, and locked
OBJECT       Wall, floor, lava, door, key, ball, box, and goal
COLOR        Red, green, blue, purple, yellow, and grey
4.4
Model Architecture
To implement A2C with Deep Learning, 3 components are needed (see Fig. 3). Component 1 is a Convolutional Neural Network (CNN) which takes the agent observation and converts it to an embedding of size 576. The other two components are the Actor and the Critic, which have the same architecture. They both take the embedding output from the CNN and have a single fully-connected layer of 64 neurons [16]. The Actor outputs one of the possible agent actions, while the Critic outputs an estimate of the value function. These components are optimized using RMSProp [13].
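To make the quantities in this section concrete, the following sketch computes the embedding size, the combined loss of Eq. (1), and the weight-change reward of Eq. (2) in plain Python. It is a sketch under stated assumptions, not the authors' code. The layer stack in `embedding_size` (three unpadded 2 × 2 convolutions with one 2 × 2 max-pooling step, ending in 64 channels) is assumed because it reproduces the stated size of 576 for an 11 × 11 view; the section itself only says the CNN has three layers. Reading $\phi_{s_t}$ and $\phi_{s_{t+1}}$ as flattened weight snapshots at consecutive states is also an interpretation, and all function names are hypothetical.

```python
import math


def conv_out(size, kernel=2):
    """Output width of an unpadded, stride-1 convolution."""
    return size - kernel + 1


def embedding_size(view, channels=64):
    """Flattened CNN output for a square view.

    Assumed layer stack: conv 2x2 -> max-pool 2x2 -> conv 2x2 -> conv 2x2,
    ending with `channels` feature maps.
    """
    side = conv_out(view)             # first convolution
    side = side // 2                  # 2x2 max-pooling
    side = conv_out(conv_out(side))   # second and third convolutions
    return side * side * channels


def update_loss(policy_loss, entropy, value_loss, xi_h=0.01, xi_v=0.5):
    """Eq. (1): policy loss minus entropy bonus plus weighted value loss."""
    return policy_loss - xi_h * entropy + xi_v * value_loss


def intrinsic_reward(phi_t, phi_next):
    """Eq. (2): L2 norm of the change between two flattened weight snapshots."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(phi_t, phi_next)))


size = embedding_size(11)
loss = update_loss(policy_loss=1.0, entropy=2.0, value_loss=4.0)
reward = intrinsic_reward([0.0, 0.0, 1.0], [0.3, 0.4, 1.0])
```

The loss and reward inputs are made-up numbers chosen so the arithmetic is easy to follow by hand; only the coefficients 0.01 and 0.5 and the view size 11 come from the text.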
5
Metrics
We outline both the quantitative and qualitative metrics we use to evaluate our experiments.
5.1
Quantitative Agent Comparison
In each episode the grid is set up and the agent is given a reward for reaching the goal; this reward decays with each frame of the episode. No reward is given if the frame limit is reached or the agent touches a hazard. Quantitative metrics are used to compare agents trained with different algorithms in the same
Fig. 3. This figure shows a diagram of the model architecture. The CNN contains three layers which take the agent’s view as input and output an embedding of size 576. The actor and critic components each take this embedding as input and output an action (size = 7) and a value function estimate (size = 1) respectively.
environment. Each agent is evaluated over the entire training process and its aggregate performance is used. The baseline is the normal A2C algorithm trained and evaluated on each grid environment, both basic and art inspired. We compare the intrinsic reward variation against A2C on each environment and discuss successes and shortcomings. We chose two metrics to evaluate the algorithms. Mean Reward, the first metric, is the average reward across all episodes played; Max Reward, the second metric, is the highest reward the agent received across all episodes played. We do not discuss them further, but there are two other metrics one could consider: mean frames and max frames. The difficulty with frames as a metric is that lower is not always better. Given an environment with hazards, higher may be better, because the agent learned to avoid the obstacles.
5.2
Qualitative Comparison
After training the agents, we sent each trained agent through 50 evaluation episodes for each environment. For each episode we create a time-lapse graphic representation of the environment. The agent’s movements are shown throughout the episode, with brighter spots representing where the agent spent more time. Next, we used the OpenCV xphoto oil painting effect [3] to take this image and create an abstract representation. These images were combined to create an A/B comparison (see Fig. 4). Comparisons were between the same environments and agents with the same number of training frames. We again use the normal A2C agent output as the
baseline and our intrinsic model as the comparison. The ordering was alternated so one source was not always on the same side. We sampled 100 comparison images from the available 350 to be sent off for rating. These comparisons were placed in front of human raters using Mechanical Turk (MTurk), who were asked which image they find more interesting. The major challenge with the qualitative comparison is the subjectivity of what each rater finds interesting. Image ordering is varied to help reduce bias, in case a rater learns to prefer one method over the other and expects the better image to be on a particular side. Each sample is placed before three raters and their ratings are combined so we are not relying on a single observer; this also helps to reduce bias. The question and possible answers are shown below:
• Question - With an A/B comparison of images, “Which image do you find more interesting?”
• Answer - A, B, Same.
Fig. 4. This figure shows two images rendered using the agent environment time lapse along with OpenCV oil painting effect. Image A is from normal A2C and image B is from A2C intrinsic. In this example the agent on the right did more exploration and created a visualization with more effect.
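The time-lapse step described in this section, where brighter spots mark cells the agent occupied longer, can be sketched as a visit-count map. The function name `time_lapse`, the 0 to 255 brightness scale, and the toy episode are assumptions for illustration; the oil painting filter that the paper applies afterwards is left out of the sketch.

```python
from collections import Counter


def time_lapse(positions, width, height):
    """Turn visited (x, y) cells into a brightness map scaled to 0..255."""
    counts = Counter(positions)
    peak = max(counts.values())
    return [[round(255 * counts[(x, y)] / peak) for x in range(width)]
            for y in range(height)]


# Hypothetical episode on a 3x2 grid: the agent lingers at cell (1, 0).
episode = [(0, 0), (1, 0), (1, 0), (1, 0), (2, 0), (2, 1)]
heatmap = time_lapse(episode, width=3, height=2)
```

Cells the agent never entered stay at 0, and the most-visited cell is scaled to full brightness, so an agent that explores more produces a map with more lit cells.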
6
Results and Discussion
In this section we cover the experiment setup and report quantitative and qualitative results. We also discuss what the results show, comparing variations of environments and training algorithms, and give a piece-by-piece analysis of the training visualization pipeline.
6.1
Experiment Setup
We combined and built upon three repositories for our intrinsic rewards and agent training environment.
1. Gym-MiniGrid [5] - contains the framework for the grid environment and the obstacles/interactions available to the agent, built on top of the popular OpenAI Gym [4].
2. Torch-AC [15] - contains the base algorithm implementations for Advantage Actor-Critic [8].
3. RL-Starter-Files [16] - contains the framework for training the agent on each environment, storing model states, and evaluating agent performance.
In the experiment, we first set up the Gym-MiniGrid environment as the base environment, which provides the playground for the agents to explore. Then the Advantage Actor-Critic (A2C) algorithm was implemented using Torch-AC. We also enhanced the default A2C algorithm by implementing the intrinsic reward when updating the agent’s parameters during each batch. A list of abstract-art-inspired environments was implemented in Gym-MiniGrid to train the agent. In each environment, the obstacle objects were custom generated based on either precise rules or a constraint policy. In some of the more complicated grid worlds, the agent needs to complete a list of tasks (picking up a colored key or going through a linked door) to reach the goal. The agent was trained with the RL-Starter-Files repository, using A2C and A2C intrinsic, for 3 million frames. The agent’s convergence rate is highly correlated with the complexity and safety of the environment [2], and we found that not all agents could find a path to the goal by the end of the experiments. We utilize two types of results, for quantitative and qualitative evaluation. For the quantitative comparison, the mean and the maximum of the agent’s rewards were picked to show the general performance in each setup. For the qualitative evaluation, we first rendered an image using the time-lapsed agent’s movement on the grid and enhanced the raw image with a visual effect before sending it for review.
6.2
Quantitative Results
The baseline agent without intrinsic rewards did better on our baseline environments DoorKey, FourRooms, and MultiRoom. These environments are less complex, with FourRooms and MultiRoom requiring only navigation, with no hazards. They did not require exploration or curiosity on the part of the agent, eliminating any advantage the intrinsic agent had. While both algorithms were able to adapt to the environment, giving a reward for “learning” did not benefit the intrinsic agent, which showed a 14.79% decrease in mean reward on average across these environments (see Table 2). However, the intrinsic reward agent does achieve a higher maximum reward on FourRooms and MultiRoom.
Table 2. All of these agents were trained for 3M frames on each environment. The mean and max are shown for the entire training period. This method of evaluation reduces the variance in agent performance that comes from using only a small number of final episodes.
Environment           A2C - μ r_t   A2C - max r_t   A2C Intrinsic - μ r_t   A2C Intrinsic - max r_t
DoorKey-8x8           0.0115        0.1112          0.0094                  0.1068
FourRooms             0.3623        0.7733          0.2734                  0.7923
MultiRoom-N2-S4       0.6738        0.8396          0.6631                  0.8500
PaintingS11N5Env      0.0000        0.0000          8.08E−07                0.0190
PaintingV1S11N5Env    0.0075        0.0800          0.0065                  0.0721
PaintingS9Env         0.0012        0.0786          0.2685                  0.9474
PaintingS11Env        0.0012        0.0585          0.0018                  0.0790
PaintingS11N5 and PaintingV1S11N5 are complex, procedurally generated environments which neither algorithm was able to train well on. Since rewards were only given for reaching the goal, the agent is not given a reward for making progress within the environment. These maze-like environments change each episode, which makes it difficult for the agent to learn to navigate. In the future we could try first training the agent on a smaller version of the maze and then expanding the size of the maze as the agent learns to navigate. Agents learning to navigate in a sparse-reward, complex environment is a common problem and one we tried to overcome with intrinsic rewards. Further experiments are still needed here. PaintingS9 and PaintingS11 are less complex, singleton environments, but they still require some exploration to move around the lava and find the goal. In both of these environments the A2C agent with intrinsic reward was able to explore the environment and learn to reach the goal not only in fewer frames but also more consistently than the baseline A2C.
6.3
Qualitative Results
For our qualitative analysis, each image combination was shown to three raters and we use the average across all three raters. A vote for the baseline image as more interesting was given a 0.0, and a vote for the intrinsic image as more interesting was given a 1.0. A vote for “they are the same” was given a 0.5. The average was taken across all 300 ratings, and this gave a score of 0.503. We also applied a threshold-based count, in which an image pair counts only when at least two of the three votes agree. This means 97 of the 100 image pairs received two out of three votes, with only three pairs receiving a neutral score. Baseline images received 48 votes and intrinsic images received 49 votes. For examples of success and failure as viewed by the raters, see Fig. 5 for intrinsic success and Fig. 6 for intrinsic failure. The 100 samples sent off for rating came back almost completely even, at 0.503 toward the intrinsic artwork being more interesting. This shows there is a lot of room to improve both our rendering pipeline and the agent rewards to prioritize “good art” creation. At each step in the rendering process there is the potential for improvement. A time lapse of the environment is only one option; we could also
Fig. 5. This figure shows two generated art images. In this image the left is intrinsic and the right is baseline derived. The raters unanimously chose the left as more interesting, most likely because the agent explored more and thus created more movement within the artwork.
Fig. 6. This figure shows two generated art images. In this image the left is intrinsic and right is baseline derived. The raters unanimously chose right as more interesting. In this case the agent on the left created a darker image with less contrast.
look at a time lapse of the agent’s view, or a time lapse of the layers of the CNN or Actor/Critic components. Instead of a time lapse we could attach a theoretical brush to the agent and render brush strokes based on the agent’s movement. If the critic could evaluate how good the final artwork is, this could lead to more interesting artwork. Additionally, instead of the oil painting effect, we could try a watercolor effect, brush stroke effect, cartoon effect, or sketch effect. The color scheme could be changed to give more variety, or less for a monochromatic final image. Each of
these steps could be evaluated using human raters to determine the best decision at each phase, and all steps combined to create even more interesting final artwork. Asking a human rater which image they find more interesting is very subjective, and this was clearly shown in our survey results. Perhaps if there were a target painting, we could ask which model version was able to more closely match the target. We could also ask more pointed questions, such as which image is more colorful or which one more closely matches a specific style. Since human ratings on art are very subjective, it would also be very difficult to compare results from multiple rounds of evaluations, since each group of raters would be different. We need to use the same group, or a group with a similar makeup, to increase confidence between evaluation rounds.
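The two tallies used in the qualitative results, the average score over all votes and the two-out-of-three majority count, can be sketched as follows. The votes below are hypothetical, since the raw ratings are not included in the paper, and the function names and the handling of a pair with no two-vote majority as "neutral" are assumptions of this sketch.

```python
SCORES = {"baseline": 0.0, "same": 0.5, "intrinsic": 1.0}


def mean_score(ratings):
    """Average score over every individual vote."""
    votes = [SCORES[vote] for pair in ratings for vote in pair]
    return sum(votes) / len(votes)


def majority_tally(ratings):
    """Count pairs won by each side; a side wins only with at least 2 of 3 votes."""
    tally = {"baseline": 0, "intrinsic": 0, "neutral": 0}
    for pair in ratings:
        winner = "neutral"
        for side in ("baseline", "intrinsic"):
            if pair.count(side) >= 2:
                winner = side
        tally[winner] += 1
    return tally


# Hypothetical votes for four A/B pairs, three raters each.
ratings = [
    ("intrinsic", "intrinsic", "baseline"),
    ("baseline", "baseline", "baseline"),
    ("intrinsic", "same", "baseline"),
    ("intrinsic", "intrinsic", "same"),
]
average = mean_score(ratings)
tally = majority_tally(ratings)
```

An average near 0.5, as in the reported 0.503, means the raters as a whole showed no clear preference, even when individual pairs had clear winners.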
7
Conclusion and Future Work
Our work set out to give reinforcement learning agents a reward beyond just the explicit reward of reaching a goal, and to visualize this training as a way to generate interesting abstract artwork. This intrinsic reward incentivized model state changes to promote curiosity and learning. We evaluated a method of determining the difference between two states using the L2 norm, adding this to the loss, and performing a second round of backward propagation. We evaluated this change on the Advantage Actor-Critic algorithm, with Gym-MiniGrid for the environment. We looked at multiple baseline environments: one involved a simple task of using a key to open a door and the others involved navigating through a limited number of rooms to reach the goal. Additionally, we created 4 environments inspired by abstract art: two complex procedurally generated mazes with hazards, and two singleton environments with hazards. The A2C algorithm with intrinsic rewards was able to learn the singleton environments far more effectively, mastering PaintingS9, where the normal A2C was unable to achieve even small rewards. Our experiments show the normal A2C algorithm was able to learn the baseline environments more effectively, with the intrinsic variant’s average rewards 14.79% lower. However, the intrinsic variant did show higher maximum rewards on two of the three baseline environments, suggesting the rewards do influence the agent’s desire to explore and at times lead to better maximum performance. The complex procedurally generated environments were too difficult for either variant to learn. From our qualitative analysis in phase II, it is clear our prototype to create artwork from the agent’s actions has opportunities to improve at multiple points. Our rewards need to better incentivize art creation actions, our rendering pipeline to convert actions to artwork has many untested assumptions which need to be verified, and our evaluation needs more specific questions with a similar group of raters for each iteration.
The intrinsic reward showed positive training results, and there are areas for future work. Further training and analysis are needed across more environments and for longer training periods. A stepped approach, where an agent is trained on a simple environment and then tuned on a more complex environment, could help the agent learn faster than starting with random weights on a complex environment. Additional tuning of agent view distance and learning parameters may also help the agent explore effectively.
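As a check on the 14.79% figure quoted in the conclusion, the mean rewards in Table 2 can be compared directly. The sketch below assumes the figure is the average of the per-environment relative decreases over the three baseline environments; the paper does not state how it was computed, so this is one plausible reading, and the names `baseline_envs` and `relative_decrease` are hypothetical.

```python
# Mean rewards from Table 2: (A2C baseline, A2C intrinsic).
baseline_envs = {
    "DoorKey-8x8": (0.0115, 0.0094),
    "FourRooms": (0.3623, 0.2734),
    "MultiRoom-N2-S4": (0.6738, 0.6631),
}


def relative_decrease(baseline, intrinsic):
    """Fractional drop of the intrinsic agent's mean reward versus the baseline."""
    return (baseline - intrinsic) / baseline


drops = {env: relative_decrease(*means) for env, means in baseline_envs.items()}
average_drop = 100 * sum(drops.values()) / len(drops)
```

Under this reading the result comes out very close to the quoted value; a small difference in the last digit is expected because the table entries are rounded.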
References
1. Al-Shedivat, M., Lee, L., Salakhutdinov, R., Xing, E.: On the complexity of exploration in goal-driven navigation. In: Relational Representation Learning Workshop (NIPS 2018), arXiv:1811.06889 (2018)
2. De Biase, A., Namgaudis, M.: Creating Safer Reward Functions for Reinforcement Learning Agents in the Gridworld. University of Gothenburg Chalmers University of Technology, Sweden (2018)
3. Bradski, G.: The OpenCV library. Dr. Dobb’s J. Softw. Tools (2000)
4. Brockman, G., et al.: OpenAI gym. In: ACL 2016, arXiv:1606.01540 (2018)
5. Chevalier-Boisvert, L.M., Willems, L., Pal, S.: Minimalistic gridworld environment for openAI gym (2018)
6. Degris, T., Pilarski, P.M., Sutton, R.S.: Model-free reinforcement learning with continuous action in practice. In: 2012 American Control Conference (ACC), pp. 2177–2182 (2012)
7. Luo, J.: Reinforcement learning for generative art. Ph.D. thesis, UC Santa Barbara (2020)
8. Mnih, V., et al.: Asynchronous methods for deep reinforcement learning. Computer Vision and Pattern Recognition (cs.CV), arXiv:1903.04411v3 (2019). Version 3
9. Mnih, V., et al.: NIPS Deep Learning Workshop 2013, arXiv:1312.5602v1 (2013)
10. Raileanu, R., Rocktäschel, T.: RIDE: rewarding impact-driven exploration for procedurally-generated environments. Machine Learning (cs.LG), arXiv:2002.12292v2 (2020). Version 2
11. (Grzegorz) Surma, G.: Abstract art images (2019)
12. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
13. Tieleman, T., Hinton, G.: Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning (2012)
14. Wang, Z., et al.: Sample efficient actor-critic with experience replay. In: ICLR 2017, arXiv:1611.01224v2 (2016). Version 2
15. Willems, L., Karra, K.: PyTorch actor-critic deep reinforcement learning algorithms: A2C and PPO (2018)
16. Willems, L., Yuan, Y., Bahdanau, D., Chevalier-Boisvert, M.: RL starter files (2018)
17. Xie, N., Zhao, T., Tian, F., Zhang, X., Sugiyama, M.: Stroke-based stylization learning and rendering with inverse reinforcement learning. In: IJCAI 2015: Proceedings of the 24th International Conference on Artificial Intelligence, July 2015. Version 2
18. Zhou, S., Huang, Z., Heng, W.: Learning to paint with model-based deep reinforcement learning. Computer Vision and Pattern Recognition (cs.CV), arXiv:1903.04411v3 (2019)
Analysis of Divided Society at the Standpoint of In-Group and Out-Group Using Opinion Dynamics
Nozomi Okano and Akira Ishii
Tottori University, Tottori 680-8552, Japan
[email protected]
Abstract. This is a study using Ishii’s opinion dynamics to simulate a divided society. It is assumed that there is trust and distrust between people in society and that society is influenced by the mass media. We distinguished the in-group trust within each of the two divided groups from the out-group trust between the groups, and simulated the degree of social division.
Keywords: Opinion dynamics · Divided society · Conflict
1
Introduction
After the 2020 presidential election, modern American society is deeply divided. There are other examples of serious social division in world history. Moreover, even in modern times, various divisions can be seen in various countries. The worst consequence of the division of society was the American Civil War. In such a divided society, the people are divided into multiple groups, and there is no relationship of trust between the groups. In addition, the strength of the in-group trust within each group should also be related to the division of society. Therefore, let us analyze this fragmented society using Ishii’s opinion dynamics [1–3], which deals with both trust and distrust among people in the theory of opinion dynamics. This model is named the Trust-Distrust Model [4]. In Sect. 2, we introduce the Trust-Distrust Model. In Sect. 3, we show the model setting for the social simulation of this paper. In Sect. 4, we show the results of the simulation. In Sect. 5, we review and discuss the results. The conclusion is presented in Sect. 6.
2
Trust-Distrust Model
2.1
Theory of Trust-Distrust Model
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 438–452, 2022. https://doi.org/10.1007/978-3-030-82193-7_29
Divided Society
The Trust-Distrust Model is based on the original bounded confidence model [5–7]. For a fixed agent $i$, where $1 \le i \le N$, we denote the agent’s opinion at time $t$ by $I_i(t)$. According to Hegselmann-Krause [5], the opinion formation of agent $i$ can be described as follows.
$$I_i(t+1) = \sum_{j=1}^{N} D_{ij} I_j(t) \tag{1}$$
This can be written in the following form.
$$\Delta I_i(t) = \sum_{j=1}^{N} D_{ij} I_j(t) \Delta t \tag{2}$$
where it is assumed that $D_{ij} \ge 0$ for all $i, j$ in the model of Hegselmann-Krause. The coefficient $D_{ij}$ is a factor that determines the speed of consensus building in the society. Based on this definition, $D_{ij} = 0$ means that the opinion of agent $i$ is not affected by the opinion of agent $j$. This theory implicitly expects that the final goal of negotiation among people is the formation of consensus. However, in real societies, the formation of consensus among people is sometimes very difficult. The Trust-Distrust Model improves on this point by incorporating a sense of distrust among the people of a society [1–3]. Thus, we employ the Trust-Distrust Model in this paper. The details of the Trust-Distrust Model are described in [2,3]. The Trust-Distrust Model uses the following equation [2].
$$\Delta I_i(t) = c_i A(t) \Delta t + \sum_{j=1}^{N} D_{ij} \Phi(I_i(t), I_j(t)) (I_j(t) - I_i(t)) \Delta t \tag{3}$$
where $I_i(t)$ is the opinion of agent $i$ at time $t$. In this model, $D_{ij}$ can be negative. A negative $D_{ij}$ means that agent $i$ distrusts agent $j$. The value range of $I_i(t)$ is $-\infty \le I_i(t) \le +\infty$. In the first term on the right-hand side, $A(t)$ is the mass media effect. This mass media term comes from the mass media term in the mathematical model for hit phenomena [8]. We assume here that $D_{ij}$ is an asymmetric matrix: in general $D_{ij} \ne D_{ji}$, and $D_{ij}$ and $D_{ji}$ can have different signs.
2.2
Two-Agents Calculation
We now solve the Trust-Distrust Model for two people. Let the agents be A and B. The equations are as follows.
$$\Delta I_A(t) = c_A A(t) \Delta t + D_{AB} \Phi(I_A(t), I_B(t)) (I_B(t) - I_A(t)) \Delta t \tag{4}$$
$$\Delta I_B(t) = c_B A(t) \Delta t + D_{BA} \Phi(I_B(t), I_A(t)) (I_A(t) - I_B(t)) \Delta t \tag{5}$$
As you can see from Fig. 1, if the coefficients of trust between A and B, DAB and DBA , are positive, then A and B trust each other, and their opinions get
N. Okano and A. Ishii
closer over time to reach a consensus. However, if the coefficients $D_{AB}$ and $D_{BA}$ are negative, A and B distrust each other, and their opinions repel each other and drift apart. After a certain degree of disagreement, they ignore each other, and so their opinions become parallel. When the coefficients of trust between A and B, $D_{AB}$ and $D_{BA}$, are positive, the model is the same as the conventional bounded confidence model [5–7], where the coefficients $D_{AB}$ and $D_{BA}$ correspond to the speed of reaching consensus. On the other hand, the case where the coefficients of trust between A and B are negative is not included in the bounded confidence model and is a characteristic of the Trust-Distrust Model. Since the Bounded Confidence Model implicitly assumes trust among all people in a society, the coefficients of trust $D_{ij}$ are all positive, and it is not possible to handle dissent due to distrust.
Fig. 1. Calculation result for N = 2. (a) DAB > 0 and DBA > 0. (b) DAB < 0 and DBA < 0.
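The two-agent behaviour described above can be reproduced with a small Euler integration of Eqs. (4) and (5). This is a sketch under stated assumptions: the mass media term is switched off ($A(t) = 0$), and because this section does not define $\Phi$, the code uses a smooth sigmoid cutoff with hypothetical parameters `beta` and `b`, chosen so that agents stop influencing each other once their opinions are far apart. The starting opinions, step size, and function names are also illustrative.

```python
import math


def phi(x, y, beta=1.0, b=5.0):
    """Assumed smooth cutoff: near 1 for close opinions, near 0 beyond a gap of b."""
    return 1.0 / (1.0 + math.exp(beta * (abs(x - y) - b)))


def simulate_pair(d_ab, d_ba, i_a=1.0, i_b=-1.0, dt=0.01, steps=1000):
    """Euler integration of Eqs. (4)-(5) with no mass media term, A(t) = 0."""
    for _ in range(steps):
        delta_a = d_ab * phi(i_a, i_b) * (i_b - i_a) * dt
        delta_b = d_ba * phi(i_b, i_a) * (i_a - i_b) * dt
        i_a, i_b = i_a + delta_a, i_b + delta_b
    return i_a, i_b


trust_a, trust_b = simulate_pair(d_ab=1.0, d_ba=1.0)
distrust_a, distrust_b = simulate_pair(d_ab=-1.0, d_ba=-1.0)
gap_trust = abs(trust_a - trust_b)
gap_distrust = abs(distrust_a - distrust_b)
```

With mutual trust the gap between the two opinions shrinks toward zero, and with mutual distrust it widens until the cutoff makes the agents ignore each other, matching the qualitative picture in Fig. 1.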
2.3
Calculation for 300 Persons
In the following, the number of people in society is set to 300. The network that connects people is a random network, and the probability of connecting two nodes is 30%. The coefficient $D_{ij}$ of trust or distrust from person to person is determined by a random number from −1 to 1. As is known from Ishii-Kawahata’s research [9], in a complete network, if more than 55% of the confidence coefficients $D_{ij}$ among people are positive, society will reach consensus. The threshold of 55% does not change even with a random network [10]. Figure 2 shows a typical example. In Fig. 2(a), the confidence coefficients $D_{ij}$ are all positive, so society has reached consensus. This is the same result as the calculation of the Bounded Confidence Model so far. On the other hand, in Fig. 2(b), half of the confidence coefficients $D_{ij}$ are positive, and the other half are negative, meaning distrust. In this case, as seen in the calculation, society does not form consensus and the distribution of opinions expands.
Fig. 2. Calculation results of opinion dynamics calculation for (a) 100% positive trust and (b) 50% positive trust. The left figures are trajectories of opinion of people in society and the right figures are the opinion distribution at time = 10. The number of people in society is set to be 300.
What can be seen in the calculation of (b) is a typical example of the opinion distribution in general society. Figure 3 shows typical examples of the opinion distributions near the threshold value of 55% for the positive rate of the confidence coefficient $D_{ij}$, for 300 persons on a random network with a connection probability of 30%. It can be seen that if the rate is more than 55%, the society will reach consensus, and if it is 55% or less, the opinion distribution tends to expand without consensus building. In the rest of this paper, we calculate the case where society is divided into two groups. If the positive rate of the confidence coefficient $D_{ij}$ within each of the two groups that form society is more than 55%, a consensus will be formed within each group. In this case, each group forms a consensus on a different opinion. If the percentage of positive confidence coefficients $D_{ij}$ in a group is 55% or less, the group does not reach consensus. Whether or not a group forms a consensus is thus determined by the ratio of positive confidence coefficients $D_{ij}$, as described above. This criterion remains useful for the analysis below, where society is divided into two groups.
Fig. 3. Calculation results of the opinion distributions at time = 10 for (a) 57% positive trust, (b) 56% positive trust, (c) 55% positive trust and (d) 54% positive trust.
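The many-agent calculation of this section can be sketched in the same way as the two-agent case. The sketch uses 30 agents instead of 300 to keep the run short, and several details are assumptions because the section does not fix them: the sigmoid form of $\Phi$ with parameters `beta` and `b`, initial opinions drawn uniformly from [−1, 1], and drawing each $D_{ij}$ as a uniform magnitude with a sign that is positive with probability `positive_share`. The function names are hypothetical.

```python
import math
import random


def phi(x, y, beta=1.0, b=5.0):
    """Assumed smooth cutoff: near 1 for close opinions, near 0 beyond a gap of b."""
    return 1.0 / (1.0 + math.exp(beta * (abs(x - y) - b)))


def simulate_society(n, positive_share, link_prob=0.3, dt=0.01, steps=1000, seed=0):
    """Euler integration of Eq. (3) with A(t) = 0 on a random network.

    Returns the opinion spread (max - min) at the start and at the end.
    """
    rng = random.Random(seed)
    # Undirected random network: each pair is linked with probability link_prob.
    neighbours = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < link_prob:
                neighbours[i].append(j)
                neighbours[j].append(i)
    # Directed coefficients d[i][j]: positive (trust) with probability positive_share.
    d = [{j: rng.random() * (1.0 if rng.random() < positive_share else -1.0)
          for j in neighbours[i]} for i in range(n)]
    opinions = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start_spread = max(opinions) - min(opinions)
    for _ in range(steps):
        opinions = [
            x + dt * sum(d[i][j] * phi(x, opinions[j]) * (opinions[j] - x)
                         for j in neighbours[i])
            for i, x in enumerate(opinions)
        ]
    return start_spread, max(opinions) - min(opinions)


start, end = simulate_society(n=30, positive_share=1.0)
```

With every coefficient positive, each update pulls an agent toward its neighbours, so the spread of opinions narrows, as in Fig. 2(a). Lowering `positive_share` toward the threshold discussed above is the experiment behind Fig. 3; no outcome for that case is claimed for this reduced sketch.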
3
Model Setting for Social Simulation
In the calculations in this paper, we assume that the number of persons is 300. For the mass media effect, we set $A(t) = 0$, i.e., there is no mass media in this paper. The human network is assumed to be a random network with a connection probability of 30%. The mutual $D_{ij}$ values between the 300 people are drawn from uniform random numbers. In general, people decide which group to belong to based upon their social identity or political identification [11,12]. In deciding which group to support, they identify the group they belong to as their “in-group” and distinguish other groups as “out-groups” [13]. They strengthen a sense of unity with the members of the group they identify with and start to support the in-group’s issues [12]. In this paper, we perform two types of calculations as an extension of the previous works [3,14]. In the first type, as shown in Fig. 4, society is divided into two groups, A and B. Relationships within each group are in-group relationships, and the relationship between the groups is an out-group relationship. In the calculation, the number of people in society as a whole is 300, with 150 people belonging to group A and 150 people belonging to group B. The ratio of positive confidence coefficients $D_{ij}$ is set to $T_A$ for group A and $T_B$ for group B. The ratio of positive
Divided Society
Fig. 4. The first type of model in our simulation. The society is divided into two groups, A and B. The in-group trust of A is TA and the in-group trust of B is TB. The out-group trust between A and B is TAB.
confidence coefficient Dij between the groups A and B is TAB. In this paper, TAB is set to be zero so that every confidence coefficient Dij between the people of group A and the people of group B is between −1 and 0. The second type of our simulation model is shown in Fig. 5. In this case as well, society is divided into two groups, A and B, but some people in group B trust the people in group A. The members of Group B trust each other equally, but some of them also trust the people in Group A. This second type of simulation model was first introduced in our previous work [3]. In the calculation, the number of people in society as a whole is 300, with 150 people belonging to group A and 150 people belonging to group B. We set that 50 of the people in Group B also
Fig. 5. The second type of model in our simulation. The society is divided into two groups, A and B. The in-group trust of A is TA and the in-group trust of B is TB1. Within group B, there is a subgroup B2. The out-group trust between group A and group B2 is TB2.
trust the people in Group A. This model is the same as the last model of [3], but here we run many simulations with different settings. This model corresponds, for example, to Republicans in American society after the 2020 presidential election who look to compromise with the Democratic Party rather than follow President Trump. Similarly, in Japanese society, there are examples of opposition parties that cooperate with the ruling party.
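The two model settings described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes each Dij has a magnitude drawn uniformly from [0, 1) and a positive sign with probability equal to the relevant ratio (TA, TB, TAB, or TB2). The function name `build_trust_matrix`, the sampling scheme, the placement of B2 as the first members of B, and the choice to apply TB2 only to the trust of B2 members in A are all assumptions made for illustration.

```python
import random

def build_trust_matrix(n_a, n_b, t_a, t_b, t_ab, n_b2=0, t_b2=0.0, seed=0):
    """Return D where D[i][j] is the trust of person i in person j.

    Persons 0..n_a-1 form group A; the rest form group B. The first n_b2
    members of B are the subgroup B2 (hypothetical layout). Each entry is
    positive with the given probability and negative otherwise.
    """
    rng = random.Random(seed)
    n = n_a + n_b

    def draw(p_positive):
        magnitude = rng.random()  # uniform in [0, 1)
        return magnitude if rng.random() < p_positive else -magnitude

    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-trust entry
            i_in_a, j_in_a = i < n_a, j < n_a
            if i_in_a and j_in_a:
                p = t_a    # in-group trust within A
            elif not i_in_a and not j_in_a:
                p = t_b    # in-group trust within B
            elif not i_in_a and (i - n_a) < n_b2:
                p = t_b2   # trust of a B2 member in a member of A
            else:
                p = t_ab   # remaining out-group trust between A and B
            d[i][j] = draw(p)
    return d

# Second model: 150 + 150 people, 50 members of B also trusting group A.
D = build_trust_matrix(150, 150, t_a=0.7, t_b=0.7, t_ab=0.0, n_b2=50, t_b2=0.7)
```

Setting `n_b2=0` reduces this sketch to the first model, in which all cross-group entries are governed by TAB alone.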
4 Results
4.1 Calculation for the First Model
First, Fig. 6 shows a case where the two groups A and B each form a consensus as an in-group. Assuming that the ratio of positive values of the confidence coefficient within A and within B as in-groups is 80%, consensus is formed, as shown by past studies [9,10] for complete and random networks. In Fig. 6, groups A and B converge on different opinions and form consensus separately. In this case, the confidence coefficients between people of A and B as an out-group are all negative. In other words, the society is completely divided in two. An example of this is American society at the time of the American Civil War.
Fig. 6. Calculation results of opinion dynamics calculation for TA = 0.8, TB = 0.8 and TAB = 0. The left figures are trajectories of opinion of people in society and the right figures are the opinion distribution at time = 10. The number of people in society is set to be 300.
Figure 7 shows the calculations when the ratio of positive values of the confidence coefficient Dij within groups A and B as in-groups is 60%, 55%, 50%, and 40%. The confidence coefficient Dij between A and B as an out-group is negative, between −1 and 0. In Fig. 8, the ratio of positive values of the confidence coefficient Dij within A and B as in-groups is 55%, and the ratio of positive values of the confidence coefficient Dij between A and B as an out-group is 30%, 50%, 60%, and 80%. If the ratio of positive values of the confidence coefficient Dij among the people who make up group A and
Fig. 7. Calculation results of opinion dynamics calculation for TAB = 0. The in-group trust is set to be (a) TA = TB = 0.6, (b) TA = TB = 0.55, (c) TA = TB = 0.5 and (d) TA = TB = 0.4. The figures are trajectories of opinion of people in society. The number of people in society is set to be 300.
group B as an out-group is finite (nonzero), the trajectories of the opinions of the people of A and B are mixed due to the mutual trust relationship. In particular, in the case of TAB = 0.8, the strength of trust within Group A and within Group B as in-groups is not enough for consensus building. However, owing to the strength of trust between A and B as an out-group, society as a whole reaches consensus. In Fig. 9, the ratio of positive values of the confidence coefficient Dij within A and B as in-groups is 70%, and the ratio of positive values of the confidence coefficient Dij between A and B as an out-group is 0%, 30%, 50% and 70%. As in Fig. 8, if the ratio of positive values of the confidence coefficient Dij between group A and group B as an out-group is nonzero, the trajectories of the opinions of the people of A and B are mixed due to the mutual trust relationship. If A and B each have a strong relationship of trust as an in-group and only a weak relationship of trust with each other as an out-group, Group A and Group B each form a separate consensus. However, at TAB = 0.5, the opinions of A and B people become mixed, and at TAB = 0.5, A and B agree on the same opinion. In the case of (d), regardless of whether A or B is considered, the ratio of the trust
Fig. 8. Calculation results of opinion dynamics calculation for the in-group trust TA = TB = 0.55. The out-group trust is set to be (a) TAB = 0.3, (b) TAB = 0.5, (c) TAB = 0.6, and (d) TAB = 0.8. The figures are trajectories of opinion of people in society. The number of people in society is set to be 300.
coefficient Dij being a positive value is 70% as a whole society, so consensus building is formed in the whole society.
4.2 Calculation for the Second Model
The second computational model covers the case where society as a whole is divided into group A and group B, but some members of group B trust the people of group A. This would be the case, for example, at the end of the Trump administration in the United States, when some Republicans were willing to compromise with Democrats. Also consider a referendum asking whether Britain should leave the EU. If some people in Britain want a coalition with the EU, think of A as the continental European countries and B as Britain. Our model calculation resembles this case, in which some of B’s people have strong trust in A. In Fig. 10, the ratio of positive values of the confidence coefficient Dij within group A and within group B as in-groups is 70%, and the trust relationships Dij between A and B as an out-group are all negative. However, the percentages
Fig. 9. Calculation results of opinion dynamics calculation for the in-group trust TA = TB = 0.7. The out-group trust is set to be (a) TAB = 0.0, (b) TAB = 0.3, (c) TAB = 0.5, and (d) TAB = 0.7. The figures are trajectories of opinion of people in society. The number of people in society is set to be 300.
of people in group B2 of B whose confidence coefficient Dij toward group A is positive are 70%, 60%, 50%, and 40%. Looking at the calculation results, when TAB2 = 0.7 (written TB2 in the figure captions), the people in B2 who trust A move close to the opinions of people in A, but when TAB2 is 0.6 or less, only a small number of people in B2 share the opinion of A. Most people in B2 simply align with the consensus of B as an in-group. This is because the values of the out-group confidence coefficient Dij between groups A and B are all negative. In Fig. 11, the ratio of positive values of the confidence coefficient Dij within group A and within group B as in-groups is 50%, and the trust relationships Dij between A and B as an out-group are all negative. However, the percentages of people in group B2 of B whose confidence coefficient Dij toward group A is positive are 30%, 50%, 70%, and 90%. As can be seen in the figure, in this case the ratio of positive values of the confidence coefficient Dij within A and within B as in-groups is 50%, so consensus is not formed within either in-group. Therefore, society as a whole does not reach consensus. However, because of the strong trust that B2 people have
in A, when TAB2 = 0.9, B2 people form consensus with opinions close to those of A people.
Fig. 10. Calculation results of opinion dynamics calculation for the case of B2 group. The in-group trust for A is TA = 0.7. The in-group trust for B is TB = 0.7. The outgroup trust between A and B is TAB = 0.0. The calculations are for various trust value from the group B2 to the group A; (a) TB2 = 0.7, (b) TB2 = 0.6, (c) TB2 = 0.5 and (d) TB2 = 0.4. The number of people in society is set to be 300.
In Fig. 12, the percentage of positive confidence coefficient Dij as an in-group is 70% for A and 30% for B. Therefore, the group A forms a consensus, but the group B does not. However, if the B2 people have a strong trust in A, the B2 people will converge on the same opinion as the group A. In Fig. 13, neither the group A nor the group B is strong enough to reach consensus as an in-group. Therefore, neither A nor B will form a consensus, nor will society as a whole form a consensus. However, when the B2 people have a strong trust in the A people, the B2 people tend to reach consensus with opinions close to those of the A people. However, this is the case for some of the B2 people, and with a strong trust of TAB2 = 0.9.
5 Discussion
In this paper, we have calculated various aspects of a divided society. As shown in Figs. 7 and 13, even if society is divided into two, if people’s trust is not
Fig. 11. Calculation results of opinion dynamics calculation for the case of B2 group. The in-group trust for A is TA = 0.5. The in-group trust for B is TB = 0.5. The outgroup trust between A and B is TAB = 0.0. The calculations are for various trust value from the group B2 to the group A; (a) TB2 = 0.3, (b) TB2 = 0.5, (c) TB2 = 0.7 and (d) TB2 = 0.9. The number of people in society is set to be 300.
strong enough to reach consensus within each in-group, neither group forms consensus, and society as a whole does not form consensus either. In the case of politics, if such a society as a whole does not reach consensus, it can be said to be stable because the division of society does not become a serious conflict. This result supports the conclusion suggested by Ishii and Okano [14]. The most dangerous aspect of social division is the case of Fig. 6, where society divides and each divided group reaches its own consensus. In the case of the American Civil War, this worst kind of conflict occurred. In Fig. 7, consensus has not formed in society as a whole, and society has also not split into two groups that each form their own consensus. However, because there is no relationship of trust between A and B as an out-group, the opinions of the people of A and the opinions of the people of B are separated. This would be close to the conflict between conservatives and liberals in many countries.
Fig. 12. Calculation results of opinion dynamics calculation for the case of B2 group. The in-group trust for A is TA = 0.7. The in-group trust for B is TB = 0.3. The outgroup trust between A and B is TAB = 0.0. The calculations are for various trust value from the group B2 to the group A; (a) TB2 = 0.1, (b) TB2 = 0.3, (c) TB2 = 0.5 and (d) TB2 = 0.7. The number of people in society is set to be 300.
It was also found that if there is strong trust between groups as an out-group, society as a whole can reach consensus beyond the factions, as shown in Fig. 8(d). In addition, as the second model, we calculated a setting in which one of the two groups of the divided society is not united. This situation is common in reality. In this second model, people in B2 hold different opinions depending on their trust in people in A, and B may not function as an in-group in the calculation. A typical example of B not functioning as a faction is shown in Fig. 12. In Fig. 12, because of the low level of trust within B as an in-group, B lacks cohesion, and the people of B2 move toward consensus with the people of A, in whom they have strong trust. This calculation would correspond to a coalition of political parties. In this way, because Ishii’s opinion dynamics include distrust among people, social simulations that address crisis situations such as social division become possible.
Fig. 13. Calculation results of opinion dynamics calculation for the case of B2 group. The in-group trust for A is TA = 0.3. The in-group trust for B is TB = 0.3. The outgroup trust between A and B is TAB = 0.0. The calculations are for various trust value from the group B2 to the group A; (a) TB2 = 0.3, (b) TB2 = 0.5, (c) TB2 = 0.7 and (d) TB2 = 0.9. The number of people in society is set to be 300.
6 Conclusion
In this paper, the Trust-Distrust Model is used to calculate the division of society. We also calculated the case in which, in a divided society, one of the two groups is not tightly bound. The Trust-Distrust Model is a newly proposed theory, but we have found that it can simulate various aspects of society with considerable flexibility.
Acknowledgments. This work was supported by JSPS KAKENHI Grant Number JP19K04881. The authors are grateful for frequent discussions with Prof. M. Nishikawa of Tsuda University, especially on the in-group and out-group problem.
References
1. Ishii, A., Kawahata, Y.: Opinion dynamics theory for analysis of consensus formation and division of opinion on the internet. In: Proceedings of The 22nd Asia Pacific Symposium on Intelligent and Evolutionary Systems (IES2018), pp. 71–76. arXiv:1812.11845 [physics.soc-ph] (2018)
2. Ishii, A.: Opinion dynamics theory considering trust and suspicion in human relations. In: Morais D., Carreras A., de Almeida A., Vetschera R. (eds.) Group Decision and Negotiation: Behavior, Models, and Support, GDN 2019. Lecture Notes in Business Information Processing, vol. 351. Springer, Cham (2019)
3. Ishii, A., Okano, N., Nishikawa, M.: Social simulation of divided society using new opinion dynamics. Front. Phys. 9 (2021). https://doi.org/10.3389/fphy.2021.640925
4. Okano, N., Ishii, A.: Opinion dynamics on a dual network of neighbor relations and society as a whole using the Trust-Distrust model. Submitted to the Proceedings of the 23rd International Conference on Artificial Intelligence (ICAI 2021)
5. Hegselmann, R., Krause, U.: Opinion dynamics and bounded confidence models, analysis, and simulation. J. Artif. Soc. Soc. Simul. 5 (2002)
6. Deffuant, G., Neau, D., Amblard, F., Weisbuch, G.: Mixing beliefs among interacting agents. Adv. Complex Syst. 3, 87–98 (2000)
7. Weisbuch, G., Deffuant, G., Amblard, F., Nadal, J.-P.: Meet, discuss and segregate! Complexity 7(3), 55–63 (2002)
8. Ishii, A., et al.: The ‘hit’ phenomenon: a mathematical model of human dynamics interactions as a stochastic process. New J. Phys. 14, 063018 (2012). (22pp.)
9. Ishii, A., Kawahata, Y.: Opinion dynamics theory considering interpersonal relationship of trust and distrust and media effects. In: The 33rd Annual Conference of the Japanese Society for Artificial Intelligence, p. 33 JSAI2019 2F3-OS-5a-05 (2019)
10. Ishii, A., Yomura, I., Okano, N.: Opinion dynamics including both trust and distrust in human relation for various network structure. In: 2020 International Conference on Technologies and Applications of Artificial Intelligence (TAAI), pp. 131–135. (2020). https://doi.org/10.1109/TAAI51410.2020.00032
11. Berelson, B.R., Lazarsfeld, P.F., McPhee, W.N.: Voting: A Study of Opinion Formation in a Presidential Campaign. The University of Chicago Press, Chicago (1954)
12. Achen, C.H., Bartels, L.M.: Democracy for Realists: Why Elections Do Not Produce Responsive Government. Princeton University Press, Princeton (2016)
13. Tajfel, H., Turner, J.C.: An integrative theory of inter-group conflict. In: Austin, W.G., Worchel, S. (eds.) The Social Psychology of Inter-group Relations, pp. 33–47. Brooks/Cole, Monterey, CA (1979)
14. Ishii, A., Okano, N.: Social simulation of a divided society using opinion dynamics. In: Proceedings of the 2020 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, pp. 660–667 (2020)
Simulation of Intragroup Alignment Using a New Model of Opinion Dynamics
Nozomi Okano1, Hitoshi Yamamoto2, Masaru Nishikawa3, and Akira Ishii1(B)
1 Tottori University, Tottori 680-8552, Japan [email protected]
2 Rissho University, Shinagawa, Tokyo 141-8602, Japan
3 Tsuda University, Kodaira, Tokyo 187-8577, Japan
Abstract. We performed a social simulation using the opinion dynamics theory that incorporates trust and distrust. In this paper, we simulated the invisible primary in the US presidential election as a social simulation. We found that a candidate with strong trust from the voters is advantageous. In addition, when simulating a model with sub-leaders that is closer to the actual invisible primary, we found that candidates who have strong support from many sub-leaders are advantageous. All of the results we obtained are consistent with the findings of previous studies on invisible primaries. Most of the actual results of US presidential elections are consistent with this study’s findings.
Keywords: Opinion dynamics · Invisible primary · Intragroup alignment · Voter · Sub-leader
1 Introduction
In the United States, the presidential campaign officially begins in February of the year the presidential election is held. However, many months before the first primary election, the “invisible primary” begins. There is no agreement among experts as to when the “invisible primary” begins. However, many consider the period between Labor Day and January of the presidential election year to be the “invisible primary.” Election candidates try to gain an advantage in the “invisible primary” to get ahead of the other candidates. At this stage, candidates send out many press releases, try to raise money, and promote themselves [1–6]. During the invisible primary, voters have to choose whom to support from among candidates of the same party. Therefore, the role of the media in influencing public opinion is significant. The news media also have to choose which candidates to cover more heavily and which not to cover. Therefore, money and media matter most during the invisible primary [7–9]. Additionally, a candidate who successfully secures as many party leaders’ endorsements as possible is likely to win the caucuses. The candidate sends a signal to donors, party activists, party organizers, and rank-and-file voters that he/she is viable [10–12].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 453–463, 2022. https://doi.org/10.1007/978-3-030-82193-7_30
N. Okano et al.
FiveThirtyEight collected data on how many endorsements the candidates had secured during the invisible primary [10]. FiveThirtyEight adopts a simple weighting system: 10 points for governors, 5 points for US senators, and 1 point for US representatives. They conducted a historical comparison of cumulative endorsement points in invisible primaries from 1980 until 2016. In both parties, the candidate who won a large majority of support from party leaders during the invisible primary became the presumptive nominee. However, the 2016 election was an anomaly: Donald Trump was declared the presumptive Republican nominee, even though most Republican party leaders rallied around other party leaders during the invisible primary [9,12]. Stiles et al. utilize a threshold model of social interaction to simulate an invisible primary outcome. They conducted a network analysis simulation using three candidates while changing the size of the primary electorate. They indicated that frontrunners were likely to gain advantages over other candidates at the end of the invisible primary under certain conditions. They argue that three requirements must be met for a particular candidate to gain an advantage before the Iowa Caucus: “a sizeable lead in the polls at the onset of the race, an environment in which there is little information decay, and an unwavering base of support.” If any of these requirements is missing, it becomes unclear whether a particular candidate will have an advantage. Therefore, the candidate’s campaign must use tools to inform the voters about the candidate and keep their interest in the candidate [12]. However, Stiles et al. tended to focus on the role of media. Their analysis does not address how much support is needed to survive a primary election. Also, are there any factors other than the role of media that help candidates win an invisible primary? Therefore, we apply a newly developed opinion dynamics model to the invisible primary.
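The endorsement weighting described above amounts to a weighted count. The short sketch below illustrates the arithmetic only; the function name and the example candidate are hypothetical, and no real endorsement data is used.

```python
# Point weights quoted in the text: governors 10, US senators 5,
# US representatives 1.
ENDORSEMENT_WEIGHTS = {"governor": 10, "senator": 5, "representative": 1}

def endorsement_points(endorsers):
    """Sum the weighted points for a list of endorser offices."""
    return sum(ENDORSEMENT_WEIGHTS[office] for office in endorsers)

# Hypothetical candidate endorsed by 2 governors, 3 senators,
# and 12 representatives.
example = ["governor"] * 2 + ["senator"] * 3 + ["representative"] * 12
print(endorsement_points(example))  # 2*10 + 3*5 + 12*1 = 47
```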
The conventional approach to opinion dynamics treats opinions as continuous quantities that change over time, rather than as binary choices to agree or disagree. The representative conventional theory that processes opinion transitions continuously is “the bounded confidence model.” These theories aimed at consensus building [13–18]. In the bounded confidence model theories, the simulation assumes only trust relationships between individuals. In short, the primary purpose of these opinion dynamics theories is consensus building in society. In reality, people’s opinions in societies do not always reach consensus. Opinions can be divided and can remain in disagreement. This is a problem that cannot be handled by the conventional bounded confidence model. Recently, Ishii et al. proposed a new theory of opinion dynamics that deals with human relationships of both trust and distrust [19,20], as a simple extension of the bounded confidence model. Using this theory, Ishii and Kawahata calculated the effect of a charismatic person in the simulation [21]. Ishii and Okano [22,23] calculated the case of people who are distrusted by everyone in the society.
Simulation of Intragroup Alignment
In Sect. 2, we present the theory of opinion dynamics. Section 3 deals with applying the opinion dynamics to actual cases, i.e., the United States’ invisible primaries. The results of the simulations are shown in Sect. 4. In Sect. 5, we review the results and compare them with the findings of prior studies: we confirm that candidates with solid trust from voters are advantageous, and we find that candidates who have strong support from many sub-leaders are advantageous. In conclusion, we discuss how Trump’s victory in the 2016 Republican primary was an exception, while most of the actual election results are consistent with this study’s findings. This paper uses Ishii’s new opinion dynamics to simulate the United States’ invisible primary.
2 Theory
Following Ishii [19,20], we use the equation of opinion dynamics of Ishii and coworkers:

$$\Delta I_i(t) = -\alpha I_i(t)\,\Delta t + c_i A(t)\,\Delta t + \sum_{j=1}^{N} D_{ij}\,\Phi(I_i(t), I_j(t))\,\bigl(I_j(t) - I_i(t)\bigr)\,\Delta t \tag{1}$$
Dij is the coefficient of trust of agent i in agent j. A positive Dij means that agent i trusts agent j. A negative Dij means that agent i distrusts agent j. We assume here that Dij is an asymmetric matrix: in general Dij ≠ Dji, and Dij and Dji can have different signs. The function Φ(Ii(t), Ij(t)) is a sigmoid function that works as a smooth cut-off function at |Ii − Ij| = b:

$$\Phi(I_i, I_j) = \frac{1}{1 + \exp\bigl(\beta(|I_i - I_j| - b)\bigr)} \tag{2}$$

Using this sigmoid function, we assume that if the opinions of the two agents are too far apart, they will not be influenced by each other’s opinions. Moreover, because of the factor Ij(t) − Ii(t), the opinion Ii(t) is not affected by the opinion Ij(t) if the opinion Ij(t) is almost the same as the opinion Ii(t). The characteristic of this opinion dynamics theory is that it inherently incorporates both trust and distrust between people. In this theory, trust does not turn into distrust when opinions move apart. No matter how close their opinions are, two people who are initially set to distrust each other repel each other. This is a convenient theory for simulating conflicts between people who are predetermined to be distrustful, for reasons such as race, religion, or historical background.
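To make the update rule concrete, Eqs. (1) and (2) can be sketched as one explicit time step. This is a minimal illustration under stated assumptions rather than the authors' code: the function names `phi` and `step`, the overflow guard, and all parameter values in the usage lines (α, β, b, Δt) are hypothetical, since the text does not list the values used.

```python
import math

def phi(i_i, i_j, beta, b):
    """Smooth cut-off of Eq. (2): near 1 when |I_i - I_j| < b, near 0 beyond."""
    x = beta * (abs(i_i - i_j) - b)
    if x > 700.0:  # exp would overflow; the cut-off is effectively 0 here
        return 0.0
    return 1.0 / (1.0 + math.exp(x))

def step(opinions, d, alpha, c, a_t, beta, b, dt):
    """One explicit step of Eq. (1); returns the updated opinion list."""
    n = len(opinions)
    new = []
    for i in range(n):
        # Sum over j of D_ij * Phi * (I_j - I_i); the j == i term is zero.
        interaction = sum(
            d[i][j] * phi(opinions[i], opinions[j], beta, b)
            * (opinions[j] - opinions[i])
            for j in range(n) if j != i
        )
        delta = (-alpha * opinions[i] + c[i] * a_t + interaction) * dt
        new.append(opinions[i] + delta)
    return new

# Two mutually trusting agents (hypothetical parameters, no media term).
trusting = [[0.0, 1.0], [1.0, 0.0]]
print(step([1.0, -1.0], trusting, alpha=0.0, c=[0.0, 0.0],
           a_t=0.0, beta=1.0, b=5.0, dt=0.1))
```

With positive Dij the two opinions move toward each other; flipping both entries to −1 makes them move apart, which is the repulsion between distrustful agents described above.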
3 Simulation Model of Intragroup Alignment
In this paper, we propose to apply opinion dynamics to the invisible primaries in the United States. As the first step, consider two models, as shown in Fig. 1. Figure 1(a) schematically illustrates the case where a charismatic candidate (e.g.,
Donald Trump) is trusted by voters (constituents) by the value p. Figure 1(b) is a model representing the case where there are Nsub sub-leaders (e.g., senators, representatives, and governors) who support the candidate. In this model, the sub-leaders are considered to have a deeper trust in the candidate than the voters. The value of their trust is q. Here, it is assumed that the number N of people is not the whole population but the supporters of the political party to which the candidate belongs. Subleaders are people with influence who support a leader in an invisible primary. For example, how sub-leaders act in an invisible primary is vital for the surging growth of a candidate who is going to be selected among many candidates in a primary election.
Fig. 1. Schematic illustration of two models for calculation. (a) is the illustration of the amount of trust from every person in society to the candidate. (b) illustrates the amount of trust from sub-leaders to the candidate and the amount of trust from every person in society to the candidate.
In the concrete calculation, the number N of the target people is 300. The coefficient of confidence Dij among people is determined by a random number in the range −1 to 1. The candidate’s initial opinion is +15, the strength of the candidate is m = 10, and the strength of the voters’ will is m = 1. The connection between people is a random network, and the probability of being connected is 50%. A scale-free network is more realistic as a network of people’s connections, but in that case, whether or not the sub-leader is located at the hub affects the calculation result. Therefore, in the current situation where there is no measured value of the network structure of people’s connection in the actual example of the target, the random network structure is used.
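The setup above can be sketched as follows for the model of Fig. 1(a). This is only an illustration under several assumptions that the text does not specify: the candidate is treated as an extra agent at index 0, every voter is linked to the candidate with the same trust p, the candidate is not influenced by voters, unlinked voter pairs get Dij = 0, and voters start with opinions drawn uniformly from [−1, 1]. The strength parameter m is omitted because it does not appear in Eq. (1) as written. The function name is hypothetical.

```python
import random

def build_candidate_society(n_voters=300, p=10.0, connect_prob=0.5, seed=0):
    """Return (d, opinions) with the candidate at index 0 and voters after it."""
    rng = random.Random(seed)
    n = n_voters + 1
    d = [[0.0] * n for _ in range(n)]
    # Random network among voters; unlinked pairs keep D = 0.
    for i in range(1, n):
        for j in range(i + 1, n):
            if rng.random() < connect_prob:
                d[i][j] = rng.uniform(-1.0, 1.0)  # trust of i in j
                d[j][i] = rng.uniform(-1.0, 1.0)  # trust of j in i, drawn separately
    # Assumption: every voter is linked to the candidate with the same trust p.
    for i in range(1, n):
        d[i][0] = p
    # Assumption: row 0 stays zero, so the candidate is not moved by voters.
    # Candidate starts at +15 (from the text); voter opinions are an assumption.
    opinions = [15.0] + [rng.uniform(-1.0, 1.0) for _ in range(n_voters)]
    return d, opinions

d, opinions = build_candidate_society()
```

Drawing Dij and Dji separately keeps the trust matrix asymmetric, in line with the theory section.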
4 Results
4.1 Trust to a Candidate from Voters
First, the calculation results corresponding to the model of Fig. 1(a) are shown. Let p be the degree of voters’ trust in a candidate in an invisible primary. We set the value of p in the range of −5 to 20 and show the calculation results. Figure 2 shows the results for voters’ trust in a candidate of −5 and +10. At p = −5, the voters react against the candidate, and no one holds an opinion near the candidate’s opinion. In other words, the candidate is avoided by voters because of the negative trust. Conversely, if the trust is p = 10, voters hold opinions similar to the candidate’s. If the reliability is positive, the approval rating is high.
Fig. 2. Calculations for positive and negative trust to a candidate from voters. The left graphs are the opinion trajectories of people in society. The blue trajectory is that of the candidate. The middle graphs are the opinion distribution at time = 10. The right graphs are the number of positive opinions and opposing opinions as a function of time. In (a), the amount of trust from every voter in society to the candidate is p = −5. In (b), the amount of trust from every voter in society to the candidate is p = 10.
Figure 3 shows the relationship between the confidence level p from voters and the percentage of positive opinions. Since the candidate’s opinion is +15, the percentage of positive opinions is the candidate’s approval rating. The numerical value of the approval rating shown here is the average obtained by performing the simulation three times. Since random numbers specify the reliability Dij among voters, the results fluctuate considerably from one calculation to the next. However, it can be seen from the figure that the approval rating is low when the reliability from voters is negative, and the approval rating is high when the reliability is positive.
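The approval rating described above is the share of positive opinions, averaged over repeated runs. A minimal sketch of that bookkeeping follows; the function names are hypothetical, and the three opinion lists are made-up stand-ins for simulation output, not results from the paper.

```python
def positive_ratio(opinions):
    """Fraction of voters whose opinion is strictly positive."""
    return sum(1 for x in opinions if x > 0) / len(opinions)

def mean_approval(runs):
    """Average the positive-opinion ratio over several simulation runs."""
    return sum(positive_ratio(r) for r in runs) / len(runs)

# Three hypothetical final-opinion lists standing in for three runs.
runs = [[2.0, -1.0, 3.0, 0.5], [1.0, 1.0, -2.0, -0.5], [4.0, 2.0, 1.0, -3.0]]
print(mean_approval(runs))  # (0.75 + 0.5 + 0.75) / 3
```

Averaging over only three runs leaves visible run-to-run noise, which matches the fluctuations the text mentions.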
Fig. 3. The calculated value of the ratio of positive opinion as a function of time. Each value is the average of three calculations.
4.2 Sub-leaders
Next, the calculation for the case with sub-leaders, Fig. 1(b), is shown. In this case, in addition to the trust p from voters, the number of sub-leaders Nsub and the value of the trust q that the sub-leaders have in a candidate are important. To compare various Nsub, the following condition is imposed, which fixes the total trust at 3000:

$$300p + qN_{sub} = 3000 \tag{3}$$

Each result is the average of three simulation calculations.
Figure 4 shows the case where Nsub = 150 and the reliability from voters is −5, and the case where Nsub = 10 and the reliability from voters is +5. The reliability between the sub-leaders is set to 10. As shown in Fig. 4, the overall reliability is fixed to 3000. It seems that the larger the number of sub-leaders, the higher the approval rating. The calculation in Fig. 5 confirms this. The approval rating there is also obtained by averaging three simulations. All the calculations in Fig. 5 are done under the condition that Eq. (3) is satisfied. In Fig. 5, although the values fluctuate because of the random numbers, a larger number of sub-leaders is more advantageous for obtaining a higher approval rating.
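Taken literally, the constraint of Eq. (3) fixes the sub-leaders' trust q once p and Nsub are chosen. The sketch below solves the constraint for q; the function name is hypothetical, and the printed q values are implied by Eq. (3) for the Fig. 4 settings rather than quoted from the paper.

```python
def subleader_trust(p, n_sub, n_voters=300, total=3000.0):
    """Solve n_voters * p + q * n_sub = total for q (Eq. 3)."""
    if n_sub <= 0:
        raise ValueError("n_sub must be positive")
    return (total - n_voters * p) / n_sub

print(subleader_trust(p=-5, n_sub=150))  # (3000 + 1500) / 150 = 30.0
print(subleader_trust(p=5, n_sub=10))    # (3000 - 1500) / 10 = 150.0
```

Under this reading, a small group of sub-leaders must hold much stronger individual trust to reach the same total.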
5 Discussion
We applied the opinion dynamics model to the invisible primary of the United States. As a result of the simulation, the following two points were found. (1) Candidates who have the confidence of voters are advantageous. (2) When simulating a model closer to the actual invisible primary, including the sub-leaders,
Fig. 4. Calculations for positive and negative trust to a candidate from voters. The left graphs are the opinion trajectories of people in society. The blue trajectory is that of the candidate. The middle graphs are the opinion distribution at time = 10. The right graphs are the number of positive opinions and negative opinions as a function of time. In (a), the amount of trust from every voter in society to the candidate is set to be p = −5, and the amount of trust between sub-leaders is 10. The number of sub-leaders is 150. In (b), the amount of trust from every voter in society to the candidate is set to be p = +5, and the amount of trust between sub-leaders is 10. The number of sub-leaders is 10.
Fig. 5. The calculated value of the ratio of positive opinion as a function of time. The five curves correspond to the number of sub-leaders of 150, 100, 50, 10, and 5. Each value is the average of three calculations.
candidates who have strong support from many sub-leaders are advantageous. Stiles et al. focused on the role of media. However, the role of sub-leaders is also crucial for a candidate to survive an invisible primary. Except for that point, our findings are consistent with previous studies’ findings on invisible primaries [10–12]. The actual election results are also almost consistent with the findings of this study. The one clear exception is Trump in the 2016 Republican primary, which does not match this pattern at all. Why this is so is an interesting question for future study. As the next step, a simulation that considers the trust that voters have in the sub-leaders who work for a candidate can be considered. Figure 6 and Fig. 7 show an example of a calculation using this model. In this calculation as well, the total value of the reliability is set to 3000:

$$300p + qN_{sub} + 300\,p_s N_{sub} = 3000 \tag{4}$$

Here ps is the amount of trust from every voter to the sub-leaders (Fig. 6). In actual invisible primaries, sub-leaders are political insiders such as federal senators, representatives, and governors. Therefore, it would be interesting to run a simulation in which the sub-leaders are set in more detail according to the actual elections and primaries.
Fig. 6. Schematic illustration of models for calculation, including the amount of trust from all voters to sub-leaders.
The social simulation using the opinion dynamics of this paper is a research method that can simulate the emergence of leaders in society and division caused by multiple leaders, and it is not limited to primary elections.
Fig. 7. Calculations for positive and negative trust in a candidate from voters. The left graphs are the opinion trajectories of people in society. The blue trajectory is that of the candidate. The middle graphs are the opinion distribution at time = 10. The right graphs are the number of positive opinions and negative opinions as a function of time. The amount of trust from every voter in society to the candidate is set to be p = −2, and the amount of trust between sub-leaders is +1. The amount of trust from the sub-leaders to the candidate is set to be 30. The number of sub-leaders is 100. The amount of trust from every voter in society to sub-leaders is set to be p_s = +0.02.
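The budget constraint in Eq. (4) can be checked against the parameter values quoted in the caption of Fig. 7. The sketch below is only an illustration under stated assumptions: it reads the coefficient 300 as the number of voters, q as the trust from each sub-leader to the candidate, and p_s as the trust from each voter to each sub-leader. The function name `sub_leader_trust` is hypothetical and does not come from the paper.

```python
def sub_leader_trust(p, p_s, n_sub, n_voters=300, total=3000):
    """Solve n_voters*p + q*n_sub + n_voters*p_s*n_sub = total for q.

    Assumed reading of Eq. (4): p is voter-to-candidate trust,
    p_s is voter-to-sub-leader trust, q is sub-leader-to-candidate trust.
    """
    return (total - n_voters * p - n_voters * p_s * n_sub) / n_sub


# Values quoted in the Fig. 7 caption: p = -2, p_s = +0.02, 100 sub-leaders.
q = sub_leader_trust(p=-2, p_s=0.02, n_sub=100)
print(q)
```

Under these assumptions the solved value agrees with the sub-leader-to-candidate trust of 30 stated in the caption, which suggests the caption values were chosen to satisfy the constraint.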
6 Conclusion
In this study, as stated in the Introduction, we performed a social simulation using an opinion dynamics theory that incorporates trust and distrust. Specifically, we simulated the invisible primary that precedes the US presidential election. Applying the opinion dynamics model to the invisible primary, we found that a candidate with strong trust from the voters has an advantage. In addition, when simulating a model with sub-leaders, which is closer to the actual invisible primary, we found that candidates who have strong support from many sub-leaders have an advantage. This role of sub-leaders was not noted in the preceding studies. Otherwise, our findings are consistent with the findings of previous studies on invisible primaries. Although they are entirely inconsistent with Trump's case in the 2016 Republican primary, most of the actual election results are consistent with this study's findings. Therefore, the model in this study can be said to simulate the actual invisible primary. We are aware that our research has limitations and room for future improvement. The evaluation was done solely on the proposed methods. Hence, we should work on multiple approaches to validate the results and show our model's advantages.
References
1. Flowers, J., Haynes, A., Crespin, M.: The media, the campaign, and the message. Am. J. Polit. Sci. 47(2), 259–273 (2003)
2. Haynes, A.A., Flowers, J.F., Gurian, P.: Getting the message out: candidate communication strategy during the invisible primary. Polit. Res. Q. 55, 633–652 (2002)
3. Buell, E.H., Jr.: The invisible primary. In: Mayer, W.G. (ed.) Pursuit of the White House. Chatham House Publishers, New Jersey (1996)
4. Hadley, A.T.: The Invisible Primary. Prentice Hall, Englewood Cliffs (1976)
5. Darr, J.P.: Earning Iowa: local newspapers and the invisible primary. Soc. Sci. Q. 100, 320–327 (2019)
6. Kenski, K., Filer, C.R., Conway-Silva, B.A.: Lying, liars, and lies: incivility in 2016 presidential candidate and campaign tweets during the invisible primary. Am. Behav. Sci. 62, 286–299 (2018)
7. Steger, W.P.: Who wins nominations and why? An updated forecast of the Presidential Primary Vote. Polit. Res. Q. 60, 91–99 (2007)
8. Han, L.C.: Off to the (horse) races: media coverage of the 'not-so-invisible' invisible primary of 2007. In: Bose, M. (ed.) From Votes to Victory: Winning and Governing the White House in the Twenty-First Century, pp. 91–116. Texas A & M University Press (2011)
9. Reuning, K., Dietrich, N.: Media coverage, public interest, and support in the 2016 republican invisible primary. Perspect. Polit. 17, 326–339 (2019)
10. Bycoffe, A.: "The Endorsement Primary," The FiveThirtyEight (2020). https://projects.fivethirtyeight.com/2016-endorsement-primary
11. Cohen, M., Karol, D., Noel, H., Zaller, J.: The Party Decides: Presidential Nominations Before and After Reform. University of Chicago Press, Chicago (2009)
12. Stiles, E.A., Swearingen, C.D., Seiter, L., Foreman, B.: Catch me if you can: using a threshold model to simulate support for presidential candidates in the invisible primary. J. Artif. Soc. Soc. Simul. 23 (2020)
13. Deffuant, G., Neau, D., Amblard, F., Weisbuch, G.: Mixing beliefs among interacting agents. Adv. Complex Syst. 3, 87–98 (2000)
14. Weisbuch, G., Deffuant, G., Amblard, F., Nadal, J.-P.: Meet, discuss and segregate! Complexity 7(3), 55–63 (2002)
15. Hegselmann, R., Krause, U.: Opinion dynamics and bounded confidence models, analysis, and simulation. J. Artif. Soc. Soc. Simul. 5 (2002)
16. Duggins, P.: A psychologically-motivated model of opinion change with applications to American politics. J. Artif. Soc. Soc. Simul. 20(1), 13 (2017)
17. Castellano, C., Fortunato, S., Loreto, V.: Statistical physics of social dynamics. Rev. Modern Phys. 81, 591–646 (2009)
18. Sîrbu, A., Loreto, V., Servedio, V.D.P., Tria, F.: Opinion dynamics: models, extensions and external effects. In: Loreto, V., et al. (eds.) Participatory Sensing, Opinions and Collective Awareness. Understanding Complex Systems, 42 pages. Springer, Cham (2017)
19. Ishii, A., Kawahata, Y.: Opinion dynamics theory for analysis of consensus formation and division of opinion on the internet. In: Proceedings of The 22nd Asia Pacific Symposium on Intelligent and Evolutionary Systems, pp. 71–76 (2018). arXiv:1812.11845 [physics.soc-ph]
20. Ishii, A.: Opinion dynamics theory considering trust and suspicion in human relations. In: Morais, D., Carreras, A., de Almeida, A., Vetschera, R. (eds.) Group Decision and Negotiation: Behavior, Models, and Support, GDN 2019. Lecture Notes in Business Information Processing, vol. 351, pp. 193–204. Springer, Cham (2019)
21. Ishii, A., Kawahata, Y.: Opinion dynamics theory considering interpersonal relationship of trust and distrust and media effects. In: The 33rd Annual Conference of the Japanese Society for Artificial Intelligence, 33 JSAI2019 2F3-OS-5a-05 (2019)
22. Okano, N., Ishii, A.: Sociophysics approach of simulation of charismatic person and distrusted people in society using opinion dynamics. In: Sato, H., Iwanaga, S., Ishii, A. (eds.) Proceedings of the 23rd Asia-Pacific Symposium on Intelligent and Evolutionary Systems, pp. 238–252. Springer (2019)
23. Okano, N., Ishii, A.: Isolated, untrusted people in society and charismatic person using opinion dynamics. In: Proceedings of ABCSS2019 in Web Intelligence, pp. 1–6 (2019)
Random Forest Classification with MapReduce in Holonic Multiagent Systems
Michéle Cullinan and Duncan Coulter
University of Johannesburg, Johannesburg, South Africa
Abstract. Multiagent systems are a dominant field of study within artificial intelligence. Holons are a special kind of agent implementation found in multiagent systems which has not received much attention in recent mainstream AI topics such as machine learning. A holon's self-similar structure is both stable and coherent, and it in turn consists of one or more holons. This paper studies how holonism, introduced through recursive modelling techniques, benefits multiagent systems. It also expands the scope of multiagent learning applications by proposing a new architecture for executing Decision Tree and Random Forest machine learning models to solve classification problems; the architecture is novel in terms of both parallelism and extensibility. The models were designed using a structure-based approach to computer programs, with a special focus on recursive structures. The results show that, when applied to a classification problem domain, the algorithms perform consistently with their expected behaviour.
Keywords: Holonic multiagent systems · MapReduce · Decision Trees · Random Forests
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 464–483, 2022. https://doi.org/10.1007/978-3-030-82193-7_31
1 Introduction
The human brain is able to use different methods and multiple regions to learn information. This natural distributed learning mechanism allows information to be more interconnected and embedded in memory. Artificial intelligence (AI) problem-solving techniques such as machine learning (ML) are often inspired by tasks performed by the brain. Another fundamental method, specifically for problems with a distributed nature, is multiagent systems (MAS). Together they produce AI theory that concentrates on group intelligent behaviours emerging from the cooperation of multiple interacting intelligent entities. Knowledge about the problem is divided and shared to develop a solution as a result of coordination, cooperation and sometimes competition. The theory of multiagent learning (MAL) is the application of MAS together with ML algorithms involving multiple agents, as described in [14]. Cooperative
MAL attempts to draw from multiagent theory in a spectrum of areas, including reinforcement learning, evolutionary computation, complex systems, agent modelling and robotics. The technological challenges of today involve tackling complex real-world problems in dynamic and unstable environments, which, according to [19], means that MAL should depend on scalable theories so that agents can handle multiple states in continuous strategy spaces. The predominant field of study in MAL related to machine learning algorithms is reinforcement learning. More specifically, it looks at the connection between reinforcement learning and game theory [19]. Although a lot of progress has been made in understanding MAS-based reinforcement learning, this focus has narrowed the scope of MAL research [19]. Another related field of study is holonic systems (HS). Holonicity is a concept arising from biological holons, which are systems forming a multi-levelled hierarchy of semi-autonomous sub-wholes where the whole becomes more than the sum of its parts due to its emergent properties. In a biological example, the human body contains organs, which are groups of cells that can be divided into even smaller parts called organelles. Each of the components has a variety of functions, none of which can be performed without its sub-components or without reference to the organ that it is part of [5]. MASs can become very complex due to the interactions between their constituent autonomous agents, but when agents are implemented as holons they can assist in a system becoming self-organising [2]. Principles of divide-and-conquer will be used to model separate machine learning tasks by breaking them up into smaller parts which are distributed to other holonic agents. Holarchies are also useful for defining a mixture of autonomous agents and the hierarchical organisation of those agents, which is implemented by using recursion [19].
The models will be designed using a structure-based approach to computer programs, with a special focus on recursive structures. According to [6], HS is considered a general paradigm for distributed intelligent manufacturing control, whereas MAS is regarded as a software technology that can be used to implement this type of holonic system. The coordination of a group of autonomous agents, resulting in intelligent behaviour, is supported by research in subjects such as human autonomy, cooperation among communities, and learning from past experiences [6]. Research into the application of MASs has focused on manufacturing enterprises, ranging from product design to real-time control [6]. Multiagent systems have been shown to support many applications, such as information retrieval, evolutionary algorithms, constraint satisfaction in the timetabling problem, flood forecasting, land use allocation in farming, localising and tracking of objects in motion, modelling road networks in industrial plants, and adaptive network meshing and traffic simulation [2]. Agent-based and holonic system design techniques have also been beneficial in the manufacturing sectors [6]. The intersection of these MAS paradigms, methodologies and fields of research reveals a gap in the research on multiagent systems: applications of holarchies are absent from the field of machine learning, most notably in the space of supervised learning. Since many distributed problems exhibit an inherent structure which may be beneficially mirrored in the relationships between
groups of intelligent agent problem solvers found in multiagent systems, the exploitation of the inherently recursive nature of some learning algorithms could improve their processing adaptability in dynamic and distributed execution environments. The aim of this paper is to develop a novel and extensible learning system represented by multiagent communities and sub-communities, which will contribute to broadening the landscape of MAL research and its applications. The following section gives an overview of related work in the above-mentioned fields of research. Discussions on multiagent systems, multiagent learning and holonic multiagent systems are provided in the Background section. The Materials and Methods section considers the research methodology and the prototype implementation of the presented model, applied to Decision Tree and Random Forest classification. The Results section describes the outcomes of the model evaluation and is followed by a Discussion section. Finally, the conclusion and future work are presented in the Conclusion section.
2 Related Work
Initially, biological modelling research in MAL mostly involved adaptive parallel computation inspired by nature, which included ant systems, flocking or herding behaviour, evolutionary computation, social learning, neural networks, and interaction and imitation learning [19]. This period was followed by studies focused on multiagent learning techniques, dominated by applications of MAL reinforcement learning in a game-theoretic context [19]. As is evidenced by this literature review, reinforcement learning is still the most commonly studied technique for multiagent learning. Multiagent systems that cooperate to achieve self-organisation have been successfully applied to domains in biological modelling, manufacturing and industrial control simulation, e-commerce, networking, robotics, avionics, and flood forecasting. Agents based on self-organisation through holons have been useful in domains such as transportation scheduling, RoboCup, flexible manufacturing systems and business process coordination in virtual enterprises [5]. Holonic multiagent systems were also effective in traffic signal control [1]. The MAS applications drawing on self-organisation discussed in [2] are listed in Table 1. In [7], an algorithm was presented for a cooperative multiagent environment in which agents create a supervised ML model that is used to decide the agents' future actions. The algorithm was applied to the traffic signals domain, learning with a p-concept probability model. In [15], two multiagent learning algorithms were applied in which multiple agents concurrently learn how to better interact with each other, extending two popular multiagent algorithms, namely cooperative co-evolution and multiagent reinforcement learning. Lenient learning was extended to multiagent deep reinforcement learning (DRL) in [13]. Two approaches were then identified for improving parallel reinforcement learning agents, applied to hysteretic Q-learning and to leniency [13].
Table 1. MAS applications drawing on self-organisation
Application | Mechanism
Information Retrieval | Cooperative Information Agents
Timetabling | Cooperation in AMAS
Flood Forecasting | Cooperation in AMAS
Land Use Allocation | Eco-problem
Localisation and Tracking | Reactive MAS
Adaptive Meshing | Holons
Traffic Simulation | Holons, Holarchies
The paper by [3] suggested techniques for centralising the training of deep multiagent RL, using the model-free deep Q-Network as the baseline model together with message sharing between agents. Classical RL is underpinned by a reward function that is the exclusive property of the environment and is only altered by external factors. A paper by [10] introduced a novel peer rewarding system in which agents could deliberately influence each other's reward functions. Likelihood Quantile Networks for coordinating multiagent RL were suggested by [11] to improve performance by making each agent consider the probability that another agent is changing its exploration policy and using that information to weigh the learning rate applied to samples. A new architecture was introduced by [9] that allowed holons to adapt to their environment by relying on the recursive nature of the holarchy, with an example of robot soccer players. The domains of application ranged from manufacturing systems, transport and cooperative systems to radio mesh dimensioning. The goal of this research is to investigate new applications of holarchies in the machine learning domain, considering how holonicity can capture the self-similar properties of certain supervised learning systems. For example, a hierarchical multiagent-based ensemble Random Forest model can be expressed as a community of recursively composed smaller sub-communities consisting of binary tree data structures. The aggregate of these sub-trees produces a predictor with the emergent ability to solve classification problems. Holonic multiagent systems (HMAS) were compared to holonic manufacturing systems in [5]. Furthering this topic, a framework for the analysis and design of HMASs was provided by [17], and the system was implemented with the Madkit platform. In HMASs, agents interact with other agents through a hierarchical structure. Agents were defined as holons playing one of four roles: StandAlone, Head, Part and Multi-Part.
Bernon also showed a holonic approach for self-organisation in multiagent systems in [2]. Agents could self-organise and regulate the system's complexity by playing one of the roles defined above. Algorithms running in imperfect and open real-world environments, for example cloud applications, benefit from
self-organisation. In [9], an architecture with a recursive nature was proposed using holarchies. The solution modelled an artificial immune network as a holonic system with a reinforcement learning mechanism. The architecture proposed in this paper will be implemented as a holonic MapReduce model that can be mapped onto multiagent implementations and that displays the self-similar properties identified in this literature survey. The aim is to produce a model that is scalable and distributable and that can easily perform dynamic execution in a cloud environment.
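The idea of a holon as a recursively composed agent that maps a task over its parts and reduces their answers can be sketched as follows. This is a minimal illustration, not the architecture of this paper: the class name `Holon`, its `predict` method and the threshold classifiers are all hypothetical, and a simple majority vote stands in for the reduce step.

```python
from collections import Counter


class Holon:
    """Hypothetical holon: either a leaf wrapping a classifier,
    or a group of sub-holons exposing the same interface."""

    def __init__(self, classify=None, parts=()):
        self.classify = classify      # callable(row) -> label, used by leaves
        self.parts = list(parts)      # sub-holons, possibly nested

    def predict(self, row):
        if not self.parts:            # leaf holon: answer directly
            return self.classify(row)
        # Map step: every part answers the same query.
        votes = map(lambda part: part.predict(row), self.parts)
        # Reduce step: majority vote over the parts' answers.
        return Counter(votes).most_common(1)[0][0]


# Three toy leaf classifiers, each thresholding the first attribute.
leaves = [Holon(classify=lambda row, t=t: int(row[0] > t)) for t in (0.2, 0.5, 0.8)]
# A holarchy: one sub-community of two leaves plus a stand-alone leaf.
forest = Holon(parts=[Holon(parts=leaves[:2]), leaves[2]])
print(forest.predict([0.9]))
```

Because a group of holons answers through the same `predict` call as a single holon, the caller does not need to know how deeply the structure is nested.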
3 Background
There are two ways in which independent agents in multiagent systems are programmed to achieve certain goals: the agents can either cooperate with one another or compete against each other [19]. MASs are well suited to distributed computation, which adds to their complexity, and so it becomes necessary for them to have a mechanism for self-organisation. This may involve grouping to form societies, resulting in coherent coordination emerging in their behaviour [2]. Mechanisms for generating self-organisation can be broadly categorised as central control, partial control or completely decentralised control. Based on these control mechanisms, systems can be further categorised as Reactive MASs, Cooperative Information Agents, Adaptive MASs, or Holonic MASs [2]. Holarchies, which are significant to the model proposed in this paper, allow for an additional level of MAS sophistication. They are used to organise multiagent systems as recursive self-similar entities. This discussion is elaborated on in the HMAS section.
3.1 Multiagent Learning
Multiagent learning problems consider the design of algorithms involving agents that are situated in a stochastic game and must learn optimal behaviour [19]. One of the criteria for classifying MAL techniques is the learning environment. Other criteria are an agent's awareness of its environment and the use of models to learn. According to [19], the field of MAL and its classifications of MAL techniques are too limited. Most multiagent frameworks are described in terms of stochastic or Markov games [19]. Reinforcement learning (RL) and its family of techniques is the most studied in MAL. Simply defined, RL is based on the observation that rewarding the desirable behaviour of an agent and punishing undesirable behaviour leads to a behaviour change which contributes to the agent achieving its objectives. The reinforcement feedback is usually coded as a scalar value, and the agent seeks to maximise the reinforcement it expects to receive [19]. Research in MAL started in a field called artificial life, which involved adaptive parallel computation inspired by nature. Techniques explored were ant
systems and steering behaviours, evolutionary computation, social learning, neural networks, and interactive and imitation learning [19]. Following that, the field of study became dominated by reinforcement learning MAL techniques. Some common types of algorithms in MAL system implementations are computational, descriptive, normative, prescriptive cooperative or non-cooperative [19]. MAL techniques can also be classified according to the type of learning, as shown in Table 2.
Table 2. Types of MAL techniques from [19]
Type of Learning | Description
Multiplied learning | Several agents learning independently of one another
Divided learning | Learning tasks are divided among one or more agents with the same known goal
Interactive learning | Agents share the work in a single learning task
There are a few learning paradigms related to the concept of MAL that have not received enough attention according to [19]. Two of them are Multistrategy learning and Parallel Inductive learning. Multistrategy learning combines two or more learning strategies into the learning system, and Parallel Inductive learning studies the exploitation of the inherent parallelism in many learning algorithms in order to scale to more complex problems. The main goal of this paper is to explore a unified approach to developing holonic multiagent learning systems. The two above-mentioned techniques were selected to validate the approach due to their hierarchically self-similar nature.
3.2 Holonic Multiagent Systems
Biological holons are systems forming a multi-levelled hierarchy of semi-autonomous sub-wholes where the whole becomes more than the sum of its parts due to its emergent properties. Multiagent systems can become very complex due to the interactions between the autonomous agents; implementing agents as holons therefore allows a system of agents to become self-organising [2]. A holon is implemented as a special case of an agent that is composed of sub-agents with the same structure. At a low level, both are simple model building blocks with a flow of incoming and outgoing data. According to [6], holonic systems (HS) can be considered a general paradigm for distributed intelligent manufacturing control, whereas MAS is regarded as a software technology that can be used to implement this type of holonic system. The paper by [5] presents a general framework for holonic multiagent systems with multiple advantages. The model preserves compatibility with standard
multiagent systems by representing each agent as a holon. The complexity of a group of agents is encapsulated as a super holon, and the number of agents active in the holon does not change its communication with other agents in the system. The flexible and variable nature of cooperating agent systems allows for a design which can change at run-time [5]. According to Fischer, there is currently no existing programming construct that can be used to design such an object-oriented programming styled approach. He provides terminology and a methodology in the context of HMASs for recursive modelling with multi-agency which can handle a dynamic runtime.
3.3 Decision Trees and Random Forests
The problem-solving mechanism in supervised learning models involves a portion of the data called the training set, which is a set of attribute measurements together with their observed outcomes. This data can then be used to create a prediction model, also referred to as a learner, which can predict the outcome for unseen data. The following section discusses the Decision Tree supervised learning algorithm. Classification trees provide the building blocks for ensemble methods such as Random Forests, which are covered in more depth in the remainder of this background section.
Decision Trees. A classification tree is a supervised learning algorithm. The model is built with input data from a training dataset. Each data point in the training set consists of an input vector X and its corresponding output value y, producing an example (X, y). X consists of a set of attribute values $x_i$, i = 1, 2, ..., n, and y is a single observed outcome value. To make a prediction, the Decision Tree function takes an unseen example from the testing dataset and produces a "decision", which is a single output value $y_j$ representing a class [18]. The predicted output is then compared to the actual output in order to determine the model's accuracy. A Decision Tree learner is implemented as a binary tree data structure. Each internal node represents a single input variable from the training dataset, and any incoming data are recursively partitioned along some path in the tree. The tree uses split points to perform tests on the input data. The split points are calculated from the cost function and are chosen to minimise the cost of creating a branch for that attribute. Branches terminate at a leaf node representing the output value. The process of growing a tree takes a greedy approach [8]. The splitting method calculates which attribute, referred to as the "important" attribute, minimises the cost of creating an internal node at that position in the tree structure.
Good trees are ones in which the branches and sub-branches are short, making them easy to interpret. After adding a node, the branched-off tree that follows can be seen as a new sub-model created with a smaller training set and effectively considering fewer attributes [18]. Figure 1 shows how Decision Trees are defined by recursively
partitioning the input data into localised regions, and defining a sub-model in each resulting region [8]:
Fig. 1. Decision Tree created by recursive binary splitting adapted from [8]
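The recursive partitioning described above can be illustrated with a small hand-built tree. The sketch below assumes a hypothetical node layout in which an internal node is a dict with the keys `index` (attribute position), `value` (split point), `left` and `right`, and a leaf is simply a class label; the paper's own data structure may differ.

```python
# Hypothetical tree: split on attribute 0 at 2.5, then on attribute 1 at 1.0.
tree = {
    "index": 0, "value": 2.5,
    "left": 0,                                     # leaf: class 0
    "right": {"index": 1, "value": 1.0, "left": 0, "right": 1},
}


def predict(node, row):
    """Follow one path from the root to a leaf and return its class label."""
    if not isinstance(node, dict):                 # reached a leaf
        return node
    branch = "left" if row[node["index"]] < node["value"] else "right"
    return predict(node[branch], row)              # recurse into the sub-tree


print(predict(tree, [1.0, 5.0]))   # goes left at the root
print(predict(tree, [3.0, 2.0]))   # goes right, then right again
```

Each recursive call receives a sub-tree that is itself a complete Decision Tree over a smaller region of the input space, which is the self-similar property the paper relies on.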
Random Forests. As briefly mentioned, Random Forests are an ensemble method derived from Decision Trees with hierarchical recursive properties. The advantage of Decision Trees in machine learning is that the algorithm is quite easy to understand and the results are easy to interpret. Depending on the dataset, however, they are not always the most accurate, especially where the data is high-dimensional. Decision Trees are also known to have the following disadvantages [8]:
– they do not provide very accurate predictions, due to the greedy nature of the tree construction algorithm
– they are unstable and, due to their hierarchical nature, can change drastically after small changes in the data
– they can overfit the data
In statistical terms, Decision Trees are high-variance estimators. Random Forests are a method for reducing this variance by calculating a final prediction from the average of many weaker predictions, each of which has been calculated over a subset of the data [8]. This technique is called bagging, or bootstrap aggregation: subsets of the dataset are drawn randomly with replacement to create many weak learners, whose predictions are then averaged. Random Forests are good at reducing the variance because the way in which the learning data is selected decorrelates the weak tree learners. These ensembles have shown very good predictive accuracy, although they do become more difficult to interpret [8].
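The two ingredients of bagging, sampling with replacement and aggregating the weak learners' outputs, can be sketched as follows. The helper names `bootstrap_sample` and `majority_vote` are hypothetical, and a majority vote is used here as the classification analogue of averaging; this is an assumption about how the predictions are combined, not a statement of the paper's implementation.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) examples uniformly at random, with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine the class labels predicted by the individual trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)                       # fixed seed for repeatability
data = [([0.1], 0), ([0.4], 0), ([0.6], 1), ([0.9], 1)]

# One bootstrap sample per tree; each tree would be trained on its own sample.
samples = [bootstrap_sample(data, rng) for _ in range(3)]
print([len(sample) for sample in samples])

# Suppose three trained trees voted 1, 0 and 1 for some unseen example.
print(majority_vote([1, 0, 1]))
```

Because each sample is drawn with replacement, some examples appear several times and others not at all, which is what makes the individual trees differ from one another.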
4 Materials and Methods
The following section describes the proposed model. The problem was divided into sub-tasks and distributed to decentralised problem solvers using a MapReduce-derived approach. A prototype was implemented to demonstrate the parallelism and flexibility of a holonic MAS interpretation, specifically applied to supervised ML. This research attempts to test the hypothesis that holonicity in recursive modelling can be realised by a suitable self-similar multiagent learning system. Decision Trees and Random Forests were the approaches chosen to explore the model's ability to solve problems. The prototype validates the model using classification datasets. The proposed model exploits the benefits arising from holonic self-similarity in multiagent systems by leveraging the hierarchical recursive structure of Decision Tree machine learning. The algorithm works by splitting a machine learning problem into modular algorithmic components, serialising each algorithm as a string and then solving the classification problem by recursive MapReduce. The algorithm was also run in a virtual cluster environment using a container orchestration tool. The model was implemented in the Python programming language, with Decision Trees and Random Forests adapted from [4]. One of the complexities of distributed computation is transferring code as data, specifically serialising the recursive inner functions found in the Decision Tree and Random Forest algorithms. This problem was addressed in the model by using the Y-Combinator methodology, which is discussed in the following section.
4.1 Y-Combinator
In order to create a distributable and deployable model using multi-agency, and that can change its learning algorithm dynamically, the model components need to be serialisable. A challenge is found in using a language like Python to serialise recursive inner functions. If a recursive function is defined containing another recursive function as an inner function, then in order for the function to be serialisable it must be rewritten in combinator form and then have a Y-Combinator function applied to it [12]. Serialisation is the function of converting data from an in-memory construct into any linear sequence, for example, a string. This conversion also needs the ability to be undone by the function of deserialisation. The aim of this paper is to explore whether complex ML algorithms, in the context of holonic multiagent systems, can be encoded as a string to be later unpacked and evaluated either on the cloud, remotely, or by some other distributed architecture. Dill is a common Python library with methods for serialising and deserialising most Python functions. The library fails, however, to serialise recursive inner functions as demonstrated in [12] because it produces circular references in its closure. A way to resolve the problem is by manually removing the recursion from the function using concepts of functional programming [12]. A function’s “free
variables” are those not local to the function. Combinators are a class of functions without free variables. Functions in Python have the property of being first class objects which means that they can be passed into another function as a parameter. There is a function called the Fixed-Point function defined as: def Y( f ) : return f (Y( f ) ) where Y is a special combinator function, being applied to a function f. Given a function h applied to f, which returns a new function g: h(f ) = g
(1)
From the fixed-point property, there exists a fixpoint f’ of h where: f = Y (h) h(f ) = f
(2)
Suppose a recursive function r is redefined in its combinator form using h as follows:

    def h(f):
        def r(...):
            ... f(...) ...
        return r

Then the function r′, which is the non-recursive counterpart of r, can be obtained from a transformation with the Fixed-Point function:

$$r' = Y(h) = r \tag{3}$$
Now, although the recursion has been removed from r′ and h is a combinator, Y(h) is still recursive. It therefore becomes necessary to introduce the Y-Combinator function, which further removes recursion from the Fixed-Point combinator Y by means of lambda expressions. The Y-Combinator function is defined as follows:

    def y_combinator(f):
        return (lambda x: x(x))(
            lambda y: f(lambda *args, **kwargs: y(y)(*args, **kwargs)))
M. Cullinan and D. Coulter
The function defined above has the following properties [12]:

– it is a counterpart of Y
– it has no free variables and is a combinator

Finally, the recursive function r can be converted into a serialisable function by applying the Y-Combinator to the r-combinator defined by h:

$$\text{y\_combinator}(h) = r' \tag{4}$$
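The transformation in Eqs. (1) to (4) can be illustrated with a small, self-contained sketch. The factorial function and the names `factorial_combinator` and `recurse` are hypothetical and chosen only for illustration; they do not come from the paper, which applies the same idea to the Decision Tree predict function.

```python
def y_combinator(f):
    # Fixed-point combinator built only from lambdas, so it never
    # refers to itself by name.
    return (lambda x: x(x))(
        lambda y: f(lambda *args, **kwargs: y(y)(*args, **kwargs))
    )


def factorial_combinator(recurse):
    # Combinator form: the recursive call goes through the `recurse`
    # parameter instead of the function's own name.
    def inner(n):
        return 1 if n <= 1 else n * recurse(n - 1)
    return inner


# The non-recursive counterpart r' = y_combinator(h) from Eq. (4).
factorial = y_combinator(factorial_combinator)
print(factorial(5))  # 120
```

The inner function has no reference to its own name, which is the property the text relies on for serialisation; whether a given serialisation library accepts the result should still be checked in practice.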
In the original Decision Tree's recursive predict function, predict references itself internally and is therefore classified as a free variable. To transform the recursive predict function into its non-recursive counterpart, the internal function reference must be replaced by another inner function as follows:

    def make_predict_combinator():
        def predict_combinator(predict):
            def inner(node, row):
                if row[node['index']] < node['value']:
                    return predict(...)
                else:
                    ...
            return inner
        return predict_combinator

The newly created version of the predict function, in combinator form, is then made serialisable and deserialisable by applying the y_combinator function to it as follows:

    predict_combinator = make_predict_combinator()
    predict_serializable = y_combinator(predict_combinator)

4.2 Decision Tree Classification
The proposed holonic MapReduce model was first evaluated on a Decision Tree classification problem. For binary classification problems, a split in the dataset separates the training examples into two groups, one for each class. The cost function used to calculate the node split points in the dataset is the Gini index. A Gini score is a measure of split purity, which is a function of the proportion of each class present in the two resulting groups of training data. Splitting stops when the size of a group reaches the minimum number allowed or the tree has grown to its maximum depth [4].

$$\text{gini} = \bigl(1.0 - (\text{proportion} \times \text{proportion})\bigr) \times (n \div N) \tag{5}$$

where n is the group size and N is the total number of samples.
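The Gini computation in Eq. (5) can be sketched as follows. This is a minimal illustration in the style of the from-scratch implementation referenced in [4], not the authors' code; it assumes that each row stores its class label in the last position and that the squared proportions are summed over the classes within each group.

```python
def gini_index(groups, classes):
    """Weighted Gini index of a candidate split.

    groups  -- list of groups; each group is a list of rows whose last
               element is the class label (assumed layout)
    classes -- list of the distinct class labels
    """
    total = sum(len(group) for group in groups)  # N: total samples
    gini = 0.0
    for group in groups:
        size = len(group)  # n: group size
        if size == 0:
            continue  # skip empty groups to avoid division by zero
        labels = [row[-1] for row in group]
        # Sum of squared class proportions within this group.
        score = sum((labels.count(c) / size) ** 2 for c in classes)
        gini += (1.0 - score) * (size / total)
    return gini


pure = [[[2.7, 0], [1.3, 0]], [[7.5, 1], [9.0, 1]]]
mixed = [[[2.7, 0], [1.3, 1]], [[7.5, 0], [9.0, 1]]]
print(gini_index(pure, [0, 1]), gini_index(mixed, [0, 1]))  # 0.0 0.5
```

A perfectly pure split scores 0.0, while an evenly mixed binary split scores 0.5, so lower values indicate better split points.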
The decision tree learner parameters for the model are defined in Table 3.

Table 3. Decision Tree learner parameters

Parameter              Value
Number of data folds   5
Maximum tree depth     6
Minimum size           2
To transfer the serialised model components to the ensemble orchestrator, they were encoded and stored as nested recursive structures and passed in the model parameters. The model parameters are received as a seed string, which the ensemble orchestrator is responsible for unpacking and distributing across the nodes. The model parameters for holonic Decision Tree classification are presented in Table 4.

Table 4. Holonic MapReduce model parameters for Decision Tree classification

Parameter            Value
Dataset              ML dataset
Learner parameters   Data folds, tree depth, tree size, sample size, number of trees
Algorithm            name: accuracy metric, reduce function: serialised accuracy metric,
                     algorithm:
                       name: decision tree, map function: serialised decision tree,
                       algorithm:
                         name: predict, reduce function: serialised predict

4.3 Random Forest Classification
A recursive extension to MapReduce, as discussed in [16], was used to implement the holonic multiagent system. Unlike the usual approach to MapReduce, large-scale recursions are implemented as iterated MapReduce jobs, which addresses the problem that a recursive task cannot simply be restarted if it fails. If recursive tasks followed a restart policy, there could be cases where no task ever receives any input and therefore no task produces output for the next one. The Random Forest algorithm was evaluated as an iterated application of encoded learner algorithms defined by bootstrap aggregation, with each step of the recursion being a MapReduce job.
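Bootstrap aggregation itself can be sketched in a few lines. The following is an illustrative outline only: the function names `subsample`, `max_predict` and `bagging_predict` are assumptions modelled on the component names used in this section, and each learner is assumed to be a callable that maps a row to a class label.

```python
import random
from collections import Counter


def subsample(dataset, ratio, rng):
    """Draw a bootstrap sample (with replacement) of ratio * len(dataset) rows."""
    n_sample = round(len(dataset) * ratio)
    return [rng.choice(dataset) for _ in range(n_sample)]


def max_predict(predictions):
    """Reduce step: return the most frequent prediction (majority vote)."""
    return Counter(predictions).most_common(1)[0][0]


def bagging_predict(learners, row):
    """Map every learner over the row, then combine the votes."""
    return max_predict([learner(row) for learner in learners])


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = subsample(list(range(10)), 0.5, rng)
print(len(sample))  # 5
```

In this reading, drawing the samples and training one tree per sample is the map step, and the majority vote is the reduce step.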
The Random Forest learner parameters for the model are defined in Table 5.

Table 5. Random Forest learner parameters

Parameter              Value
Number of data folds   5
Maximum tree depth     6
Minimum size           2
Sample size            0.5
Number of trees        5
Here the serialised model components were provided as a more deeply nested structure, while keeping the way the model parameters are defined and passed to the ensemble orchestrator uniform. This allows extensibility in streaming information to models deployed on the cloud. Model parameters for the holonic Random Forest classification are presented in Table 6.

Table 6. Holonic MapReduce model parameters for Random Forest classification

Parameter            Value
Dataset              ML dataset
Learner parameters   Data folds, tree depth, tree size, sample size, number of trees
Algorithm            name: accuracy metric, reduce function: serialised accuracy metric,
                     algorithm:
                       name: bootstrap, map function: serialised bootstrap,
                       algorithm:
                         name: max predict, reduce function: serialised max predict,
                         algorithm:
                           name: decision tree, map function: serialised decision tree,
                           algorithm:
                             name: predict, reduce function: serialised predict

4.4 System Components
Designing the model architecture as a nested recursive structure reduced complexity when unpacking the parameters and applying the serialised components to solve the problem. The learning problem was defined in terms of nested recursion in which each node performs a level of computation, but the components deeper down the hierarchy are less complex, so that as the problem unpacks into its smaller components each holon has a smaller, self-contained problem to solve. The model architecture was achieved using a recursive extension to MapReduce, with computations based on successive calls to a Mapper and a Reducer, each component returning results to its parent holon. Figure 2 illustrates the model's activity diagram.
Fig. 2. Holonic ensemble orchestration architecture derived from MapReduce
The model consists of major and minor components. The major components make up the aspects of the MapReduce paradigm, which is also used as an interpretation of the ensemble method. The ensemble orchestrator component is called App. It receives the initial data, the ML algorithm learner parameters, and a nested dictionary structure of strings containing the serialised functions that make up the required ML algorithms. The App component unpacks the model parameters and starts the recursive evaluation of the function sequence. The Reducer component deserialises incoming reduce functions and applies them to the streamed data. The Mapper component deserialises map functions and applies them to the data. As is usual in a MapReduce system, the components subscribe to manager-worker roles: the App component is the manager, and the Reducer and Mapper components are workers. This format also allows the problem-solving mechanism to be parallelised. The App, Reducer and Mapper components communicate via requests. Internally, the components also run their received functions concurrently as tasks using the Python asynchronous model. Figure 3 shows a visualisation of the App, Mapper and Reducer cloud components.
Fig. 3. Holonic multiagent MapReduce cloud components
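The unpacking and recursive evaluation performed by the App, Mapper and Reducer components can be sketched in a single process. Everything below is a simplified assumption rather than the authors' implementation: the key names `map_function`, `reduce_function` and `algorithm` mirror the labels in Tables 4 and 6, the functions travel as Python source strings, and `eval` stands in for deserialisation (it is unsafe on untrusted input and is used here only to keep the sketch free of third-party libraries).

```python
def deserialise(source):
    # Assumption: a "serialised" function is a string holding a lambda
    # expression. A real system would use a proper serialisation library.
    return eval(source)


def evaluate(job, data):
    """Recursively evaluate a nested job description on the data."""
    inner = job.get("algorithm")
    if "map_function" in job:
        # Mapper role: apply the function to every item, then hand the
        # results to the nested job if there is one.
        fn = deserialise(job["map_function"])
        results = [fn(item) for item in data]
        return evaluate(inner, results) if inner else results
    # Reducer role: let the nested job process the data first, then fold
    # its results into a single value.
    fn = deserialise(job["reduce_function"])
    results = evaluate(inner, data) if inner else data
    return fn(results)


# Hypothetical two-level job: square every item, then sum the squares.
job = {
    "name": "total",
    "reduce_function": "lambda xs: sum(xs)",
    "algorithm": {"name": "square", "map_function": "lambda x: x * x"},
}
print(evaluate(job, [1, 2, 3]))  # 14
```

Changing the learning algorithm then only requires passing a different nested dictionary, which matches the uniformity the section describes.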
Figure 4 illustrates the sequence diagram of the holonic MapReduce model, showing the system's recursive MapReduce jobs.
Fig. 4. Holonic multiagent MapReduce sequence diagram
5 Results
The following plots were generated by the running system for each of the two problems. These results were compared to those obtained from running the original Python implementation in order to show that the Decision Tree and Random Forest algorithms still work after the application of the holonic multiagent architecture. The model was tested on the banknote dataset discussed in [4] to predict whether a given banknote is real based on data collected from banknote photos. It contains 1,372 rows with 5 numeric variables. It is a simple binary classification problem where the output y ∈ {0, 1}. Each example in the dataset has the following attributes [4]. All the values are continuous except the integer class value:

– variance of Wavelet Transformed image
– skewness of Wavelet Transformed image
– kurtosis of Wavelet Transformed image
– entropy of image
– class value
The fold number in the graphs represents the data subsets used in cross-validation training and testing to build the model. The bar chart in Fig. 5 shows the accuracy of the holonic MapReduce model applied to a Decision Tree learning algorithm compared to the original Decision Tree model. Both models achieve very high accuracy scores on the banknote classification dataset, with the holonic model sometimes performing slightly better.
Fig. 5. Decision Tree holonic MapReduce model accuracy
Figure 6 compares the accuracy of the holonic MapReduce Random Forest model in Fig. 6a to the original Random Forest model in Fig. 6b. The model instances were created containing 3, 5, 10, and 15 trees, respectively.
Fig. 6. (a) Holonic MapReduce Random Forest model accuracy. (b) Original Random Forest model accuracy
Figure 7 shows the mean accuracy of the holonic MapReduce Decision Tree (containing a single tree) and Random Forest (varying instances containing 3, 5, 10 and 15 trees, respectively) models compared to the original models. The holonic implementation performs slightly better than the original model except for the Random Forest instance containing 10 trees.
Fig. 7. Mean accuracy of the Decision Tree and Random Forest holonic MapReduce model
6 Discussion
The results in the previous section indicate that both the holonic MapReduce Decision Tree and Random Forest algorithms are running correctly. Applying the models to the problem of classifying data from the banknote dataset produces very high accuracy, which is expected since it is a small dataset with few attributes. Building the models using cross-validation is one of the parallelisable aspects of Decision Trees and Random Forests exploited by the model, which improves the original algorithm by making it able to scale in a distributed environment such as a multi-node compute cluster. Using the holonic MapReduce architecture as an implementation mapping onto a multiagent system makes the model flexible and easily deployable. The difference between creating the Decision Tree and the Random Forest model instance is determined only by the definition of the model parameters. The Random Forest requires a more deeply nested structure of serialised algorithms than the Decision Tree, but the main model architecture remains unchanged between processing the different models.
7 Conclusion
This paper investigates the recursive nature of some machine learning algorithms, which lends itself to holonic MAS design and results in a flexible, reusable, and extensible MAS model for supervised learning problems. It also incorporates the Parallel Inductive learning domain in MAL, which studies the ability to scale complex learning using the inherent parallelism in certain algorithms. The model parameters were serialised and passed uniformly to the model, which simplifies complex distributed computation when streaming information to models deployed on the cloud. Large models can be computed by spreading their evaluation across nodes for speed and by leveraging the parallelisation found in the ensemble Random Forest algorithm. A supervised learning problem was computed as a MapReduce system by the recursive computation of serialised model components, as verified by the consistent results when applied to Decision Tree and Random Forest classification. Future work aims to incorporate into the model software languages with specific features that can be leveraged for implementing self-similar agents. The objective would be to reduce the complexity introduced in serialising and deserialising recursive algorithms so that holonicity can be applied to more complex learning approaches, such as deep learning.
References

1. Abdoos, M., Mozayani, N., Bazzan, A.: Holonic multi-agent systems for traffic signals control. Eng. Appl. Artif. Intell. 26, 1575–1587 (2013)
2. Bernon, C., Chevrier, V., Hilaire, V., Marrow, P.: Applications of self-organising multi-agent systems: an initial framework for comparison. Informatica 30, 01 (2006)
3. Bhalla, S., Subramanian, S.G., Crowley, M.: Training cooperative agents for multiagent reinforcement learning. In: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2019, Richland, SC, pp. 1826–1828. International Foundation for Autonomous Agents and Multiagent Systems (2019)
4. Brownlee, J.: How to implement bagging from scratch with Python (2019)
5. Fischer, K., Schillo, M., Siekmann, J.: Holonic multiagent systems: a foundation for the organisation of multiagent systems. In: Mařík, V., McFarlane, D., Valckenaers, P. (eds.) HoloMAS 2003. LNCS (LNAI), vol. 2744, pp. 71–80. Springer, Heidelberg (2003). https://doi.org/10.1007/978-3-540-45185-3_7
6. Giret, A., Botti, V.: Holons and agents. J. Intell. Manuf. 15, 645–659 (2004). https://doi.org/10.1023/B:JIMS.0000037714.56201.a3
7. Goldman, C.V., Rosenschein, J.S.: Mutually supervised learning in multiagent systems. In: Weiß, G., Sen, S. (eds.) IJCAI 1995. LNCS, vol. 1042, pp. 85–96. Springer, Heidelberg (1996). https://doi.org/10.1007/3-540-60923-7_20
8. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer Series in Statistics, Springer, New York (2001). https://doi.org/10.1007/978-0-387-21606-5
9. Hilaire, V., Koukam, A., Rodriguez, S.: An adaptative agent architecture for holonic multi-agent systems. ACM Trans. Auton. Adapt. Syst. 3(1), 2:1–2:24 (2008)
10. Lupu, A., Precup, D.: Gifting in multi-agent reinforcement learning. In: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, Richland, SC, pp. 789–797. International Foundation for Autonomous Agents and Multiagent Systems (2020)
11. Lyu, X., Amato, C.: Likelihood quantile networks for coordinating multi-agent reinforcement learning. In: Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS 2020, Richland, SC, pp. 798–806. International Foundation for Autonomous Agents and Multiagent Systems (2020)
12. O'Regan, E.: Serialising functions in Python (2016)
13. Palmer, G., Tuyls, K., Bloembergen, D., Savani, R.: Lenient multi-agent deep reinforcement learning. CoRR, abs/1707.04402 (2017)
14. Panait, L., Luke, S.: Cooperative multi-agent learning: the state of the art. Auton. Agent. Multi-Agent Syst. 11(3), 387–434 (2005). https://doi.org/10.1007/s10458-005-2631-2
15. Panait, L., Sullivan, K., Luke, S.: Lenient learners in cooperative multiagent systems, pp. 801–803, January 2006
16. Rajaraman, A., Leskovec, J., Ullman, J.D.: Mining Massive Datasets (2014)
17. Rodriguez, S., Hilaire, V., Koukam, A.: Towards a methodological framework for holonic multi-agent systems. In: Fourth International Workshop of Engineering Societies in the Agents World, 29–31 October 2003, pp. 179–185. Imperial College London (2003)
18. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Prentice Hall Press, Upper Saddle River (2009)
19. Tuyls, K., Weiss, G.: Multiagent learning: basics, challenges, and prospects. AI Mag. 33(3), 41 (2012)
Monitoring Goal Driven Autonomy Agent's Expectations Generated from Durative Effects

Noah Reifsnyder and Hector Munoz-Avila

Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, USA
{ndr217,hem4}@lehigh.edu
Abstract. One of the crucial capabilities for robust agency is self-assessment, namely, the capability of the agent to compute its own boundaries. A method of assessing these boundaries is using so-called expectations: constructs defining the boundaries of an agent's courses of action as a function of the plan, the goals achieved by that plan, the initial state, the action model and the last action executed. In this paper we redefine four forms of expectations from the goal reasoning literature but, unlike those works, the agent reasons with durative actions. We present properties and a comparative study highlighting the trade-offs between the expectations.

Keywords: Goal reasoning · Expectations · Durative actions · Continuous domains
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 484–498, 2022. https://doi.org/10.1007/978-3-030-82193-7_32

1 Introduction

There has been increasing interest in AI safety, that is, in creating reliable AI agents that perform within their boundaries. With the increasing sophistication of autonomous systems, unexpected situations may arise. This happens when the agent and/or the environment in which the agent is operating behaves in a way that is inconsistent with the agent's planning knowledge. Goal-driven autonomy (GDA) agents supervise the execution of their current plan and formulate new goals when discrepancies arise between the agent's expectations of the actions in the plan executed so far and the resulting state. To detect discrepancies, the agent generates a set of expectations X(π, a) as a function of the next action a to be executed in the plan π. The agent can then check whether this expectation is satisfied in the state s.

Naively, for computing the expectation X(π, a) it would seem sufficient for the agent to simply check whether the preconditions of a are satisfied in s (i.e., to define X(π, a) = "the preconditions of a"). However, an agent using expectations defined this way would not be checking the plan trajectory in any way. Similarly, we could project the state s in which the action a is to be executed from the initial state s0 using π (i.e., define X(π, a) = s). Expectations are often generated this way in the goal reasoning literature [1, 10, 13]. The issue here is that there are often variables in the state that have no bearing on π, and thus altering them will not impact the agent's execution. Defining the expectations over the entire state would raise discrepancies in situations where they are not needed. Researchers have observed that the notion of expectations plays a key role in the resulting performance of goal reasoning agents [6, 7].

A real-world example of the use of expectations is as follows. John recently bought a new car, which was advertised with an average of 32 miles per gallon. However, after John has driven for a while and filled up the tank for the first time, he notices that he only got 15 miles to the gallon. This would be a cause for concern and would likely result in him taking the car back to the dealer to be checked, since it is possible that there is something wrong with the car, such as a fuel leak. This is the goal of expectations: to identify problems using given state information.

In this paper we report on our studies computing expectations when the actions are durative. This means actions have continuous effects over some segment of time. For instance, the agent may control a camera to follow a target while turning around 20° on a swivel. Performing this action requires some time to complete as a function of the rotation speed of the camera. Furthermore, while performing this action, the agent may initiate another action such as zooming out by a 1/3 zoom ratio. This action itself requires time to complete and may be initiated while the previous action is still not completed. Therefore, expectations are also a function of the time t and not of a specific action in π, since multiple actions may have been initiated. Thus, in our work, expectations take the form X(π, t), where t is some time after π was initiated.

The following are the main contributions of this paper, centered around the notion of expectations when GDA agents perform durative actions:

– We re-define the notions of informed, regression [5] and goldilocks [18] expectations.
– We formulate properties on regression expectations providing guarantees on the success of the remaining plan when certain conditions are met.
– We provide an empirical evaluation on two very different domains and discuss trade-offs between the different forms of expectations.

The rest of the paper is organized as follows. The next section discusses related work. Section three goes through the preliminaries, and section four defines some operators for use in our later definitions. The next three sections define the different types of expectations we have developed for these domains. Following that, we discuss some properties of our expectations. This is followed by our empirical evaluation and our conclusions.
2 Related Work

GDA is a goal reasoning model in which agents monitor the current plan's execution and assess whether the observed states match the agents' own expectations. GDA agents formulate a plan π achieving goals g from the current state s. They also formulate the expectations X(π, a) for every action a in π. As each action a in π is executed, the agent checks whether its expectations X(π, a) are satisfied in the current state s. When they are not satisfied, the GDA agent follows a procedure to formulate new goals g′, a new plan π′ and new expectations X(π′, a), and the process repeats itself.
Research on GDA agents has explored different facets of the cycle, including generating explanations for the mismatch between the agent's expectation and the state [11] and procedures to generate new goals [16]. In this discussion, we focus on work formulating the expectations. A variety of formulations for the expectations have been proposed, mostly for symbolic domains. This includes defining the expectations as [12]:

– (1) immediate expectations: checking the preconditions of the next action to execute;
– (2) state expectations: the projected state obtained by applying the actions in the plan executed so far;
– (3) informed expectations: the cumulative effects of the actions executed so far;
– (4) goal regression expectations: starting from the goals, accumulating the conditions in the state necessary to execute the rest of the plan, building on classical work [15];
– (5) goldilocks expectations [18], combining (3) and (4).

In our work we re-define informed, regression, and goldilocks expectations when actions have durative effects.

GDA expectations have been explored for actions with numeric fluents. Intervals (x↓, x↑) are used to indicate valid values for a variable. Actions define a function f(x) indicating new values for x after the action is applied. Like their symbolic counterparts, these works assume the actions in a plan π are executed instantaneously and in sequence. Therefore, expectations are also denoted as a function X(π, a) of each action a ∈ π. In Wilson et al. [21], state expectations are defined in which the intervals are projected forward (f(x↓), f(x↑)) for each action. [17] extends these ideas to informed, regression, and goldilocks expectations. In our work, we re-define these expectations for the case in which the numeric effects of an action change over time (e.g., f(x, t)) and include situations in which actions in the plan π are performed concurrently. Hence, we define expectations over time, X(π, t).

Studies on expectation failures emanate from the plan execution literature [20]. For instance, [2] proposed a model to find the reason for a failure in the plan. [4] proposes a taxonomy of expectation failures as a function of the plan, planning knowledge, the resulting state and the state expectations. For example, the incorrect domain knowledge class refers to expectation failures caused by planning operators incorrectly modeling the actual operator's behavior. [20] presents an alternative taxonomy of failures related to the execution of the SIPE HTN planner. For instance, a failure may be attributed to a condition that was held to be true at planning time but is no longer true when the plan is executed. [8] proposes execution failures when the plan does not meet quality considerations. None of these works consider failures due to durative actions.
3 Preliminaries

Throughout this paper, we will use partial mappings. A partial mapping f : A ⇀ B indicates a function that is defined only for a subset of A. When referring to the set of variables defined in a mapping, we write A_f, meaning the set of variables from A defined in the mapping f. When listing the entire mapping (e.g., in examples), we will use a dictionary format to represent the partial mapping, where the keys are the variables and the values are what they map to. For example, f = {a : 1, b : 0} denotes that there are two variables in mapping f, a and b, and that their respective values, denoted f[a] and f[b], are 1 and 0.

Since we are dealing with actions that have functional effects, such as changes over a time t, we use lambda calculus to represent them. Briefly, functions are represented as tuples. The first element in the tuple is the function to be applied; all following elements in the tuple are arguments to that function. For example, (− 3 2) represents the minus function, where 3 is the first argument and 2 is the second, i.e., 3 − 2. Therefore (− 3 2) would return 1. There can also be functions with free variables, called lambda functions. We use these to represent functions dependent on a time variable. For example, one can write f = λt.(− t 3) to represent a function with a single argument t. This function f returns its argument minus 3, thus (f 5) would return 2.

A state is a mapping S : V → R from a collection of variables V to a collection of real numbers R. Since we are dealing with real numbers, it is unrealistic to always know the exact values of variables [19]. Therefore we represent the value of a variable, e.g., at-y[r], with two mappings denoting its lower and upper bounds, e.g., (at-y[r]↓, at-y[r]↑). Actions have durative effects, meaning that for a time period t the variable is changing as a function of t. Table 2 shows an example of a state, while Table 1 shows an example operator that can be applied to this state. As exemplified in Table 1, an operator is a 4-tuple o = (name parameters pre_a eff_a). A set of goals G is a partial mapping G : V ⇀ R from a collection of variables V to a collection of real numbers R. These are also represented with two mappings to denote an upper and lower bound for each variable in V_G.

An operator's preconditions are a list of variable boundings. They are represented as the partial mapping pre_a : V ⇀ C of variables to constants (with individual variables for upper and lower bounds). For example, in the operator move north shown in Table 1 we are checking the lower bound of the variable at-y[x] (i.e., at-y[x]↓) against (∗ t move-rate[x]↑). We define the set of effects from an action as a partial mapping eff_a : V ⇀ Λ from variables to lambda functions. Looking at Table 1, we can see that the effects are a list of 3-tuples. For the purpose of our calculations, we care solely about the functional changes to the state variables. Thus for all variables v ∈ e, where e is the effect list, eff_a[v] = e[v][2][0]. For example, from Table 1, one of the effects is "at-y[x]↓ → (+ at-y[x]↓ ((f1 x) t))". Thus eff_a[at-y[x]↓] would be the lambda function returned by "(f1 x)" (as defined in the operator).

Applying operators changes the values of variables as a function of time t. For example, consider applying the operator move north (Table 1) to the initial state defined in Table 2 with arguments (r, 2), indicating that rover r is to move north for 2 time steps. The operator changes the rover's fuel level fuel[r] and its y coordinate, at-y[r]. The upper bound of its fuel level, fuel[r]↑, changes from 10 to (+ 10 (∗ −t .9)); when t = 2, the execution of the operator is completed and the value of fuel[r]↑ is set to 8.2 (i.e., (+ 10 (∗ −2 .9))).
Table 1. Example of operator with a numeric fluent (Fuel).

(:operator move north
 :parameters x t
 :condition at-y[x]↓ → (∗ t move-rate[x]↑)
            fuel[x]↓ → (∗ t fuel-rate[x]↑)
 :effect [at-y[x]↓ → (+ at-y[x]↓ ((f1 x) t)),
          at-y[x]↑ → (+ at-y[x]↑ ((f2 x) t)),
          fuel[x]↓ → (+ fuel[x]↓ ((f3 x) t)),
          fuel[x]↑ → (+ fuel[x]↑ ((f4 x) t))]
 f1 = λx.(λt.(∗ t move-rate[x]↑)),
 f2 = λx.(λt.(∗ t move-rate[x]↓)),
 f3 = λx.(λt.(∗ -t fuel-rate[x]↑)),
 f4 = λx.(λt.(∗ -t fuel-rate[x]↓))
A plan π is defined as a set of time-stamped actions of the form (time, action). For example, in Table 2 the first action in π is (0, move north(r, 2)), indicating that at time 0 the action move north is executed with the parameters r and 2. The next action is (0, move east(r, 2)). Since its starting time is also 0, the combined effect is that the rover moves diagonally. We denote by T_s : T ⇀ P(A) the partial mapping from a time t to the set of actions T_s(t) starting at time t. Analogously, we define the partial mapping T_e : T ⇀ P(A) from times to the sets of actions that end at those times (P(A) is the power set of A).

A planning problem is a triple P = (S_0, A, G). A plan π solves a problem P if there is a sequence of states S_0, S_1, ..., S_n such that: (1) S_i yields S_{i+1} in π; (2) G is satisfied in S_n. S_i yields S_{i+1} by adding Σ_a f_a(1) for all actions a such that a ∈ T_s(t′) ∪ T_e(t″) with t′ ≤ i ≤ t″, with f_a being the functional effect that changes a variable v in action a. That is, for each variable v, S_{i+1}[v] = S_i[v] + Σ_a f_a[v](1).

A mapping of variables to values such as G is satisfied in a state S if for all v↓ ∈ V_G, G[v↓] ≥ S[v↓] and for all v↑ ∈ V_G, G[v↑] ≥ S[v↑]. The same definition applies to an expectation X, which is also a mapping from variables to values, being satisfied in a state S. Similarly, let π_t be the portion of the plan π that remains to be executed at time t ≤ n + 1. That is, π_t includes all actions in π not in T_e(0) ∪ T_e(1) ∪ ... ∪ T_e(t) (i.e., actions that are still under execution or whose execution has not started yet). π_t is valid if there is a sequence of states S_t, S_{t+1}, ..., S_n such that: (1) S_i yields S_{i+1} in π; (2) G is satisfied in S_n. In our work, we assume the state persists unmodified after the completion of the plan π, meaning S_n = S_{n+1}, and therefore the empty plan π_{n+1} = () solves (S_{n+1}, A, G) whenever π solves (S_0, A, G).
Table 2. Planning problem and a solution plan

(:Initial State
   {fuel : {r↓ : 10, r↑ : 10}}
   {Beacon fuel : {r↓ : 1, r↑ : 1}}
   {at-y : {r↓ : 2, r↑ : 2, Beacon1↓ : 0, Beacon1↑ : 0}}
   {at-x : {r↓ : 0, r↑ : 0, Beacon1↓ : 2, Beacon1↑ : 2}}
   {lit : {Beacon1↓ : 0, Beacon1↑ : 0}}
   {fuel-rate : {r↓ : .9, r↑ : 1.1}}
   {move-rate : {r↓ : .9, r↑ : 1.1}}
 :Actions move north, move south, move east, move west, light beacon
 :Goals {lit : {Beacon1↓ : 1, Beacon1↑ : 1}}
 :Plan π (0, move north(r, 2)), (0, move east(r, 2)), (2, light beacon(r, 1))
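The way a durative effect moves an interval-valued variable can be sketched with ordinary Python lambdas. The sketch below covers only the fuel bounds of rover r under the move north operator of Table 1; the dictionary layout and the names `fuel_effects` and `fuel_bounds_at` are illustrative assumptions, not part of the paper's formalism.

```python
# State from Table 2, restricted to rover r; every variable is stored as
# a (lower bound, upper bound) pair.
state = {
    "fuel": (10.0, 10.0),
    "fuel_rate": (0.9, 1.1),
}


def fuel_effects(state):
    """Lambda effects of move north on the fuel bounds (f3 and f4 in Table 1)."""
    rate_lo, rate_hi = state["fuel_rate"]
    return {
        "fuel_lo": lambda t: -t * rate_hi,  # f3: fastest possible consumption
        "fuel_hi": lambda t: -t * rate_lo,  # f4: slowest possible consumption
    }


def fuel_bounds_at(state, t):
    """Fuel interval after the rover has been moving north for t time steps."""
    lo, hi = state["fuel"]
    effects = fuel_effects(state)
    return lo + effects["fuel_lo"](t), hi + effects["fuel_hi"](t)


print(fuel_bounds_at(state, 2))  # approximately (7.8, 8.2)
```

At t = 2 the upper bound evaluates to about 8.2, which agrees with the worked value given for the fuel level in the preliminaries.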
4 Two Basic Operations

We introduce two basic operations, ⊕_S and ⊖_P, which are used to define precisely the different forms of expectations. Informally, ⊕_S compounds lambda functions (useful for adding together effects of actions), whereas ⊖_P removes a function from a compounded set (useful for removing the effect of an action, as a function of t, after it is completed).

We define D = A ⊕_S^{t′} B, where A are some variables (e.g., accumulated changes while compounding functions), S is the current state, t′ is the current time, and B are the effects of some action (e.g., the next action in the plan). More generally, for any partial functions A and B, any time t′, and any state S with A : V ⇀ Λ, S : V → Λ, t′ ∈ Z+, and B : V ⇀ Λ, A ⊕_S^{t′} B is a partial mapping D_{t′} : V ⇀ Λ defined as follows:

1. if v ∈ V_A ∩ V_B where A[v] = M and B[v] = N, then D_{t′}[v] = λt.(+ (M t) (N (− t t′))).
2. if v ∈ V_A − V_B then D_{t′}[v] = A[v].
3. if v ∈ V_B − V_A where S[v] = M and B[v] = N, then D_{t′}[v] = λt.(+ (M t) (N (− t t′))).
4. for all other variables D_{t′} is undefined (i.e., V_{D_{t′}} = V_A ∪ V_B).
Informally, A ⊕_S^{t′} B results in a function addition (+ (A[v] t) (B[v] (− t t′))) when the variable v is defined in A and B (Case 1), or (+ (S[v] t) (B[v] (− t t′))) when v is defined in B but not A (Case 3). We are using an updated time variable (− t t′) for the
Fig. 1. Example calculation of D[a] from the ⊕ operator example
functions from B since these represent the effects of the next actions to be added into the expectation set. Since we are in the middle of the plan, we need to shift the value of t to reflect that the action does not start at time t = 0. If the variable v is defined in A but not B, it is assigned A[v] (Case 2). When it is undefined in both A and B, it is left undefined (Case 4). For example, if A, S, and B are defined as:
• A = {a : λt.(+ (∗ 2 t) 3), c : λt.(t)}
• S = {a : λt.(+ (∗ 2 t) 3), b : λt.(− (∗ 3 t) 4), c : λt.(t)}
• B = {a : λt.(∗ 2 t), b : λt.(t)}
• t′ = 3 (current time is 3)
• Then D = A ⊕_S^{3} B = {a : λt.(− (∗ 4 t) 3), b : λt.(− (∗ 4 t) 7), c : λt.(t)}
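The four cases of the ⊕ operator and the example above can be checked numerically. This is a minimal sketch, assuming that partial mappings are represented as Python dictionaries from variable names to one-argument functions; the function name `oplus` and the argument `t_now` (standing for the current time t′) are illustrative names, not part of the original formulation.

```python
def oplus(A, S, B, t_now):
    """Sketch of D = A (+)^{t_now}_S B over dicts {variable: function of t}."""
    D = {}
    for v in set(A) | set(B):
        if v in A and v in B:
            # Case 1: add A[v] and the effect B[v] shifted to start at t_now.
            D[v] = lambda t, m=A[v], n=B[v]: m(t) + n(t - t_now)
        elif v in A:
            # Case 2: v only in A, keep A[v] unchanged.
            D[v] = A[v]
        else:
            # Case 3: v only in B, start from the state value S[v].
            D[v] = lambda t, m=S[v], n=B[v]: m(t) + n(t - t_now)
    # Case 4: variables in neither A nor B have no entry in D.
    return D


# The example from the text, with current time 3.
A = {"a": lambda t: 2 * t + 3, "c": lambda t: t}
S = {"a": lambda t: 2 * t + 3, "b": lambda t: 3 * t - 4, "c": lambda t: t}
B = {"a": lambda t: 2 * t, "b": lambda t: t}
D = oplus(A, S, B, 3)
```

Evaluating `D["a"]`, `D["b"]` and `D["c"]` at any t agrees with 4t − 3, 4t − 7 and t, respectively, which matches the mapping D given in the example.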
In the resulting partial function, D[a] = λt.(− (∗ 4 t) 3) is obtained by adding the functions (A[a] t) and (B[a] (− t 3)) (i.e., Case 1). This procedure is shown in Fig. 1. D[b] and D[c] are similarly obtained from the rules defined for the ⊕ operator.
We define D = A ⊖_P^{t′} B, where A are some variables (e.g., accumulated changes while compounding functions), P are the preconditions of some action (e.g., the current action we are regressing from) and B are the effects of the action. More generally, let A : V ⇀ Λ, P : V ⇀ C, t′ ∈ Z+, and B : V ⇀ Λ; we define A ⊖_P^{t′} B as a partial mapping D : V ⇀ Λ with:
1. If v ∈ V_A − V_B, then D[v] = A[v].
2. If v ∈ V_A ∩ V_B where A[v] = M and B[v] = N, then D[v] = (− (M t) (N (− t t′))).
3. If v ∈ V_P, then D[v] = λt.(P[v]).
4. For all other variables, D is undefined (i.e., V_D = V_A ∪ V_P).
Informally, A ⊖_P^{t′} B results in a new partial mapping that is defined for all variables from A and P. The new mapping takes the value A[v] if v is defined in A but not in B (Case 1). If a variable v is defined in A and B, the new mapping takes the value after subtracting, (− (A[v] t) (B[v] (− t t′))) (Case 2). If a variable v is defined in P, the new mapping takes the value P[v] (Case 3). If a variable is not defined in either A or P, it is left undefined (Case 4). For example, if we have the three partial functions A, P, and B, as follows:
– A = {a : λt.(+ t 3), b : λt.(− (∗ 2 t) 4)}
– P = {c : −2}
– B = {b : λt.(∗ 2 t)}
– t′ = 3 (current time is 3)
– Then D = A ⊖_P^{3} B = {a : λt.(+ t 3), b : λt.(2), c : λt.(−2)}.
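The example for the removal operator can be verified in the same way. The sketch below assumes a dictionary representation of the partial mappings; `ominus` and `t_now` are illustrative names. The definition does not state which case applies when a variable is defined in both A and P, so letting the precondition take priority is an assumption of this sketch.

```python
def ominus(A, P, B, t_now):
    """Sketch of the removal operator: A minus effects B, plus preconditions P."""
    D = {}
    for v, m in A.items():
        if v in B:
            # Case 2: subtract the effect B[v], shifted to start at t_now.
            D[v] = lambda t, m=m, n=B[v]: m(t) - n(t - t_now)
        else:
            # Case 1: v not affected by B, keep A[v].
            D[v] = m
    for v, c in P.items():
        # Case 3: a precondition becomes a constant function.
        # Assumption: it overrides an entry coming from A.
        D[v] = lambda t, c=c: c
    # Case 4: variables in neither A nor P have no entry in D.
    return D


# The example from the text, with current time 3.
A = {"a": lambda t: t + 3, "b": lambda t: 2 * t - 4}
P = {"c": -2}
B = {"b": lambda t: 2 * t}
D = ominus(A, P, B, 3)
```

Here `D["b"]` evaluates to 2 for every t, since (2t − 4) − 2(t − 3) = 2, in agreement with the mapping D given in the example.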
5 Informed Expectations with Durative Effects
Agents using Informed Expectations check that the compounded and accumulated effects are valid in the environment. Informally, Informed Expectations accumulate a set of functions over time, extracted from the effects of all durative actions executed so far in π. They compound the functions of all active durative actions and retain all past changes made to the state by actions that have finished executing. Formally, informed expectations are denoted as X_inf(π, −1, t, ∅) = inf_t, for some time t. Each inf_t is recursively generated as follows:
1. inf_{−1} = ∅ (we start with t = −1 for bookkeeping).
2. For all t ≥ 0, inf_t is generated by the following 3 steps:
   (a) inf_t = inf_{t−1}
   (b) for all a ∈ T_s(t), inf_t = inf_t ⊕_{s_{t−1}}^{t} eff^a
   (c) for all a ∈ T_e(t), inf_t = inf_t ⊖_{{}}^{t} eff^a
Case 1, the base case, indicates that before the first action is executed, we have no accumulated effects. The 3 steps of Case 2, the recursive case, are as follows: we start with the expectations computed up to time t − 1 (Step 2(a)). We add the effects of actions starting at t (Step 2(b)). Finally, we subtract the effects of actions terminating at time t (Step 2(c)).
Example: If we are at time t = 2 in our plan, we can calculate inf_2 for fuel as shown in Fig. 2. The first step in calculating the expectations is to combine inf_1 (the informed expectations at time 1) with eff^{move north} (the effects of the move north action) using the ⊖_{{}}^{2} operator. Line 3 of the figure shows the substitution of the values of these mappings (Line 2), applied using the operator definitions. The right side of line 3 shows the simplified result of this operator application. Lines 4 and 5 set up the second operator, using the result of the last calculation and the effects of the move east action. Then, the left side of line 6 shows the application of the operator, with the right side being the simplified result. We can see the entire calculation in Fig. 3. This figure includes all steps for all variables and follows a calculation path similar to that of the simplified figure.
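The recursion for inf_t can be sketched as a loop over time steps. This is a simplified illustration rather than the computation of Fig. 2: it tracks a single value per variable instead of the lower and upper bounds of Table 2, and it assumes a hypothetical fuel effect of −1 unit per time step for each move. The names `oplus`, `ominus`, `informed`, `state_at` and `burn` are introduced here only for the example.

```python
def oplus(A, S, B, t_now):
    """Add effects B (shifted to start at t_now) to A, using state S if needed."""
    D = dict(A)                                  # Case 2 keeps A[v]
    for v, n in B.items():
        m = A[v] if v in A else S[v]             # Case 1 / Case 3
        D[v] = lambda t, m=m, n=n: m(t) + n(t - t_now)
    return D


def ominus(A, P, B, t_now):
    """Remove effects B (shifted to t_now) from A, then add preconditions P."""
    D = dict(A)                                  # Case 1 keeps A[v]
    for v, n in B.items():
        if v in A:                               # Case 2
            D[v] = lambda t, m=A[v], n=n: m(t) - n(t - t_now)
    for v, c in P.items():                       # Case 3
        D[v] = lambda t, c=c: c
    return D


def informed(plan, state_at, t_max):
    """Return inf_{t_max}; plan holds (start, end, effects) triples."""
    inf = {}                                     # inf_{-1} is empty
    for t in range(t_max + 1):                   # Step 2(a): carry inf over
        for start, _end, eff in plan:
            if start == t:                       # Step 2(b): actions starting at t
                inf = oplus(inf, state_at(t - 1), eff, t)
        for _start, end, eff in plan:
            if end == t:                         # Step 2(c): actions ending at t
                inf = ominus(inf, {}, eff, t)
    return inf


# Hypothetical effect: each move drains 1 unit of fuel per time step.
burn = {"fuel": lambda d: -1 * d}
plan = [(0, 2, burn), (0, 2, burn)]              # two moves from t = 0 to t = 2
state_at = lambda t: {"fuel": lambda _t: 10}     # fuel is 10 before the plan
```

Under these assumptions the expectation for fuel is 10 − 2t while both moves are active and becomes the constant 6 once both have ended at t = 2.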
6 Regression Expectations
Informally, Regression Expectations accumulate a set of functions working backwards from the last action in π. They compound the inverse functions of all active durative actions as well as their preconditions, ensuring that the goals will still be met after the completion of π. Formally, let n be the time step when π finalizes its execution, 0 the time when π starts its execution, and t a natural number with 0 ≤ t ≤ n + 1; we denote the regressed expectations at time t by X_reg(π, t, n + 1, G) = reg_t. Each reg_t is generated as follows:
Fig. 2. Calculation of the informed expectations for variable fuel at time step t = 2. The last operation step is left out because eff^{light beacon} doesn’t alter the variable fuel and thus doesn’t alter the expectation set for this variable. Full expansion of all variables in the state can be seen in Fig. 3.
1. reg_{n+1} = G (we start with t = n + 1 for bookkeeping).
2. For all t ≤ n, we compute reg_t in three steps:
   (a) reg_t = reg_{t+1}
   (b) for all a ∈ T_s(t), reg_t = reg_t ⊖_{pre^a}^{t} eff^a
   (c) for all a ∈ T_e(t), reg_t = reg_t ⊕_{s_{t+1}}^{t} eff^a
Case 1 is the base case; t = n is the time at which the last action in π completes its execution. So t = n + 1 is the first time step after the completion of the plan’s execution, and we expect the set of goals G to be satisfied. If the goals are unknown, then G = {}. The 3 steps of Case 2, the recursive case, are as follows: we start with the regressed expectations computed up to time t + 1 (Step 2(a)). We subtract the effects of actions starting at time t (Step 2(b)). Finally, we add the effects of actions ending at t (Step 2(c)). Agents using Regression Expectations check that the rest of the plan can be executed and that, when it is finished, the set of goals G (if they exist) will be satisfied.
Example: If we are at time t = 2 of the plan trace π in Table 2, we can calculate the Regression Expectations reg_2 = X_regress(π, s_2, G) as follows (the preconditions and effects for move east and light beacon have not been shown before):
– reg_4 = G = {lit : {Beacon1↓ : 1, Beacon1↑ : 1}} (i.e., Case 1 with n = 3).
– reg_3 = reg_4 ⊕_{{}}^{3} eff^{light beacon} = {lit : {Beacon1↓ : (+ 1 (− t 3)), Beacon1↑ : (+ 1 (− t 3))}} (i.e., Step 2(c): light beacon ends at t = 3).
– Thus reg_2 = reg_3 ⊖_{pre^{light beacon}}^{2} eff^{light beacon} ⊕_{{}}^{2} eff^{move north} ⊕_{{}}^{2} eff^{move east} (i.e., Step 2(b): light beacon starts at t = 2, and Step 2(c): move east and move north end at t = 2).
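The backward recursion can be sketched in the same style, restricted to the light beacon action of the example. This is a minimal illustration under several assumptions: a single value per variable instead of lower and upper bounds, an effect λt.t on lit (consistent with reg_3 above), and a hypothetical precondition `beacon_fuel = 1`, since the preconditions are not listed in the text. The names `oplus`, `ominus`, `regression` and `state_at` are illustrative.

```python
def oplus(A, S, B, t_now):
    """Add effects B (shifted to start at t_now) to A, using state S if needed."""
    D = dict(A)
    for v, n in B.items():
        m = A[v] if v in A else S[v]
        D[v] = lambda t, m=m, n=n: m(t) + n(t - t_now)
    return D


def ominus(A, P, B, t_now):
    """Remove effects B (shifted to t_now) from A, then add preconditions P."""
    D = dict(A)
    for v, n in B.items():
        if v in A:
            D[v] = lambda t, m=A[v], n=n: m(t) - n(t - t_now)
    for v, c in P.items():
        D[v] = lambda t, c=c: c
    return D


def regression(plan, goals, state_at, n, t_target):
    """Return reg_{t_target}; plan holds (start, end, pre, eff) tuples."""
    reg = {v: (lambda t, c=c: c) for v, c in goals.items()}   # reg_{n+1} = G
    for t in range(n, t_target - 1, -1):          # Step 2(a): carry reg over
        for start, _end, pre, eff in plan:
            if start == t:                        # Step 2(b): actions starting at t
                reg = ominus(reg, pre, eff, t)
        for _start, end, _pre, eff in plan:
            if end == t:                          # Step 2(c): actions ending at t
                reg = oplus(reg, state_at(t + 1), eff, t)
    return reg


# Hypothetical model of light beacon: runs from t = 2 to t = 3.
light = (2, 3, {"beacon_fuel": 1}, {"lit": lambda d: d})
goals = {"lit": 1}
state_at = lambda t: {}                           # not needed for this example
reg3 = regression([light], goals, state_at, 3, 3)
reg2 = regression([light], goals, state_at, 3, 2)
```

With these assumptions `reg3["lit"]` agrees with 1 + (t − 3), and `reg2["lit"]` is the constant 0, i.e., the beacon is expected to be unlit before the action starts, while `reg2` additionally carries the assumed precondition on `beacon_fuel`.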
Fig. 3. Expanded calculation of the informed expectations at time step t = 2.
7 Goldilocks Expectations
Goldilocks Expectations [18] combine Informed and Regression Expectations. Formally, we define Goldilocks Expectations as X_gold(π, t, G) = gold_t, where gold_t = (inf_t, reg_t). That is, for every time t, gold_t is the pair containing the Informed and Regression Expectations for that time. An agent using X_gold(π, i, G) checks the overlap of the regressed and the informed intervals, [left(v′), right(v′)] ∩ [left(v″), right(v″)]. This ensures that the goals are completed while checking for inferred considerations from the action model, such as efficiency.
Example: If we are at time t = 2, then we can compute the Goldilocks Expectations gold_2 = (inf_2, reg_2).
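The overlap check on the informed and regressed intervals can be sketched as a plain interval intersection. The function name `overlap`, the pair representation (left, right), and the numeric intervals in the usage note are assumptions made for illustration only.

```python
def overlap(informed, regressed):
    """Intersect two closed intervals given as (left, right) pairs.

    Returns the common interval, or None when the intervals are disjoint.
    """
    left = max(informed[0], regressed[0])
    right = min(informed[1], regressed[1])
    return (left, right) if left <= right else None
```

For instance, with a hypothetical informed interval (5.6, 6.4) and regressed interval (6.0, 10.0) for the same variable, `overlap` returns (6.0, 6.4); a result of None signals that no value satisfies both kinds of expectations.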
8 Property of Regression
Theorem 1. Let π be a plan solving (S_0, A, G). If X_reg(π, t, n + 1, G) is satisfied in S_t, then π_t solves (S_t, A, G).
Base case: t = n + 1. At time n + 1, X_reg(π, n + 1, n + 1, G) = G by definition. Therefore, if X_reg(π, n + 1, n + 1, G) is satisfied, then the empty plan π_{n+1} = () solves (S_{n+1}, A, G).
Recursive case: We show that if reg_i is satisfied at t = i, then after executing one time step of π_i, reg_{i+1} will be satisfied at t = i + 1. By the induction hypothesis, π_{i+1} solves (S_{i+1}, A, G) and, hence, π_i solves (S_i, A, G). The calculation of reg_i begins with reg_i equaling reg_{i+1}. We analyze the three kinds of actions with respect to reg_i; if reg_i yields reg_{i+1} for each of these kinds of actions individually, then reg_i yields reg_{i+1} overall:
– Actions that continue through time step i: that is, each action a such that a ∈ T_s(t′) ∩ T_e(t″) and t′ < i < t″. For each such action a, reg_i will yield reg_{i+1} over the variables affected by a. That is, reg_{i+1}[v] = reg_i[v] + f^a[v](1).
– Actions that end in time step i: that is, each action a with a ∈ T_e(i). For each such action a, its effects are added into the regression set as a function of time using the ⊕ operator. Specifically, this will execute Case 1 of the ⊕ operator, resulting in reg_i = (+ (reg_{i+1} t) (eff^a (− t t′))). For this time step, where the action is ending, t = t′; thus (eff^a (− t t′)) = 0 and (+ (reg_{i+1} t) (eff^a (− t t′))) = reg_{i+1}.
– Actions that begin in time step i: that is, each action a with a ∈ T_s(i). For each such action a, we remove its effects from the regression set using the ⊖ operator. We consider first the effects and then the preconditions:
  • Effects: Specifically, we use Case 2 of the ⊖ operator, resulting in reg_i = (− (reg_{i+1} t) (eff^a (− t t′))). For this time step, where the action is starting, t = t′; thus (eff^a (− t t′)) = 0 and (− (reg_{i+1} t) (eff^a (− t t′))) = reg_{i+1}.
  • Preconditions: Lastly, we add in the preconditions of all actions that begin at time i. By adding these values into the regression set, we ensure that all actions beginning at time i will be executable.
9 Empirical Evaluation
We did a comparative study of the 5 different types of expectations across 2 temporal domains. The 5 expectation types we tested were Immediate, Informed, Regression, Goal Regression, and Goldilocks Expectations (Immediate Expectations in this case simply check the preconditions of the action before execution). The difference between goal regression and regression is that in the latter G = ∅, accounting for situations when the goals are not known (e.g., using a domain-specific planner with implicit goals). For planning purposes, we use the Pyhop HTN planner [14], which was extended
Fig. 4. Experimental results for our Marsworld domain
to handle numeric fluents over a temporal space, with the HTN methods configured to generate correct plans. Other than the expectation type, the agent uses the same planning and discrepancy handling processes. Whenever a discrepancy is observed from the expectations, we use a simple goal-reasoning process to select the original goal and plan again from the current state. Thus, any performance change is attributable to the expectations.
Marsworld Definition. We used a temporal variant of the Marsworld domain [6], itself inspired by Mudsworld [9]. The agent has to navigate a 10 × 10 grid to turn on 3 randomly placed beacons. The grid space is continuous in this version. Each movement action drains some amount of the agent’s fuel. There is a second fuel resource which is used solely for lighting the beacon. While executing its actions, the agent may unexpectedly suffer damage, forcing it to use more fuel per action until repaired (this can occur with a 2% probability after each time step). This damage may also cause the agent to lose beacon fuel. For testing, we ran 200 trials; each trial placed the rover and beacons randomly on the grid. During the trials we measured total fuel consumption as well as whether or not an execution failed. A failure means the preconditions of some action were not satisfied when it was to be executed (e.g., when the agent attempts to execute a move but doesn’t have enough fuel).
Results for Marsworld. In Fig. 4, we can see that Regression and Goal Regression Expectations consumed the most fuel, with Informed and Immediate performing basically equally. Goldilocks Expectations outperform all of the rest. This occurs because Regression and Goal Regression Expectations are the only ones that do not notice when the agent is damaged, causing increased fuel consumption. They only look at future preconditions, so they only realize the damage once it drains enough fuel that the plan can no longer be completed. The other 3 expectation types identify increased consumption after 1 action, since they monitor the effects of the actions. Goldilocks has the
Fig. 5. Experimental results for our camera domain
addition of noticing the mud from the regression part of its expectations, so it travels a more efficient route. Immediate, Regression, Goal Regression and Goldilocks are able to ensure that the plan will be completed without failures, while 27% of trials failed for Informed Expectations. Informed fails because it will attempt to execute an action without its preconditions being satisfied. All other expectation types check preconditions. Specifically, in this scenario, agents using Informed Expectations will attempt to light a beacon after having lost beacon fuel, thus failing.
Camera Surveillance Definition. The agent in this domain is a camera with the ability to swivel 360° as well as zoom in and out. The goal of the agent in this domain is to keep a moving object in the environment in view while zoomed in as far as possible. Without a state-satisfiable goal, we instead represented the goal as a reward function. Both zooming and turning the camera are durative actions in this domain, and both consume from the same energy resource. While executing its actions, the agent may suffer unexpected damage, causing its actions to consume more energy. This damage is repairable if recognized. Since there is no satisfiable set of goals G, goal regression and regression are equal.
Camera Surveillance Results. All expectation types were able to track the object, so the only difference between them is in energy consumption. In Fig. 5, we can see that Regression and Goal Regression are very similar, with the only differences coming from the random setup. We can also see that these expectations used the most energy, while Immediate and Informed used a lower amount and Goldilocks used the lowest. This differential comes from the recognition of excess energy consumption, as well as efficient action taking. For example, both regression and goal regression only check the preconditions of actions. Therefore, when an action consumes more energy than expected, the agent doesn’t recognize it. This results in the agent continuing to take actions while consuming more energy per action, which results in more energy consumption per episode.
Fig. 6. Experimental results across Marsworld and camera domain
Looking at Fig. 6, we see that Goldilocks was able to generate the most reward. The reward is maximized when the moving object is in frame and is maximally zoomed in on.
10 Conclusions
We see some commonalities between these two domains which are backed up by theory. Regression expectations may incur a higher execution cost for the agent, but guarantee that the plan will succeed, illustrating Theorem 1. The increased cost comes from not monitoring the effects of actions previously executed. This allows for situations where the agent incurs excess cost without recognizing it. Informed and Immediate mitigate execution costs for the agent by checking the effects of previous actions. However, by not checking the preconditions of future actions, the agent can fail to achieve its goals. Goldilocks is able to take the best from both and incur the smallest execution costs while guaranteeing the success of the plan. The agent checks its regressed conditions, ensuring its goals are achieved, while monitoring the accumulated effects of its actions, ensuring it doesn’t incur unexpected costs. Our definitions of expectations are not limited to GDA agents and could be used by other execution monitoring agents. In particular, for future work, we would like to explore an agent’s expectations when the agent is part of a team achieving a common goal. This will require reasoning with social interaction aspects in addition to states’ fluents. This can follow ideas proposed in [3].
Acknowledgments. This research was supported by ONR under grants N00014-18-1-2009 and N68335-18-C-4027 and NSF grant 1909879. The opinions in this paper are from the authors and not necessarily from the funding agencies.
References
1. Aha, D.W.: Goal reasoning: foundations emerging applications and prospects. AI Mag. 39(2), 3–24 (2018)
2. Birnbaum, L., Collins, G., Freed, M., Krulwich, B.: Model-based diagnosis of planning failures. AAAI 90, 318–323 (1990)
3. Coman, A., Aha, D.W.: AI rebel agents. AI Mag. 39(3), 16–26 (2018)
4. Cox, M.T.: Perpetual self-aware cognitive agents. AI Mag. 28(1), 32 (2007)
5. Dannenhauer, D.: Self monitoring goal driven autonomy agents. Ph.D. thesis, Lehigh University (2017)
6. Dannenhauer, D., Munoz-Avila, H.: Raising expectations in GDA agents acting in dynamic environments. In: IJCAI, pp. 2241–2247 (2015)
7. Dannenhauer, D., Munoz-Avila, H., Cox, M.T.: Informed expectations to guide GDA agents in partially observable environments. In: IJCAI, pp. 2493–2499 (2016)
8. Fritz, C., McIlraith, S.C.: Monitoring plan optimality during execution. In: ICAPS, pp. 144–151 (2007)
9. Molineaux, M., Aha, D.W.: Learning unknown event models. In: AAAI, pp. 395–401 (2014)
10. Molineaux, M., Klenk, M., Aha, D.W.: Goal-driven autonomy in a navy strategy simulation. In: AAAI (2010)
11. Molineaux, M., Kuter, U., Klenk, M.: Discoverhistory: understanding the past in planning and execution. In: Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp. 989–996. International Foundation for Autonomous Agents and Multiagent Systems (2012)
12. Munoz-Avila, H., Dannenhauer, D., Reifsnyder, N.: Is everything going according to plan? expectations in goal reasoning agents. In: Proceedings of AAAI-19 (2019)
13. Muñoz-Avila, H., Jaidee, U., Aha, D.W., Carter, E.: Goal-driven autonomy with case-based reasoning. In: Bichindaritz, I., Montani, S. (eds.) ICCBR 2010. LNCS (LNAI), vol. 6176, pp. 228–241. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-14274-1_18
14. Nau, D.: Pyhop, version 1.2.2 a simple HTN planning system written in python (2013). https://bitbucket.org/dananau/pyhop. Accessed 30 Jan 2019
15. Pollock, J.L.: The logical foundations of goal-regression planning in autonomous agents. Artif. Intell. 106(2), 267–334 (1998)
16. Powell, J., Molineaux, M., Aha, D.W.: Active and interactive discovery of goal selection knowledge. In: FLAIRS Conference (2011)
17. Reifsnyder, N., Munoz-Avila, H.: Computing numeric expectations for cognitive agents. In: ACS-2020 (2020)
18. Reifsnyder, N., Munoz-Avila, H.: Goal reasoning with goldilocks and regression expectations in nondeterministic domains. In: 6th Goal Reasoning Workshop at IJCAI/FAIM-2018 (2018)
19. Scala, E., Haslum, P., Thiébaux, S., Ramirez, M.: Interval-based relaxation for general numeric planning. In: Proceedings of the Twenty-second European Conference on Artificial Intelligence, pp. 655–663. IOS Press (2016)
20. Wilkins, D.E.: Recovering from execution errors in sipe. Comput. Intell. 1(1), 33–45 (1985)
21. Wilson, M.A., McMahon, J., Aha, D.W.: Bounded expectations for discrepancy detection in goal-driven autonomy. In: AI and Robotics: Papers from the AAAI Workshop (2014)
Sublinear Regret with Barzilai-Borwein Step Sizes
Iyanuoluwa Emiola
Electrical and Computer Engineering Department, University of Central Florida, Orlando, FL 32816, USA
[email protected]
Abstract. This paper considers the online scenario using the Barzilai-Borwein Quasi-Newton Method. In an online optimization problem, an online agent uses a certain algorithm to decide on an objective at each time step, after which a possible loss is encountered. Even though the online player will ideally try to make the best decisions possible at each time step, there is a notion of regret associated with the player’s decisions. This study examines the regret of an online player using optimization methods like the Quasi-Newton methods, due to their fast convergence properties. The Barzilai-Borwein (BB) gradient method is chosen in this paper over other Quasi-Newton methods such as the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm because of its lower computational complexity. In addition, the BB gradient method is suitable for large-scale optimization problems, including the online optimization scenario presented in this paper. To corroborate the effectiveness of the Barzilai-Borwein (BB) gradient algorithm, a greedy online gradient algorithm based on the two BB step sizes is used in this study. Through a rigorous regret analysis on the BB step sizes, it is established that the regret obtained from the BB algorithm is sublinear in time. Moreover, this paper shows that the average regret using the online BB algorithm approaches zero without assuming strong convexity of the cost function to be minimized.
Keywords: Online optimization · Quasi-Newton methods · Barzilai-Borwein step sizes · Sublinear regret

1 Introduction
This paper presents a gradient-based algorithm using the Barzilai-Borwein step sizes to solve an online optimization problem, together with a regret analysis of an online player. Online optimization involves a process where an online agent makes a decision without knowing whether the decision is correct or not. The objective of the online agent is to make a sequence of accurate decisions given knowledge of the optimal solution to previous decisions. A common term associated with many optimization problems, known as regret, measures how well the online agent performs after a certain time, based on the difference between the loss incurred and the best decision taken [13]. The problem of online optimization has applications to a number of fields including game theory, the smart grid and classification in machine learning, amongst others. Performance of online optimization algorithms is usually measured in terms of the aggregate regret suffered by the online agent compared with the known optimal solution of each problem across the sequence of problems.
Online optimization methods and algorithms have been studied using different methods including gradient-based methods [13,22,28]. Extensions have been considered on unconstrained problems [17] and online problems with long-term [16]. Problems in dynamic environments have also been analyzed in [5,14,18,23,26,27]. The author in [27] used a gradient tracking technique in a static optimization scenario and showed that the regret bounds in the dynamic optimization case are independent of the time horizon. In [26], the authors obtained sublinear regret in a dynamic case for a distributed online problem using the primal-dual descent algorithm. The authors in [23] obtained sublinear regret for a distributed online framework that has time-varying constraints and presented a fit technique to deal with constraint violations. In [5], the authors considered the online optimization problem with an application to adversarial attacks. The authors explored an online constrained problem with adversarial objective functions and constraints and obtained a sublinear regret. In addition, the authors in [5,14,18,23,26,27] used gradient methods in their computational approach to establish convergence.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 499–512, 2022. https://doi.org/10.1007/978-3-030-82193-7_33
As well-structured as gradient methods are, applying them to large-scale online problems faces several challenges and becomes impractical due to their well-known slow convergence rates in the static setting [2]. To address the slow convergence rates of first-order methods, second-order (popularly called Newton-type) methods have been proposed [20]. The Newton method was applied in [14], where the authors showed that the Newton method performs similarly to a case where the strong convexity condition is used on the objective function. While Newton-type iterative methods have quadratic convergence, they also present a significant computational overhead from the need to invert and store the hessian of the objective function being optimized, which makes them impractical for large-scale online optimization problems. This paper aims to improve further on the Newton method by using the Barzilai-Borwein Quasi-Newton method in an online optimization scenario. To leverage the computational simplicity of gradient methods and the convergence properties of second-order methods, the so-called quasi-Newton methods have been introduced; for example, the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [9,15] and the Barzilai-Borwein (BB) algorithm [1,7]. Quasi-Newton methods bring the second-order (curvature) information of the objective function being optimized into the first-order framework. For example, the BFGS method approximates the information in the curvature
of the hessian between time steps to use in its update [9]. The Stochastic BFGS and its low-memory variant (the L-BFGS) quasi-Newton methods have been studied in online settings [3,20], with good performance relative to the standard gradient method. The BB method, on the other hand, computes a step size such that the computed step size and gradient contain information that approximates the hessian curvature. Convergence rate analyses have been obtained for these quasi-Newton methods [4,6], and these methods are increasingly being used in large-scale, computation-intensive applications such as distributed learning. In this paper, an online Barzilai-Borwein quasi-Newton algorithm is presented and its performance for the two variations of the BB step sizes is analyzed using the regret. We show that the regret increases sublinearly in time. Following an introduction of the problem and a brief summary of existing approaches (Sect. 2), we introduce quasi-Newton methods that exploit the known fast convergence of second-order methods (Sect. 3) and present our main result (Sect. 4). Concluding remarks follow in Sect. 5.

1.1 Contributions
In this paper, a novel regret analysis using the Barzilai-Borwein Quasi-Newton method in an online optimization scenario is presented. Due to the fast convergence property of Newton methods, the work [14] is an improvement on the existing online optimization applications in [5,23,26,27]. However, the Quasi-Newton method using the BB step sizes presented in this paper is better than Newton methods with respect to convergence speed and the cost of computing the inverse of the hessian. Even though the author in [28] also obtained a similar sublinear regret result, the BB Quasi-Newton algorithm is known to be suitable for large-scale optimization bottlenecks for which the Newton method is not appropriate. Additionally, the strong convexity assumption is not needed in this paper to establish sublinear regret.
Notation: We represent vectors and matrices as lower and upper case letters, respectively. Let a vector or matrix transpose be (·)^T, and the L2-norm of a vector be ‖·‖. Let the gradient of a function f(·) be ∇f(·), and the set of real numbers be R.
2 Problem Formulation
Consider an online optimization problem

$$\min_{x(k) \in \mathcal{X}} f_k(x(k)), \tag{1}$$

in which the feasible decision set X ⊆ R^n is known, assumed to be convex quadratic, non-empty, bounded, closed and fixed for all time k = 1, ..., K. We assume the number of iterations during which the online players make choices, K, is unknown to the player. By convexity of the cost function f_k(·) and X, Problem (1) has an optimal solution x*, which is the best possible choice or
decision agents can make at each time k. A player (an online agent) at time k uses some algorithm to choose a point x(k) ∈ X, after which the player receives a loss function f_k(·). The loss incurred by the player is f_k(x(k)). These problems are common in contexts such as real-time resource allocation and online classification [13]. The goal of the online agent is to minimize the aggregate loss by determining a sequence of feasible online solutions x(k) at each time step of the algorithm. Let the aggregate loss incurred by the online algorithm that solves Problem (1) at time K be given by:

$$f(K) = \sum_{k=1}^{K} f_k(x(k)).$$

To measure performance of the online player, we use the regret framework. The static regret is a measure of the difference between the loss of the online player and the loss from the static case

$$\min_{x \in \mathcal{X}} f_k(x),$$

where the single best decision x* is chosen with the benefit of hindsight. Let the aggregate loss up to time K incurred by the single best decision be given by

$$f_x(K) = \sum_{k=1}^{K} f_k(x).$$

Then the static regret at time K is defined as [13]:

$$R(K) = f(K) - \min_{x \in \mathcal{X}} f_x(K). \tag{2}$$

2.1 Algorithms for Online Optimization Problem
A commonly used algorithm for solving the static case of Problem (1) is the gradient descent method, which involves updating the variable x(k) iteratively using the gradient of the cost function with the following equation:

$$x(k+1) = x(k) - \alpha \nabla f(x(k)). \tag{3}$$

It is known that with an appropriate choice of the step size α, the sequence {x(k)} converges to x* in O(1/k); that is, an ε-optimal solution is attained in about O(1/ε) iterations [19]. Moreover, when the cost function is strongly convex, the update equation in (3) reaches an ε-optimal solution in about O(1/ε²) iterations. Even though the update schemes of gradient methods are easily implementable in a distributed architecture, as seen in [11,19], there has been a need for an improvement in the convergence rates of gradient methods, as seen in [25]. Nonetheless, techniques to accelerate convergence lag behind the Newton and quasi-Newton methods [25].
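The interplay between the update (3) and the static regret (2) can be illustrated on a toy problem. The following is only a sketch under stated assumptions: the losses are scalar quadratics f_k(x) = (x − c_k)², the iterates are not projected onto a feasible set, and the names `online_gradient_descent`, `static_regret` and `targets` are hypothetical.

```python
def online_gradient_descent(targets, alpha, x0=0.0):
    """Play x(k), then observe f_k(x) = (x - c_k)^2 and apply update (3)."""
    x, played = x0, []
    for c in targets:
        played.append(x)               # decision made before f_k is revealed
        grad = 2.0 * (x - c)           # gradient of f_k at the played point
        x = x - alpha * grad           # gradient step, as in (3)
    return played


def static_regret(targets, played):
    """Regret (2): online loss minus the loss of the best fixed decision."""
    online_loss = sum((x - c) ** 2 for x, c in zip(played, targets))
    best = sum(targets) / len(targets)   # minimizer of the summed quadratics
    best_loss = sum((best - c) ** 2 for c in targets)
    return online_loss - best_loss


targets = [1.0, 1.0, 1.0, 1.0]
played = online_gradient_descent(targets, alpha=0.5)
regret = static_regret(targets, played)
```

In this toy run only the first decision incurs a loss, so the regret stays bounded while K grows and the average regret R(K)/K shrinks; the mean of the targets is used as the best fixed decision because it minimizes a sum of such quadratics.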
To improve convergence rates in static optimization problems, algorithms that use second-order information (the hessian of the cost function) have been introduced. These methods leverage curvature information of the cost function in addition to direction, and are known to speed up convergence in the neighborhood of the optimal solution. The Newton-type method is one example, used to obtain faster convergence rates than the regular gradient method. In fact, when the cost function is quadratic, the Newton algorithm is known to converge in one time step. For non-quadratic cost functions, the Newton method still converges in just a few time steps [10]. Though these methods have good convergence properties, there are computational costs associated with building the hessian and computing its inverse. In addition, some modifications are needed if the hessian is not positive definite [12]. To avoid the computational burden of second-order methods while maintaining the structure of first-order methods, Quasi-Newton methods have been introduced.

2.2 Quasi-Newton Methods
A number of quasi-Newton methods have been proposed in the literature, including the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm [9], the Barzilai-Borwein (BB) algorithm [8], and the Davidon-Fletcher-Powell (DFP) algorithm [24]. The main idea of these methods is to speed up convergence by using information from the inverse Hessian without computing it explicitly; for example, Barzilai-Borwein computes step sizes using the difference of successive iterates and the gradients evaluated at those iterates. Although BFGS can be used to facilitate rapid convergence, scaling is an issue, especially when the method approximates the curvature information of the Hessian between time steps, as seen in [9]. The Barzilai-Borwein quasi-Newton method, however, uses just the step size to approximate the inverse Hessian. In this paper, we use the gradient-based method with Barzilai-Borwein step sizes to solve Problem (1) and show that the regret increases sublinearly in time.
3 The Barzilai-Borwein Quasi-Newton Method
The Barzilai-Borwein quasi-Newton method is an iterative technique for solving optimization problems that can yield superlinear convergence rates when the objective functions are strongly convex and quadratic [1,6]. It differs from other quasi-Newton methods because it uses only one step size per iteration, whereas other quasi-Newton methods have more computational overhead. The Barzilai-Borwein method solves Problem (1) iteratively using the update in (3); however, the step size α(k) is computed so that α(k)I approximates the inverse Hessian. We briefly introduce the two forms of the BB step sizes used in Algorithm 1.
Consider the update x(k + 1) = x(k) − α(k)∇f(x(k)). The two forms of the BB step sizes [1], α1(k) and α2(k), are given by:
$$\alpha_1(k) = \frac{s(k-1)^T s(k-1)}{s(k-1)^T y(k-1)}, \tag{4}$$
$$\alpha_2(k) = \frac{s(k-1)^T y(k-1)}{y(k-1)^T y(k-1)}, \tag{5}$$
where s(k − 1) and y(k − 1) are such that
$$s(k-1) = x(k) - x(k-1) \quad \text{and} \quad y(k-1) = \nabla f(x(k)) - \nabla f(x(k-1)).$$
In general, there is flexibility in the choice to use α1(k) or α2(k) [1], and both step sizes can be alternated within the same algorithm after a considerable number of iterations to facilitate convergence. The rest of this work characterizes the performance of the online Algorithm 1 using the step sizes in Eqs. (4) and (5), which, as we will show, has a regret that is sublinear in time with the average regret approaching zero. Before stating the main result, we state some assumptions about Problem (1) and Algorithm 1.

Assumption 1. The decision set X is bounded. This implies that there exists some constant 0 ≤ B < ∞ such that |X| ≤ B.

Assumption 2. The decision set X is closed; that is, suppose all agents' decisions follow an iterative sequence x(k) ∈ X. If there exists some $\hat{x} \in \mathbb{R}^n$ such that $\lim_{k\to\infty} x(k) = \hat{x}$, then $\hat{x} \in X$.

Assumption 3. For all decision iterates x(k), the cost function f(x(k)) is differentiable and the gradient of the objective function ∇f is Lipschitz continuous. This means that for all x and y, there exists L > 1 such that:
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|.$$
Algorithm 1. Online Barzilai-Borwein Quasi-Newton Alg.
Given: Feasible set X and time horizon K
Initialize: x(0) and ∇f0(x(0)) arbitrarily
for k = 1 to K do
    Agent predicts x(k) and observes fk(·)
    Update x(k + 1) = x(k) − α(k)∇fk(x(k))
end for
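A minimal sketch of the update in Algorithm 1 with the step sizes (4) and (5) is given below. It assumes the static case (a single gradient function `grad` instead of a sequence fk), and the names `bb_step_size`, `bb_gradient_method`, `fallback`, `eps` and `alpha0` are hypothetical; the safeguard used when s(k − 1)ᵀy(k − 1) is not positive is an illustrative addition, not part of the paper.

```python
import numpy as np

def bb_step_size(s, y, variant=1, fallback=0.1, eps=1e-20):
    """Return the BB step size (4) if variant == 1, otherwise (5).

    s = x(k) - x(k-1), y = grad f(x(k)) - grad f(x(k-1)).
    `fallback` and `eps` are illustrative safeguards (assumptions).
    """
    sy = float(s @ y)
    if sy <= eps:                       # curvature not usable, e.g. s = 0
        return fallback
    return float(s @ s) / sy if variant == 1 else sy / float(y @ y)

def bb_gradient_method(grad, x0, num_iters, variant=1, alpha0=0.1):
    """Run x(k+1) = x(k) - alpha(k) * grad(x(k)) with BB step sizes."""
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - alpha0 * g_prev        # first step uses an arbitrary alpha0
    for _ in range(num_iters):
        g = grad(x)
        alpha = bb_step_size(x - x_prev, g - g_prev, variant, fallback=alpha0)
        x_prev, g_prev = x, g
        x = x - alpha * g
    return x

# Hypothetical example: f(x) = x^T x - b^T x, so the Hessian is 2 * I.
b = np.array([2.0, -4.0])
x_final = bb_gradient_method(lambda x: 2.0 * x - b, x0=[0.0, 0.0], num_iters=5)
```

Because the Hessian in this example is a multiple of the identity, both BB step sizes equal 1/2 and the minimizer b/2 is reached right after the first BB step; general problems need more iterations.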
4 Regret Bounds
Before we present our main results (Theorems 1 and 2), we first present two lemmas that will be used in their proofs. The first is a result in [28], which will be used in the definition of regret, and the other is Sedrakyan's inequality.

Lemma 1. ([28]) Without loss of generality, for all iterates k, there exists a gradient g(k) ∈ R^n such that for all x, g(k) · x = fk(x), where g(k) = ∇fk(x(k)).
Proof. The proof can be seen in [28].

Lemma 2. (Sedrakyan's Inequality) For all positive reals a1, a2, ..., an and b1, b2, ..., bn, the following inequality holds:
$$\sum_{i=1}^{n} \frac{a_i^2}{b_i} \ge \frac{\left(\sum_{i=1}^{n} a_i\right)^2}{\sum_{i=1}^{n} b_i}.$$
Proof. We refer readers to [21] for a proof.
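As a quick numerical sanity check of Lemma 2 (not a proof), the gap between the two sides can be evaluated on random positive inputs. The function name `sedrakyan_gap`, the sampling range and the seed are assumptions made for illustration.

```python
import numpy as np

def sedrakyan_gap(a, b):
    """Return sum(a_i^2 / b_i) - (sum a_i)^2 / sum(b_i); nonnegative when b_i > 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sum(a**2 / b) - np.sum(a)**2 / np.sum(b))

rng = np.random.default_rng(0)
gaps = [sedrakyan_gap(rng.uniform(0.1, 5.0, 6), rng.uniform(0.1, 5.0, 6))
        for _ in range(1000)]

# Equality holds when a_i / b_i is the same for every i.
equality_case = sedrakyan_gap([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```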
Another result we will use is the static regret bound for R(K), which is shown in [28]:
$$R(K) \le D^2\,\frac{1}{2\alpha(K)} + \frac{\|\nabla f_m\|^2}{2} \sum_{k=1}^{K} \alpha(k). \tag{6}$$
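To see how the bound in (6) behaves, its right-hand side can be evaluated for a given sequence of step sizes. The sketch below uses the decaying schedule α(k) = 1/√k purely as an illustration (it is not the BB step size), and `regret_bound` and `average_bound` are hypothetical helper names.

```python
import math

def regret_bound(D, G, alphas):
    """Right-hand side of (6): D^2 / (2 * alpha(K)) + (G^2 / 2) * sum_k alpha(k)."""
    return D**2 / (2.0 * alphas[-1]) + 0.5 * G**2 * sum(alphas)

def average_bound(D, G, K):
    """Bound on R(K) / K for the illustrative schedule alpha(k) = 1 / sqrt(k)."""
    alphas = [1.0 / math.sqrt(k) for k in range(1, K + 1)]
    return regret_bound(D, G, alphas) / K

avg_100 = average_bound(D=1.0, G=1.0, K=100)
avg_10000 = average_bound(D=1.0, G=1.0, K=10000)
```

For this schedule both terms of the bound grow like √K, so the bound on the average regret decays like 1/√K; a sublinear bound on the sum of the step sizes plays the same role in the theorems that follow.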
As seen in [28], D denotes the maximum value of the diameter of X and $\|\nabla f_m\| = \max_{x \in X} \|\nabla f_k(x)\|$. We will now proceed to characterize the regret obtained from Algorithm 1 for Problem (1) with the two BB step sizes.

Theorem 1. Consider Problem (1) and let
$$\alpha(k) = \frac{s(k-1)^T s(k-1)}{s(k-1)^T y(k-1)}$$
in Algorithm 1. If
$$\frac{e-d}{c-b} \le \frac{d}{b},$$
where
$$b = \left(\|x(1)-x(0)\| + \|x(2)-x(1)\|\right)^2, \qquad c = 2\left(\|x(1)-x(0)\|^2 + \|x(2)-x(1)\|^2\right),$$
$$d = \sum_{k=1}^{K} (x(k)-x(k-1))^T\left(\nabla f(x(k)) - \nabla f(x(k-1))\right), \qquad e = \sum_{k=1}^{K} L\,\|x(k)-x(k-1)\|^2,$$
and also if P = min(P, Z), where
$$P = \sum_{k=1}^{K} \alpha(k) \qquad \text{and} \qquad Z = \frac{2\left(\|x(1)-x(0)\|^2 + \|x(2)-x(1)\|^2\right)}{L \sum_{k=1}^{K} \left(\|x(k)\|^2 + \|x(k-1)\|^2\right)},$$
then the average regret is bounded by
$$\frac{R(K)}{K} \le D^2\,\frac{1}{2K\alpha(K)} + \frac{\|\nabla f_m\|^2}{2K}\,\Psi,$$
where
$$\Psi = \frac{2\left(\|x(1)-x(0)\|^2 + \|x(2)-x(1)\|^2\right)}{L \sum_{k=1}^{K} \|x(k)\|^2 + L \sum_{k=1}^{K} \|x(k-1)\|^2},$$
$L = \max_k L_k$, $L_k$ is the Lipschitz parameter of $\nabla f_k(x(k))$ in Problem (1), and $\lim_{K\to\infty} R(K)/K$ approaches 0.

Proof. First, by using the results of Lemma 1, the regret of Algorithm 1 can be expressed as:
$$R(K) = \sum_{k=1}^{K} (x(k) - x^*)\, g(k).$$
Then from Eq. (3), the regret is
$$R(K) = \sum_{k=1}^{K} \left(x(k-1) - \alpha(k-1)\nabla f(x(k-1)) - x^*\right) g(k),$$
where α(k) is as expressed in (4). To prove Theorem 1, the approach will be to upper-bound the aggregate sum of the step sizes α(k) and use the generalized bound for online gradient descent in Eq. (6). This approach is possible since the gradient of the cost function at each time in the sequence of problems is bounded (Assumption 3). Proceeding, the running sum of the step sizes α(k) up to time K is expressed as
$$\begin{aligned}
\sum_{k=1}^{K} \alpha(k) &= \sum_{k=1}^{K} \frac{s(k-1)^T s(k-1)}{s(k-1)^T y(k-1)} \\
&= \sum_{k=1}^{K} \frac{(x(k)-x(k-1))^T (x(k)-x(k-1))}{(x(k)-x(k-1))^T \left(\nabla f(x(k)) - \nabla f(x(k-1))\right)} \\
&= \sum_{k=1}^{K} \frac{\|x(k)-x(k-1)\|^2}{(x(k)-x(k-1))^T \left(\nabla f(x(k)) - \nabla f(x(k-1))\right)}.
\end{aligned}$$
By applying the result in Lemma 2 to the right hand side of the preceding equality, we obtain that:
$$\sum_{k=1}^{K} \alpha(k) \ge \frac{\left(\sum_{k=1}^{K} \|x(k)-x(k-1)\|\right)^2}{\sum_{k=1}^{K} (x(k)-x(k-1))^T \left(\nabla f(x(k)) - \nabla f(x(k-1))\right)}. \tag{7}$$
By inspection, if we write the first few terms of the numerator of Eq. (7), it is evident that Eq. (7) can be further lower bounded according to the following:
$$\sum_{k=1}^{K} \alpha(k) \ge \frac{\left(\|x(1)-x(0)\| + \|x(2)-x(1)\|\right)^2}{\sum_{k=1}^{K} (x(k)-x(k-1))^T \left(\nabla f(x(k)) - \nabla f(x(k-1))\right)}. \tag{8}$$
Clearly, because the terms ‖x(1) − x(0)‖ and ‖x(2) − x(1)‖ are positive, the numerator of Eq. (8) can be upper-bounded according to the following:
$$\left(\|x(1)-x(0)\| + \|x(2)-x(1)\|\right)^2 \le 2\left(\|x(1)-x(0)\|^2 + \|x(2)-x(1)\|^2\right).$$
To bound the denominator of Eq. (8), we use the Lipschitz continuity of the gradients of f(·) with parameter L > 1. Therefore,
$$\sum_{k=1}^{K} (x(k)-x(k-1))^T \left(\nabla f(x(k)) - \nabla f(x(k-1))\right) \le \sum_{k=1}^{K} L\,\|x(k)-x(k-1)\|^2.$$
We represent the bounds in the numerator and denominator of Eq. (8) by the following variables:
$$b = \left(\|x(1)-x(0)\| + \|x(2)-x(1)\|\right)^2, \qquad c = 2\left(\|x(1)-x(0)\|^2 + \|x(2)-x(1)\|^2\right),$$
$$d = \sum_{k=1}^{K} (x(k)-x(k-1))^T \left(\nabla f(x(k)) - \nabla f(x(k-1))\right), \qquad e = \sum_{k=1}^{K} L\,\|x(k)-x(k-1)\|^2.$$
It has been shown that b ≤ c and d ≤ e. Therefore, to find an upper bound for Eq. (8), we use the condition that if $\frac{e-d}{c-b} \le \frac{d}{b}$, then we obtain:
$$\frac{b}{d} \le \frac{c}{e}.$$
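So we obtain the bounds of the right hand side of (8) as:
$$\begin{aligned}
\frac{\left(\|x(1)-x(0)\| + \|x(2)-x(1)\|\right)^2}{\sum_{k=1}^{K} (x(k)-x(k-1))^T \left(\nabla f(x(k)) - \nabla f(x(k-1))\right)}
&\le \frac{2\left(\|x(1)-x(0)\|^2 + \|x(2)-x(1)\|^2\right)}{\sum_{k=1}^{K} L\,\|x(k)-x(k-1)\|^2} \\
&\le \frac{2\left(\|x(1)-x(0)\|^2 + \|x(2)-x(1)\|^2\right)}{L \sum_{k=1}^{K} \left(\|x(k)\|^2 + \|x(k-1)\|^2 - 2x(k)^T x(k-1)\right)} \\
&\le \frac{2\left(\|x(1)-x(0)\|^2 + \|x(2)-x(1)\|^2\right)}{L \sum_{k=1}^{K} \left(\|x(k)\|^2 + \|x(k-1)\|^2\right)}.
\end{aligned}$$
If we let the left hand side of Eq. (8) be represented by
$$P = \sum_{k=1}^{K} \alpha(k),$$
and we let the right hand side of Eq. (8) be denoted as
$$Q = \frac{\left(\|x(1)-x(0)\| + \|x(2)-x(1)\|\right)^2}{\sum_{k=1}^{K} (x(k)-x(k-1))^T \left(\nabla f(x(k)) - \nabla f(x(k-1))\right)},$$
and similarly let the derived upper bound of Q be given by
$$Z = \frac{2\left(\|x(1)-x(0)\|^2 + \|x(2)-x(1)\|^2\right)}{L \sum_{k=1}^{K} \left(\|x(k)\|^2 + \|x(k-1)\|^2\right)}.$$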
From the above analysis, we observe that P ≥ Q and Q ≤ Z. Therefore, if P = min(P, Z), then we can deduce that P ≤ Z. By the established relationship between P and Z and also using the triangle inequality, we obtain the bound for using the first BB step size as:
$$\sum_{k=1}^{K} \alpha(k) \le \frac{2\left(\|x(1)-x(0)\|^2 + \|x(2)-x(1)\|^2\right)}{L \sum_{k=1}^{K} \|x(k)\|^2 + L \sum_{k=1}^{K} \|x(k-1)\|^2}.$$
By using the regret bound equation in (6), we obtain:
$$R(K) \le D^2\,\frac{1}{2\alpha(K)} + \frac{\|\nabla f_m\|^2}{2}\,\Psi,$$
where
$$\Psi = \frac{2\left(\|x(1)-x(0)\|^2 + \|x(2)-x(1)\|^2\right)}{L \sum_{k=1}^{K} \|x(k)\|^2 + L \sum_{k=1}^{K} \|x(k-1)\|^2}.$$
The average regret over K time steps can then be expressed as
$$\frac{R(K)}{K} \le D^2\,\frac{1}{2K\alpha(K)} + \frac{\|\nabla f_m\|^2}{2K}\,\Psi.$$
Since D is constant based on its value in (6), and $\|\nabla f_m\|^2$ is also constant, we conclude that the average regret $\lim_{K\to\infty} R(K)/K$ approaches 0.

Next, we consider the performance of Algorithm 1 using the second BB step size in Eq. (5).

Theorem 2. Consider Problem (1) and let Algorithm 1 be used to solve Problem (1), where
$$\alpha(k) = \frac{s(k-1)^T y(k-1)}{y(k-1)^T y(k-1)},$$
and L is the maximum of all Lipschitz continuity parameters of all gradients of the cost function in Problem (1). Then the regret is upper bounded by
$$R(K) \le D^2\,\frac{1}{2\alpha(K)} + \frac{\|\nabla f_m\|^2}{2}\,\zeta,$$
where
$$\zeta = \left(\sum_{k=1}^{K} (A(k)^T)^2\right)^{1/2} \left(\sum_{k=1}^{K} (B(k))^2\right)^{1/2} \left(\sum_{k=1}^{K} (C(k))^2\right)^{1/2},$$
and the average regret $\lim_{K\to\infty} R(K)/K$ approaches 0.
Proof. The approach to proving Theorem 2 will be similar to that of Theorem 1, where we will obtain bounds for the aggregate sum of the step sizes in R(K) and use the generalized bound for the online gradient descent algorithm. In this case, the aggregate sum of the step sizes is expressed as
$$\sum_{k=1}^{K} \alpha(k) = \sum_{k=1}^{K} \frac{s(k-1)^T y(k-1)}{y(k-1)^T y(k-1)}.$$
By using the relationships
$$s(k-1) = x(k) - x(k-1) \quad \text{and} \quad y(k-1) = \nabla f(x(k)) - \nabla f(x(k-1)),$$
by noting that $y(k-1)^T y(k-1) = \|y(k-1)\|^2$, and by expressing the summand as a product of three different functions, we obtain:
$$\sum_{k=1}^{K} \alpha(k) = \sum_{k=1}^{K} \left( (x(k)-x(k-1))^T \left(\nabla f(x(k)) - \nabla f(x(k-1))\right) \|\nabla f(x(k)) - \nabla f(x(k-1))\|^{-2} \right). \tag{9}$$
For the purpose of clarity, let
$$A(k) = x(k) - x(k-1), \qquad B(k) = \nabla f(x(k)) - \nabla f(x(k-1)), \qquad \text{and} \qquad C(k) = \|\nabla f(x(k)) - \nabla f(x(k-1))\|^{-2}.$$
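Applying the Cauchy-Schwarz inequality to the right hand side of Eq. (9), we obtain that:
$$\begin{aligned}
\sum_{k=1}^{K} \alpha(k) &= \sum_{k=1}^{K} \left(A(k)^T B(k)\right) C(k) \\
&\le \sum_{k=1}^{K} \left((A(k)^T)^2 (B(k))^2\right) \left(\sum_{k=1}^{K} (C(k))^2\right)^{1/2} \\
&\le \left(\sum_{k=1}^{K} (A(k)^T)^2\right)^{1/2} \left(\sum_{k=1}^{K} (B(k))^2\right)^{1/2} \left(\sum_{k=1}^{K} (C(k))^2\right)^{1/2}.
\end{aligned}$$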
Applying the generalized regret bound as seen in Eq. (6), we obtain the regret R(K) as:
$$R(K) \le D^2\,\frac{1}{2\alpha(K)} + \frac{\|\nabla f_m\|^2}{2}\,\zeta,$$
where the value of ζ is the upper bound of $\sum_{k=1}^{K} \alpha(k)$ obtained above after applying the Cauchy-Schwarz inequality, and it is given by:
$$\zeta = \left(\sum_{k=1}^{K} (A(k)^T)^2\right)^{1/2} \left(\sum_{k=1}^{K} (B(k))^2\right)^{1/2} \left(\sum_{k=1}^{K} (C(k))^2\right)^{1/2}.$$
Therefore the average regret is
$$\frac{R(K)}{K} \le D^2\,\frac{1}{2K\alpha(K)} + \frac{\|\nabla f_m\|^2}{2K}\,\zeta.$$
Furthermore, since D is constant based on its value in (6), and the terms A(k), B(k) and C(k) are also positive, we conclude that the average regret $\lim_{K\to\infty} R(K)/K$ approaches 0.

The Barzilai-Borwein step size in the gradient-based Algorithm 1 results in a regret that grows sublinearly in time and yields an average regret of zero as time K goes to infinity.
5 Conclusions
In this work, an online Barzilai-Borwein quasi-Newton algorithm is analyzed within the regret framework to show the usefulness of quasi-Newton methods for large-scale and computationally intensive optimization problems. The analysis for both Barzilai-Borwein step sizes showed that the regret of the algorithm grows sublinearly in time and that the average regret approaches zero. The use of the generalized regret bounds for online gradient descent introduced in [13] simplified the analyses. In future work, a regret analysis in a dynamic scenario will be presented for online quasi-Newton methods using the Barzilai-Borwein and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithms. Another interesting optimization method with a fast convergence property is the conjugate gradient method. It should perform well in an online optimization problem, but it is unknown whether it will be superior to most online optimization algorithms. Applying the conjugate gradient method to improve the convergence rates of online optimization problems is therefore an open research topic.
References
1. Barzilai, J., Borwein, J.M.: Two-point step size gradient methods. IMA J. Num. Anal. 8(1), 141–148 (1988)
2. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, New York (2004)
3. Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Opt. 26(2), 1008–1031 (2016)
4. Byrd, R.H., Nocedal, J., Yuan, Y.-X.: Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Num. Anal. 24(5), 1171–1190 (1987)
5. Chen, T., Ling, Q., Giannakis, G.B.: An online convex optimization approach to proactive network resource allocation. IEEE Trans. Sig. Process. 65(24), 6350–6364 (2017)
6. Dai, Y.-H.: A new analysis on the Barzilai-Borwein gradient method. J. Oper. Res. Soc. China 1(2), 187–198 (2013)
7. Dai, Y.-H., Fletcher, R.: Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Num. Math. 100(1), 21–47 (2005)
8. Dai, Y.-H., Liao, L.-Z.: R-linear convergence of the Barzilai and Borwein gradient method. IMA J. Num. Anal. 22(1), 1–10 (2002)
9. Eisen, M., Mokhtari, A., Ribeiro, A.: Decentralized quasi-Newton methods. IEEE Trans. Sig. Process. 65(10), 2613–2628 (2017)
10. Emiola, I., Adem, R.: Comparison of optimization methods with application to a network containing malicious agents. arXiv preprint arXiv:2101.10546 (2021)
11. Emiola, I., Njilla, L., Enyioha, C.: On distributed optimization in the presence of malicious agents. arXiv preprint arXiv:2101.09347 (2021)
12. Gill, P.E., Murray, W.: Quasi-Newton methods for unconstrained optimization. IMA J. Appl. Math. 9(1), 91–108 (1972)
13. Hazan, E., et al.: Introduction to online convex optimization. Found. Trends Optim. 2(3–4), 157–325 (2016)
14. Lesage-Landry, A., Taylor, J., Shames, I.: Second-order online nonconvex optimization. IEEE Trans. Autom. Cont. (2020)
15. Li, D.-H., Fukushima, M.: On the global convergence of the BFGS method for nonconvex unconstrained optimization problems. SIAM J. Optim. 11(4), 1054–1064 (2001)
16. Mahdavi, M., Jin, R., Yang, T.: Trading regret for efficiency: online convex optimization with long term constraints. J. Mach. Learn. Res. 13(Sep), 2503–2528 (2012)
17. McMahan, B., Streeter, M.: No-regret algorithms for unconstrained online convex optimization. In: Advances in Neural Information Processing Systems, pp. 2402–2410 (2012)
18. Mokhtari, A., Shahrampour, S., Jadbabaie, A., Ribeiro, A.: Online optimization in dynamic environments: improved regret rates for strongly convex problems. In: 2016 IEEE 55th Conference on Decision and Control (CDC), pp. 7195–7201. IEEE (2016)
19. Nesterov, Y.: Introductory Lectures on Convex Programming, Volume I: Basic Course. Lect. Notes 3(4), 5 (1998)
20. Schraudolph, N.N., Yu, J., Günter, S.: A stochastic quasi-Newton method for online convex optimization. In: Artificial Intelligence and Statistics, pp. 436–443 (2007)
21. Sedrakyan, H., Sedrakyan, N.: Algebraic Inequalities. Springer, New York (2018)
22. Shalev-Shwartz, S., Kakade, S.M.: Mind the duality gap: logarithmic regret algorithms for online optimization. In: Advances in Neural Information Processing Systems, pp. 1457–1464 (2009)
23. Sharma, P., Khanduri, P., Shen, L., Bucci, D.J., Jr., Varshney, P.K.: On distributed online convex optimization with sublinear dynamic regret and fit. arXiv preprint arXiv:2001.03166 (2020)
24. Sofi, A.Z.M., Mamat, M., Ibrahim, M.A.H.: Reducing computation time in DFP (Davidon, Fletcher & Powell) update method for solving unconstrained optimization problems. In: AIP Conference Proceedings, vol. 1522, pp. 1337–1345. AIP (2013)
25. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems, pp. 2510–2518 (2014)
26. Yi, X., Li, X., Xie, L., Johansson, K.H.: Distributed online convex optimization with time-varying coupled inequality constraints. IEEE Trans. Sig. Process. 68, 731–746 (2020)
27. Zhang, Y., Ravier, R.J., Zavlanos, M.M., Tarokh, V.: A distributed online convex optimization algorithm with improved dynamic regret. In: 2019 IEEE 58th Conference on Decision and Control (CDC), pp. 2449–2454. IEEE (2019)
28. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the 20th International Conference on Machine Learning (ICML 2003), pp. 928–936 (2003)
Fluid Dynamics of a Pandemic in a Spatial Social Network: A Reflective Measure of the Spreading
Saad Alqithami
Computer Science Department, Albaha University, Al Bahah, Saudi Arabia
Abstract. The understanding of a society has changed over the past few decades from the observation of a brick-and-mortar structure to a more evolving network of interconnectivity amongst its members. We were able to observe such evolution closely in the recent spreading of the COVID-19 pandemic. People's spatial interactions have a significant impact on disease transmission and on the further spread of the infection. Even though social precautions were recommended, and in a few cases enforced by local governments and agencies for infection control, we faced and are still living through an unprecedented spreading of the virus and its many resultant metamorphisms. In light of this, the article strives to understand the spreading of the virus starting from a case zero within a finite graph in an environment to a susceptible, future infected agent, utilizing a simplified game theoretical approach. Argumentation between agents was simulated as a means of spatial interaction for a better reflection of an agent going through the states of the epidemiological SEI compartmental model within different waves of transmission flow.
Keywords: Multiagent systems · Graph theory · Game theory · Link prediction
1 Introduction
Agents dwell in an open environment and interact with one another for a perhaps partially common interest. Spatial interaction, where agents physically connect with others, can be a result of trading, bargaining or even self-reliance that reflects a dependency-based spatial social network. Transmission over the spatial network is imminent for other unexpected things as long as the suitable environment and the time constraint are met, e.g., germs and viruses transmit in the same way. Recently, a new strain of virus spread around the world, affecting millions of people and resulting in many losses. It has been recognized as a global pandemic by the World Health Organization (WHO) and named SARS-CoV-2 or 2019 Corona Virus Disease, i.e., "COVID-19" [14]. Human-to-human transmission has been reported with clear clinical sickness and fatalities among healthcare workers [20], and now a continually increasing number of infected people have been reported dead [14].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 513–525, 2022. https://doi.org/10.1007/978-3-030-82193-7_34
Transmission of diseases has been a long-standing puzzle for centuries. Observation of the recent wave of the pandemic and the subsequent metamorphisms of the virus has led the scientific community to tackle the issue from different fields of research in order to comprehend its rapid propagation and to find a quicker way to overcome the virus with the least possible consequences [1]. Although we were, and still are, far behind in our scientific progress to be prepared for or even handle such waves, we strive in this article to propose a simplified model that reflects the spatial transmission of the disease from one agent to another, which may contribute to the understanding of the wide spreading of the virus and help in economizing the distribution of resources (e.g., immunization and face masks) to the people most in need. Throughout the article, a society is simplified as a heterogeneous open multiagent system structured through coalition formation, where agents are any beings hosting and transmitting the virus, whether humans, animals, birds or even insects. Coalition formation reflects cooperation that structures a heterogeneous society, which is a consequence of natural dependency and not just the self-reliance of one agent on another. Argumentation amongst individual heterogeneous agents within a coalition is the basis and the initiation of interaction. A heterogeneous population is categorized at random into a finite number of coalitions, each with its own model and interaction. Each coalition is modeled as an open multiagent system in order to bring out the emergence of behavioral patterns as a set of rules of interaction amongst other coalitions and the agents themselves.
The groundwork on deterministic compartmental models allows agent-based modeling to be used as a reflection of the intermediate nodes that the spreading benefits from, on a base of spatial network models presenting the dynamics of ever-changing hyper-interactions. With the different levels of network structure ranging from macro through meso to micro, the network model moves from one-mode to bipartite network models. Statistical integration of epidemic models showed the need for a more developed framework for bipartite network modeling that derives the key bipartite statistics for contact-free pathogen epidemics, e.g., fomite transmission. This highlights the research question of this article: How are diseases actually transmitted in space and time at the intermediate population and environmental spatial scale? The article considers one way to show how the infection arrives at a certain susceptible agent through some intermediate agents following the best approximate path. It is mainly about showing the possibility of using some game theoretical strategies in order to find a better payoff among different agents inside the social network, i.e., a high susceptibility to being infected. The payoffs for any individual in the game depend on two things: the distance between an agent as a source and the possibly infected destination, and the distance between it and its colleagues, as detailed throughout this paper. In the first section, the history of social networks and some previous related work are presented. In sections two and three, we consider one simple example to explain the way of choosing a better payoff. Finally, in the fourth section, future work is presented along with some final statements.
2 Literature Review
The emergence of coalition formation and cooperation among agents is impacted by the underlying social structure of interactions that enables, constrains and limits transmission [11,13]. Social traits, including transmission in homologous structures, are a result of cooperative interaction that constructs a cohesive coalition [4]. Intra-coalition interactions are considered bonding measures that reflect frequent economic and social operations, while inter-coalition communications are bridging measures for agile adaptation and accessibility to critical resources across different coalitions [8,10,18]. Social structures are influenced by repeated patterns of interaction led by agents working on problems that are complex in scope or impact [16], which may affect the performance of the individual as well as of the society [3,19,21]. Game theory is mainly about studying mathematically the interactions among individuals inside certain games [12]. In many games, the players work together with their neighbors directly or indirectly, especially in endogenous networks [9]. According to [2], game theory was first used in addressing "zero sum" games, where one player's gains corresponded to the total loss of another player. Currently, this theory is applied in a broader range of fields, including computer science. Decision theory is a game against nature, which is used by a self-interested agent to make optimal decisions in an uncertain environment. Nature commonly behaves randomly. When the opponents are also independent and self-interested agents, a multi-person decision theory can be considered. Self-interested persons are not "selfish", but they have an ego-centric perspective. Emotions are not considered. However, the developers of game theory take rationality to mean self-regard, which is the premise of game theory. Argumentation in game theory is defined as a game where the payoffs are determined in terms of probability.
Each player, from his or her argumentative point of view, aims to increase or decrease the payoffs of the others by using several strategies. These strategies help resolve the problems of controversial perceptions among the players. The arguments among n players should be introduced in order to obtain the corresponding consequences of the argumentative game. This introduces many interaction probabilities among agents on the way to the best results [17]. Argumentative games, in general, have supporting and opposing arguments. It is therefore necessary to find a factor that acts as an intermediary between them. Such a factor can balance the supporting and opposing arguments in order to find a significant result. Some researchers have found that reasoning in an argumentative game can provide this balance. This reasoning illustrates the way agents interact. In fact, it depends on some major aspects, which are belief revision, argumentation for deliberation, and means-ends reasoning. In the first, players try to extend the framework of belief revision based on defeasible logic programs. The second depends on tasks of deliberation, and the last is means-ends reasoning, which allows arguments to use preferences based on a changing context [15].
3 Methodology
We restrict our observation of the infection possibility to actual, physical contact, where the social transmission of the disease is tracked with higher accuracy. This method supports contact tracing apps. From an epidemiological point of view, we consider any agent within a population to potentially be in one of the states of the SEI (Susceptible-Exposed-Infectious) compartmental model. We use a matrix of the original spreading in an already infected region, i.e., a sub-population with already reported cases. We set a time period Δ and run an analysis for link prediction at each time frame to find out the possibility of tracing back the origin. We mirror the original matrix with another matrix within the new environment. We use a game theoretical approach to understand the possible spreading of the virus to the new environment. This will help in predicting possibly infected people and segments of the population for a better distribution of immunization and of resources for treating infected patients at an early stage. Figure 1 illustrates the process of this paper, from the consideration of a new heterogeneous environment through modeling, analyzing and mirroring to presenting classified coalition member agents within the SEI compartmental model.
3.1 Preliminaries
The coalition is set out to be a network graph denoted as G(V, E), where V is the set of agents and E represents the set of interactions. A = {a_ij} is the adjacency matrix of G. If there is an interaction between agent i and agent j, then a_ij = 1, while if there is no interaction between agent i and agent j, then a_ij = 0. The total number of agents in the network graph is denoted as $N \in \mathbb{R}_{+}^{V}$. Interaction is a result of an agent's behavior, which is the settlement of an argument to gain a certain payoff. Definition 1 models an agent's argument from interaction within a coalition.
Definition 1. An argument $ARG_{vv'}[t_0, t]$ between agents $v$ and $v'$ on the time interval $[t_0, t]$ is a set of concurrent interactions $I_k$ executed between the two agents $v$ and $v'$ in the interval between $t_0$ and $t$, with $0 \le t_0 \le t$:
$$ARG_{vv'}[t_0, t] = \{ I_k \in E : \exists\, t' \in [t_0, t] \ \text{s.t.}\ \mathrm{Concurrent}(v, I_k, t') \},$$
where $E = \bigcup_{I_i = I_1}^{I_n} ARG_{I_i}$ is the set of all interactions and $\mathrm{Concurrent}(v, I_k, t')$ is the concurrent interaction of agent $v$ at time $t'$. An interaction is said to contribute to a specific argument if it is confirmed to conform to the coalition formation guidelines by the judging agent at time $t'$.
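The adjacency matrix A of G(V, E) and the argument set of Definition 1 can be sketched as follows. This is only an illustration under stated assumptions: interactions are taken to be undirected, each interaction is represented by a hypothetical tuple `(agent_a, agent_b, time, label)`, and the function names `adjacency_matrix` and `argument` are not from the paper.

```python
import numpy as np

def adjacency_matrix(n_agents, interactions):
    """Build A = {a_ij}: a_ij = 1 if agents i and j interact, else 0."""
    A = np.zeros((n_agents, n_agents), dtype=int)
    for i, j in interactions:
        A[i, j] = 1
        A[j, i] = 1          # assumption: interactions are undirected
    return A

def argument(events, v, v_prime, t0, t):
    """Labels of interactions between v and v_prime with time stamp in [t0, t].

    `events` is a hypothetical list of (agent_a, agent_b, time, label) tuples.
    """
    pair = {v, v_prime}
    return [label for a, b, when, label in events
            if {a, b} == pair and t0 <= when <= t]

A = adjacency_matrix(3, [(0, 1), (1, 2)])
events = [(0, 1, 1.0, "I1"), (0, 1, 5.0, "I2"), (1, 2, 2.0, "I3")]
arg_01 = argument(events, 0, 1, t0=0.0, t=3.0)
```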
[Figure 1: flowchart with three stages, (a) Modeling, (b) Analyzing and (c) Mirroring. Box labels: Populated Heterogeneous Environment; Model a Coalition as an Open Multiagent System [1]; Model of Coalition Formation; Model of Social Structure from Spatial Interaction; Model of Argumentation within Coalition; Game Theoretical Measure for the Short Spatial Distancing; Modeled Coalition with Spatial Argumentation; New Populated Environment as an Adjacency Matrix [1]; Mirror Current Spreading with a New Random Coalition; Identify New SEI within Uninfected Coalition; Classified Agent within Coalitions with SEI Model.]
Fig. 1. Steps to illustrate the process from a natural environment to a mirrored coalition with classified SEI agents.
3.2 Argumentative Game Theoretical Approach in Social Network
To build a social network among the individuals inside the system, we should figure out the connectivity among them and how they bind into such a social group. Between any two individuals inside the game, there is an interaction that uses the base game of the social network, which is built from frameworks applied to a two-player game [5]. For further clarification, we illustrate in this section one example to show how the individuals are connected with each other socially inside the network. Consider the first individual (agent) as the source, with the aim of finding the shortest path to the final destination, passing through some intermediate destinations that are not the aim. Suppose the next matrix, presented in Fig. 2, is one of the source-destination comparisons of the payoffs.

(a) An extensive tree form. [Tree diagram: Carrier and Susceptible each choose Positive or Negative; leaf payoffs (+1, +1), (+1, 0), (0, +1) and (0, 0).]

(b) A normal form or a payoff matrix.

                          Susceptible
                      Positive    Negative
Carrier   Positive    +1, +1      +1, 0
          Negative    0, +1       0, 0

Fig. 2. An abstract example of an agent carrying the virus engaging in argumentative interactions with a susceptible agent in both (a) an extensive-tree form and (b) a normal-matrix form of a 2-player and 2-strategy game.
As the previous diagram shows, a strict pure-strategy Nash equilibrium is easy to obtain in this setting, because each agent has a best response regardless of the strategy chosen by the other [12]. Besides, since it is the only way to reach the destination through the shortest path at the current moment, whether the destination gets a fair payoff or not, the aim is to let the source reach the destination with a better payoff (the shortest path). Thus, the destination payoffs will be ignored from now on.
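The strict pure-strategy Nash equilibrium of the 2-player, 2-strategy game in Fig. 2 can be checked by enumeration. The sketch below encodes the normal-form payoffs as (carrier, susceptible) pairs; the names `PAYOFFS`, `STRATEGIES` and `strict_pure_nash` are hypothetical.

```python
from itertools import product

STRATEGIES = ("Positive", "Negative")

# Payoffs (carrier, susceptible) read from the normal form in Fig. 2,
# keyed by (carrier strategy, susceptible strategy).
PAYOFFS = {
    ("Positive", "Positive"): (1, 1),
    ("Positive", "Negative"): (1, 0),
    ("Negative", "Positive"): (0, 1),
    ("Negative", "Negative"): (0, 0),
}

def strict_pure_nash(payoffs, strategies):
    """Return profiles where each player's strategy is the unique best response."""
    equilibria = []
    for row, col in product(strategies, repeat=2):
        u_row, u_col = payoffs[(row, col)]
        row_strict = all(u_row > payoffs[(r, col)][0] for r in strategies if r != row)
        col_strict = all(u_col > payoffs[(row, c)][1] for c in strategies if c != col)
        if row_strict and col_strict:
            equilibria.append((row, col))
    return equilibria

equilibria = strict_pure_nash(PAYOFFS, STRATEGIES)
```

With these payoffs each player's payoff depends only on its own choice, so "Positive" strictly dominates for both and the enumeration returns a single profile.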
To explain the current situation as simply as possible, we assume that the social network, which is usually built of n agents, contains five agents named alphabetically in sequence from A through E [12]. Moreover, given the expected paths among the different agents, the goal is to reach the assigned destination from the source, going through any agents in the middle and following the shortest path, considering the scope of any agent's current location or coordinates relative to its neighbors. The same can be found in many social games where the players work together with their neighbors directly or indirectly, especially in endogenous networks [9]. Consider the same previous example, and suppose that the agents in the scope of the currently chosen agent are marked with 1, and with 0 otherwise. This is unlike the negative payoffs used in [7], where they indicate that bad payoffs correspond to long transmissions; we address that problem by considering the range. The transmission range (TR) between any two agents can be calculated from the coordinates of their positions using the equation:
$$TR = (x_2 - x_1) + (y_2 - y_1). \tag{1}$$
After fixing certain coordinates for simplicity, the next matrix is built considering the range:
[Figure: weighted transmission graph over the agents A to F.]
Fig. 3. An illustrative example of spatial transmission from a carrier to a susceptible agent.
From the previous matrix, presented in Table 1, it is obvious that some of the source's payoffs (A in this instance) with the other agents in the network are useless, and that a transmission between them may even be harmful. Besides, some of the transmissions to other agents within range might not help in reaching the destination by the shortest path, because the base game between the two players (source and destination) might not link the original source with the final destination in a better way; in the worst case, it might lead to a dead end, which will stop the transmission from occurring at
S. Alqithami
Table 1. Adjacency matrix of the weighted transmission graph presented in Fig. 3.

     A  B  C  D  E  F
A    0  1  0  0  0  0
B    0  0  1  1  1  0
C    0  0  0  1  0  0
D    0  0  0  0  0  1
E    0  0  0  0  0  0
F    0  0  0  0  0  0
that moment. Nonetheless, the pseudocode presented in Algorithms 1, 2, and 3 shows the process of direct or indirect transmission and addresses the problem mentioned above: the virus finds its next host by following the highest payoff/possibility, which results from following the shortest path.

Algorithm 1: To Find the Next-to-be-Infected Agent Directly.
Result: The nearest agent for the disease to be transmitted to.
initialization;
Let i, j and k be counters that start from 0, and let n be the number of agents;
while i : 0 → n do
    while j : 0 → n do
        if i ≠ j then
            TR = (x2 − x1) + (y2 − y1);
            if TR ≤ t and neighbors[i][j] = 0 then
                neighbors[i][j] = 1;
                Show the susceptible agent and its distance;
            else
                neighbors[i][j] = 0;
            end
        end
        j += 1;
    end
    i += 1;
end
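The direct-neighbor search of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the function name `direct_neighbors`, the coordinate list `positions`, and the example coordinates are hypothetical, and the Euclidean distance is assumed for the transmission range, because Eq. (1) as printed sums signed coordinate differences and so does not behave like a distance.

```python
import math


def direct_neighbors(positions, t):
    """Return an n x n 0/1 matrix whose entry [i][j] is 1 when agent j
    lies within transmission range t of agent i (and i != j)."""
    n = len(positions)
    neighbors = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # an agent is never its own neighbor
            # Assumption: Euclidean distance stands in for TR.
            tr = math.dist(positions[i], positions[j])
            if tr <= t:
                neighbors[i][j] = 1
    return neighbors


# Hypothetical coordinates for three agents and a hypothetical range t.
example = direct_neighbors([(0, 0), (1, 0), (5, 5)], t=1.5)
```

With a symmetric distance the resulting matrix is symmetric, whereas Table 1 is directed, so a faithful reproduction of Table 1 would need the direction of each transmission as an extra input.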
Algorithm 2: To Find the Next-to-be-Infected Agent through a Susceptible One.
Result: The next agent for the disease to be transmitted to.
initialization;
Let i, j, k and m be counters that start from 0, and let n be the number of agents;
for j = 0 → n do
    k = k + neighbors[0][j];   // k: counter to exit the loop
    if k = 0 then
        the source cannot reach any path;
    else
        if neighbors[0][n − 1] = 1 then
            susceptible infection is direct between patient and destination;
        else
            while i : 0 → n do
                while m : 0 → n do
                    if neighbors[m][n − 1] = neighbors[0][m] = 1 and k = 0 then
                        Susceptible infection of [n − 1] is through [m];
                        k += 1;
                    end
                    m += 1;
                end
                i += 1;
            end
        end
    end
    j += 1;
end
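The idea behind Algorithm 2 (a direct link if one exists, otherwise a single intermediary) can be sketched in Python using the adjacency matrix of Table 1. This is an illustrative sketch only: the function name `one_hop_routes` and its return labels are hypothetical, and the source and destination are passed as parameters, whereas Algorithm 2 fixes them to the first and last agents.

```python
# Adjacency matrix from Table 1; rows and columns are agents A..F.
NEIGHBORS = [
    [0, 1, 0, 0, 0, 0],  # A
    [0, 0, 1, 1, 1, 0],  # B
    [0, 0, 0, 1, 0, 0],  # C
    [0, 0, 0, 0, 0, 1],  # D
    [0, 0, 0, 0, 0, 0],  # E
    [0, 0, 0, 0, 0, 0],  # F
]


def one_hop_routes(neighbors, source, dest):
    """Classify how `source` reaches `dest` with at most one intermediary."""
    if not any(neighbors[source]):
        return "unreachable", []  # the source has no outgoing link at all
    if neighbors[source][dest] == 1:
        return "direct", []
    # Every agent m that the source can reach and that reaches the destination.
    via = [m for m in range(len(neighbors))
           if neighbors[source][m] == 1 and neighbors[m][dest] == 1]
    return ("via", via) if via else ("none", [])
```

For example, with A as index 0 and D as index 3, `one_hop_routes(NEIGHBORS, 0, 3)` reports a route through B (index 1), while A cannot reach F (index 5) with a single intermediary.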
Algorithm 3: To Find the Next-to-be-Infected Agent through One or More Susceptible Ones.
Result: The upcoming agents for the disease to be transmitted to.
initialization;
Let i, j and k be counters that start from 0, and let n be the number of agents;
while i : 0 → n do
    while j : 0 → n do
        if neighbors[j][n − 1] = 1 and neighbors[0][j] = 1 then
            for s = 0 → n do
                if neighbors[s][j] = 1 then
                    if neighbors[s][j] = neighbors[0][s] and k = 0 then
                        Susceptible infection of [n − 1] is through [s] then [j];
                        k += 1;
                    end
                end
                s += 1;
            end
        end
        j += 1;
    end
    i += 1;
end
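Algorithm 3 looks for a chain of intermediaries between the source and the destination. One general way to find the shortest such chain for any number of intermediaries is breadth-first search, which is swapped in here for the nested loops of Algorithm 3 and is not the author's procedure. The function name `shortest_chain` is hypothetical; the matrix is the one given in Table 1.

```python
from collections import deque

# Adjacency matrix from Table 1; rows and columns are agents A..F.
NEIGHBORS = [
    [0, 1, 0, 0, 0, 0],  # A
    [0, 0, 1, 1, 1, 0],  # B
    [0, 0, 0, 1, 0, 0],  # C
    [0, 0, 0, 0, 0, 1],  # D
    [0, 0, 0, 0, 0, 0],  # E
    [0, 0, 0, 0, 0, 0],  # F
]


def shortest_chain(neighbors, source, dest):
    """Breadth-first search: return the list of agent indices on a shortest
    transmission chain from source to dest, or None if dest is unreachable."""
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == dest:
            path = []
            while current is not None:  # walk back to the source
                path.append(current)
                current = parent[current]
            return path[::-1]
        for nxt, linked in enumerate(neighbors[current]):
            if linked == 1 and nxt not in parent:
                parent[nxt] = current
                queue.append(nxt)
    return None


labels = "ABCDEF"
chain = shortest_chain(NEIGHBORS, 0, 5)
readable = [labels[i] for i in chain]  # ['A', 'B', 'D', 'F']
```

On the graph of Table 1 the chain from A to F passes through B and then D, which is the "through [s] then [j]" case that Algorithm 3 reports.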
4 Illustrative Example
For further illustration of this method, a simple example of a coalition and a continual agent argumentation has been simulated using the NetLogo platform (https://ccl.northwestern.edu/netlogo/). NetLogo is a Java-based, cross-platform testbed that helps in simulating natural and social phenomena, so as to better reflect a real multi-agent modeling environment. The players in the coalition game vary in payoffs according to their available strategies, which are ordered from 1 to 9. Continuing the previous example, suppose that the agents within the range of the currently chosen agent are marked by numbers, unlike the negative payoffs used in [6] to indicate that long transmissions give bad payoffs. Such a problem can thus be overcome by considering the range. Since the aim is to examine the level of the players' payoffs, we have implemented a number of agents that move randomly within a certain area and argue with any other agents within their specified range, as in Fig. 4. Figure 4 shows that each agent is labeled with its payoffs, which depend on the argumentative game it plays with other agents. The plot on the left shows the average payoff among the different players, which represents the variety of payoffs for each strategy an agent picks. It rises when the agents inside the game argue with each other, and it gets higher with higher payoffs. It usually plateaus after it reaches the threshold, which means that the agent is no longer susceptible to the transmission of the disease but is already infected. After the new method is implemented inside the game, it becomes possible to introduce more strategies among the players. It allows more than two players to argue with each other and to have different payoffs.

Fig. 4. Abstract simulation of the spatial environment using NetLogo.
5 Conclusion
Because spatial interaction is a social concern as one of the main parameters in the transmission of COVID-19, this paper outlined a simplified method to reflect the connections within a society. The method treats the shortest path as the better payoff, which prompts the transmission of the virus to arrive at a host in an unfortunately short time. Besides, the individuals inside a specific network are able to connect with any other member of the spatial network, either directly or through common agents, as addressed throughout the paper. The proposed methodology can be extended to other applications, such as argumentation. It allows us to extend our vision beyond the traditional two-player game. A sample argumentative game has been implemented in order to evaluate the proposed approach. It appears that using this game-theoretic method will help the players navigate inside the spatial network and allow them to include more players and strategies in their current games.
Acknowledgment. This work was funded by the Deanship of Scientific Research at Albaha University, Saudi Arabia [Grant number: 1441/10]. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the Deanship of Scientific Research or Albaha University.
References
1. Alqithami, S.: A generic encapsulation to unravel social spreading of a pandemic: an underlying architecture. Computers 10(1), 12 (2021)
2. Aumann, R.J.: Game theory. In: Milgate, M., Newman, P. (eds.) The New Palgrave Dictionary of Economics (1987)
3. Barabási, A.-L., et al.: Network Science. Cambridge University Press, New York (2016)
4. Borgatti, S.P., Foster, P.C.: The network paradigm in organizational research: a review and typology. J. Manag. 29(6), 991–1013 (2003)
5. Davis, J.R., et al.: Equilibria and efficiency loss in games on networks. Internet Math. 7(3), 178–205 (2011)
6. Easley, D., et al.: Networks, Crowds and Markets, vol. 8. Cambridge University Press, Cambridge (2010)
7. Easley, D., et al.: Networks, crowds, and markets: reasoning about a highly connected world. Significance 9(1), 43–44 (2012)
8. Ekbia, H.R., Kling, R.: Network organizations: symmetric cooperation or multivalent negotiation? Inf. Soc. 21(3), 155–168 (2005)
9. Hojman, D.A., Szeidl, A.: Endogenous networks, social games, and evolution. Games Econ. Behav. 55(1), 112–130 (2006)
10. Hovorka, D.S., Larsen, K.R.: Enabling agile adoption practices through network organizations. Eur. J. Inf. Syst. 15(2), 159–168 (2006)
11. Kissler, S.M., Tedijanto, C., Goldstein, E., Grad, Y.H., Lipsitch, M.: Projecting the transmission dynamics of SARS-CoV-2 through the postpandemic period. Science 368(6493), 860–868 (2020)
12. Leyton-Brown, K., Shoham, Y.: Essentials of game theory: a concise multidisciplinary introduction. Synth. Lect. Artif. Intell. Mach. Learn. 2(1), 1–88 (2008)
13. Michalska-Smith, M.J., Allesina, S.: Telling ecological networks apart by their structure: a computational challenge. PLoS Comput. Biol. 15(6) (2019)
14. World Health Organization, et al.: Coronavirus disease 2019 (COVID-19): situation report 70 (2020)
15. Rahwan, I., Larson, K.: Argumentation and game theory. In: Simari, G., Rahwan, I. (eds.) Argumentation in Artificial Intelligence, pp. 321–339. Springer, Boston (2009). https://doi.org/10.1007/978-0-387-98197-0_16
16. Snow, C.C., Fjeldstad, Ø.D.: Network paradigm: applications in organizational science. In: International Encyclopedia of the Social & Behavioral Sciences, 2nd edn., vol. 16 (2015)
17. Thimm, M.: Strategic argumentation in multi-agent systems. KI-Künstliche Intell. 28(3), 159–168 (2014)
18. Van Alstyne, M.: The state of network organization: a survey in three frameworks. J. Organ. Comput. Elect. Commerce 7(2–3), 83–151 (1997)
19. Watts, D.J.: The "new" science of networks. Annu. Rev. Sociol. 30, 243–270 (2004)
20. Zhou, F., et al.: Clinical course and risk factors for mortality of adult inpatients with COVID-19 in Wuhan, China: a retrospective cohort study. Lancet 395(10229), 1054–1062 (2020)
21. Zhou, W., Duan, W., Piramuthu, S.: A social network matrix for implicit and explicit social network plates. Decis. Supp. Syst. 68, 89–97 (2014)
Affective Story-Morphing: Manipulating Shelley's Frankenstein under Program Control using Emotionally Intelligent Agents

Clark Elliott
College of Computing and Digital Media, DePaul University, 1 DePaul Center, Chicago, IL 60604, USA
[email protected]
Abstract. We present a theoretical model for the automated generation of plot-consistent, novel, engaging narratives based on a broad, computable model of emotion. We review the background theory, relevant to the morphing of narratives, composed of 28 emotion categories, 24 emotion intensity variables, and ~400 channels for emotion expression, which has been implemented in an AI program called the Affective Reasoner. We argue that what is primarily of interest in narratives is the emotion fabric present in the interaction between characters (much of which can be manipulated by the Affective Reasoner), and that while keeping the plot the same or similar we can, under computational control, create novel, interesting, consistent new stories that make sense to human observers. We present and explain preliminary examples and then apply the story-morphing techniques to a passage from Mary Shelley's Frankenstein.

Keywords: Affective computing · AI · Intelligent agents · Emotion · Stories · Narrative · Gaming AI
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 526–542, 2022. https://doi.org/10.1007/978-3-030-82193-7_35

1 Introduction and Motivation

This is a theoretical position paper arguing that we can create novel stories under computational control suitable for many contexts based on sound AI emotion reasoning principles. We first argue that using a highly-computable model of emotion allows us to extract an essential structure in stories which is independent of the narrative context. We then show that we can automatically manipulate these structures with emotionally intelligent story-morphing AI agents to produce novel emotion tapestries that are, themselves, the basis of good, new stories in the original setting. We introduce the basics of the emotion theory relative to this work, and a few illustrative examples, and then analyze a short section from Mary Shelley's Frankenstein, Or, The Modern Prometheus (Shelley 1998) to illustrate the techniques.

Narrative structure and stories are an important element of human cognition, embodying universal metaphors such as those for making a journey through the time sequence of related events (Lakoff 1993), and it can be argued they are an important part of
what Fodor calls The Language of Thought (Fodor 1975; Rescorla 2019), which lies at the symbol-processing core of what makes us human. In the realm of neuroscience, we see that we take in much of our permanent life experience in the form of meaningful episodes, which then later, and slowly, migrate from the hippocampus and elsewhere into collections of semantic meaning in the neocortex (McClelland et al. 1995). Stories from our lives are retold over and over again, and can even dominate the rest of our lives when reliving, for example, PTSD episodes. And even these sad, catastrophic conditions can sometimes be addressed through the re-telling of such stories (Gray and Bourke 2015). Decades ago, Bruner (1957) showed us that our very (phenomenological) perception is based, in part, on previous episodes in our lives that color how we see the world, and that as a result we know that even the later stages of the vision system itself are cognitively penetrable by these stored episodic memories.¹ In addition, Bruner later emphasized the importance of narrative in absorbing a culture's folk psychology (Bruner 1990). Stories are powerful in manipulating our belief systems as well: for example, when we repeatedly hear stories that are based on lies, our natural processing finds ways to integrate them into our worldview as being true (De Keersmaecker et al. 2020).

¹ With a tip of the hat to Pylyshyn's cognitively impenetrable early vision module (Pylyshyn 1999).

At an analytical level, various claims have been made about the essential nature of stories. For example, it is not uncommon to categorize stories based on themes such as good versus evil, love, and redemption (MasterClass 2021). Shank has argued that even the simplest story must have a point, and that a point is generated by a failure of expectation (Shank 1990). There are many theories of plot development (e.g., Kim et al. 2017). But computational models of world knowledge are still hugely lacking because of the overwhelming symbolic complexity of the real world.

While it is a noble goal and some progress has been made (Reagan 2017), sentiment analysis of pure text runs into the problem that without natural language understanding the clues in the ambiguous statements and utterances that humans traffic in remain opaque. By contrast, the orthogonal position we take in this paper is that the simplest story is generated when something happens, and someone cares about it, and that the underlying ways in which people care create narrative fabrics that are rich, complex and, most importantly, narratively consistent. Many of the important elements of plots, and all-important themes, are all based on such caring, which yields a vast number of stories that are, irrespective of the events themselves, based on specifically unique, but generally identifiable emotion patterns. We have previously shown that these techniques are effective and that when presented with novel computer-generated story variations, subjects said that the new stories were plausible and made sense (Elliott et al. 1998).

In the general case, building a content theory of plot manipulation is tantamount to building a content theory of the world. In our work, we go to great lengths to avoid this currently insurmountable task. Drawing on Love of Chair (Wikipedia 2021), the Electric Company spoof of daytime soap operas, we might say, "The boy was sitting in the chair," and it is very hard to argue that this is a story. If we say, "The boy was sitting in the chair, and felt guilty about it," our analysis will change. The boy cares, and we might want to know why. Or we might sympathize with him because we have ourselves sat in a chair and known that
we shouldn’t have. Or, we are reminded of being outside the principal’s office, feeling that we shouldn’t be there. And, too, there are dozens of flavors of that guilt the boy is feeling that might be of interest to us. We also have lots of stories where we know the outcome, and it is expected. “The boy sat in the chair and felt guilty about it. He just couldn’t stop himself from taking the candy.” There is not too much that is out of the ordinary here, but it is still a story. Yet we could say, “The boy sat in the chair, which had fourteen legs and was painted pink,” which does not meet our expectations, but is also not a story, because no one cares. If we want to build a robust, computational model of stories, we have to start with a computational model of emotion.
2 Story Morphing in the Affective Reasoner

In the Affective Reasoning system (Elliott 1991), we view particular (and typically essential) narrative components based on emotion as being eminently computable. Our goal in the story-morphing project is to build a computational model of the full range of emotions present in the lives of people of all cultures, and the narratives that arise from them. For such a task we cannot use a structure based on limited sets of "basic emotions" such as those derived, for example, from Ekman's facial expression analyses (Ekman and Oster 1979). Our attitude in this particular case is that emotion-face recognition is a valid component of emotion detection such as is used by, e.g., De Carolis et al.'s Empathetic Social Robots (De Carolis et al. 2015), but that people are in fact capable of reasoning about a much broader set of emotions in their daily lives, in their social interactions, and in their narratives. Our basis for determining whether an emotion is necessary in the final full spectrum of defined emotion types is whether each is necessary, and whether the collection of them is sufficient, to fully represent the rich tapestry of emotion narratives that people can express in any domain of human endeavor.

We also have put a high priority on the practical, computational makeup of our emotion model. For this reason, we explicitly exclude what Castelfranchi refers to as Felt Emotions (Castelfranchi 2015) that include modifications of emotion and cognition from endogenous, interoceptive and proprioceptive feedback loops. As described below we do allow for modification to the intensity of emotion instances based on some of these concepts. But this is done abstractly through the setting of static mood variables for arousal, physical well-being, depression-ecstasy and anxiety-invincibility that, in turn, directly affect the "cognitive" appraisals in our agents that later lead to emotion instances.
We also model context-dependent intensity variables such as surprisingness and effort. No attempt is made, however, to model the kinds of neuro-feedback loops from body feelings that Castelfranchi describes, which are beyond the scope of our current computational model.

Our cognitive emotion model thus includes twenty-eight emotion categories with multiple intensities and qualities within each category. (For example, the category anger includes annoyance, indignation, exasperation, outrage, and so on.) We compute at least three different intensities for many of the emotions. The model also includes over four-hundred channels for expressing emotions, roughly twenty channels tweaked for each
Fig. 1. Ortony et al. (1988), modified Elliott 2015 and 2021: The structure of appraisal within the content theory of emotions used as the basis for the dispositional component of the Affective Reasoner's emotionally intelligent agents.
emotion. For example, a somatic channel for the expression of love would include flushing, and pulse increasing, while a verbal other-directed emotion modulation expression of love might include saying something sweet to encourage the object of one’s attentions to respond in kind. The essential structure of the appraisal mechanisms composing what we call the disposition of agents is contained in Fig. 1 giving the description of twenty-eight emotions, based, originally, on the seminal work of Ortony et al. (1988). In our work we have used the construct of Emotionally Intelligent (computer) Agents to take the place of characters in stories (Elliott et al. 1998). For each such agent we can
ask, how does this agent feel about the events that are unfolding? and how might this agent express those feelings? Such agents are designed and implemented with two components: the aforementioned disposition which controls the way they interpret situations that unfold in a story, and a temperament which controls how they express any emotions that may arise [and see (Elliott 1997)]. Using these simple mechanisms, we can alter the disposition of the agents, such that they differently appraise situations that arise, and also alter their temperaments such that they express their emotions differently as well. In this way, within the constraints of not altering what happens in the plot of the story, we can still greatly change the emotion structure of the story and thus change the story itself. Because the agents are internally consistent within the content theory of emotion, their manifested emotions are as well, and the new stories generated make sense within the context of the new characters being portrayed. This is the crux of the matter: computational control of adding and manipulating complexity, while retaining the elegance of natural human interaction.

For example, let us consider a simple example partially using what Ortony has called the fortunes-of-others emotions (Elliott and Ortony 1992; Ortony et al. 1988):

Plot Steps: Lisa has a brother Jake who has a dog Scout. Scout gets out of the house and unbeknownst to Jake eats a dead squirrel she has found at the beach. Lisa visits Jake who comments that Scout seems subdued and possibly ill. Even though it is dinnertime, Scout shows no interest in her food.

Story One: Lisa feels sorry for her brother Jake, with whom she is close. She believes that Jake is very protective of Scout and will be worrying about her, and will also, simultaneously be mad at Scout for getting out of the yard. Jake is indeed worried about Scout. He feels guilty that he let her get out. Scout does not feel well because of the rotten squirrel.

Story Two: Lisa is gloating over her brother Jake, with whom she is very competitive. She feels reproach for Jake who does not know how to take care of his dog. Jake is indeed worried about Scout, but he also admires his intrepid escape-artist dog who is so smart, and is proud to be her owner. Scout does not feel well because of the rotten squirrel.

Story Three: Lisa is jealous of her brother Jake because Scout loves him and not her. She is reproachful of Jake whom she feels should want to take better care of his dog. She makes a plan to scold him later. Jake is gloating over his stupid dog, because she has obviously done something wrong and is now sick because of it. Scout does not feel well because of the rotten squirrel.

Story Four: Lisa does not feel much of anything; it is not her problem. Jake is furious at his dog Scout for getting out of the house and getting into trouble. He is speaking in a really loud voice. Scout is unhappy that Jake is mad at her. She is afraid that Jake will punish her. She is happy about having eaten the dead squirrel which was the high point of her day. Her stomach hurts but she doesn't care much about that.
Story Five: Lisa does not feel much of anything; it is not her problem. Jake is furious at his dog Scout for getting out of the house and getting into trouble. He is trembling and red in the face, but not saying anything. Scout has mixed emotions. She feels guilty that she ate the dead squirrel and she is afraid that Jake will start yelling at her, but she is also happy about having eaten the dead squirrel which was the high point of her day.

In this way using only the crudest of the manipulations that can be computationally controlled by the Affective Reasoner mechanisms in real time, we are nonetheless able to generate scores of stories in this same vein.

Let us now examine the mechanisms used in this simple example in a little more detail. In Fig. 1, we see that there are four main divisions of the twenty-eight emotion categories. First is the large set of emotions that arise because of the goals of agents: what agents want and don't want (both now and in the future), and in a related way, what they believe their friends and adversaries want and don't want. Second is the set of emotions based on the principles of the agents, relevant to actions agents believe should and should not be performed. Third is the small set of emotions based on what agents like and don't like, that is, their preferences. Fourth is the set of emotions based on combinations of other categories that subsume their constituent parts. Note also that in a pseudo category we have the mixed emotions, since while there is a conflict set in the manifestation of emotion (e.g., one can't shout and be silent at the same time) there is no such conflict in the feeling of multiple and even contradictory emotions at the same time (e.g., sorrow over the death of a favorite uncle; joy over the fact that he has bequeathed much-needed money).
On the one hand, to control how agents respond to the circumstances (the emotion eliciting conditions) that unfold in a story, we have to change the way they appraise those circumstances. These appraisals are part of what we build into an agent's stable disposition. In the first case Lisa appraises her brother Jake's distress as blocking one of her own fortunes-of-others goals (Lisa's desire for Jake's ongoing well-being): when Jake is sad, she feels sorry for him. Lisa is angry (a compound emotion) at Scout because Scout has both violated Lisa's principle that dogs should not run away and get in trouble, and also blocked her goal for Jake's well-being. Jake appraises the situation as indicative of a possible future goal of his own (keeping his dog healthy) being blocked, although this is currently uncertain (worry). He also, independently, appraises the situation as an instance of him violating his own principle of keeping Scout safe (guilt). Scout's own health well-being goal is being blocked by an uncomfortable stomach (distress). In the second case these appraisals change, and so do the resulting emotion states. For example, in the second story Lisa's friendship relationship with Jake has changed to become (in this situation) adversarial. So now when Jake's ongoing well-being has taken a downturn, Lisa feels good about it. All of the many changes in the emotions that arise in the various simple stories about Lisa, Jake, and Scout stem only from such changes in the ways that agents appraise the otherwise identical events that occur in the plot.

Among the twenty-odd action channels that differentially control the expression of any particular emotion we have a spectrum of paths such as somatic, behavioral
directed toward an inanimate [or animate] object, communicative [non-verbal/verbal] responses, evaluative self-directed attributions, ..., repression, suppression, reappraisal of the situation, reappraisal of one's self, other-directed emotion modulation, ..., full plan initiation, and so on,² from the simplest body responses, to the most complex intellectual responses. So, on the other hand, to control how agents manifest (express) their emotions we have to change what we build into their temperaments. We achieve this by changing the action-expression channels that are activated at any given moment for an agent, which in turn, taken together, compose the agent's temperament. For example, in story four Jake expresses his anger by shouting, a communicative verbal response, or possibly (because he is talking to a dog) a behavioral response directed toward an animate object, indicating that for this temperament, those channels are activated. By contrast, in story five, Jake is trembling and red in the face, which are somatic responses.

Having introduced the basics, we can now look at some additional ways in which we can control the emotion structure of stories.

Relationships: First, as hinted above, we have four relationships that we model between agents: friendship, adversarial, cognitive-unit, and no-relationship. A friendship relationship is intended to collect together all relationships wherein, e.g., agent Lisa will feel good when good things happen to agent Jake, and bad when bad things happen to Jake. An adversarial relationship is the opposite: Lisa will feel bad when good things happen to Jake, and good when bad things happen to Jake. A cognitive-unit relationship is when Lisa feels exactly what Jake feels, as though she were in his shoes. (For example, a mother may feel frightened with her son when he is forgetting his lines during the school play.) These relationships are unilateral (and even when bilateral might not be symmetric).
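The way a relationship selects a fortunes-of-others emotion can be sketched in a few lines of Python. This is an illustrative sketch only and not the Affective Reasoner's implementation: the function `fortunes_of_others`, its parameters, and the `relationships` dictionary are hypothetical names, and the emotion labels are the four fortunes-of-others emotions the paper names (happy-for, sorry-for, resentment, gloating).

```python
def fortunes_of_others(relationship, good_for_other, other_emotion=None):
    """Emotion an observer feels about an event that happens to another agent.

    relationship   -- observer's (unilateral) relationship toward the other
    good_for_other -- True if the event is desirable for the other agent
    other_emotion  -- what the other agent feels (used for cognitive-unit)
    """
    if relationship == "friendship":
        return "happy-for" if good_for_other else "sorry-for"
    if relationship == "adversarial":
        return "resentment" if good_for_other else "gloating"
    if relationship == "cognitive-unit":
        return other_emotion  # the observer feels what the other feels
    return None  # no-relationship: no fortunes-of-others emotion arises


# Relationships are unilateral, so they are keyed by (observer, other).
relationships = {("Lisa", "Jake"): "friendship"}
story_one = fortunes_of_others(relationships[("Lisa", "Jake")], good_for_other=False)

# Morphing only the relationship changes the emotion; the plot event is untouched.
relationships[("Lisa", "Jake")] = "adversarial"
story_two = fortunes_of_others(relationships[("Lisa", "Jake")], good_for_other=False)
```

Keying the dictionary by an ordered pair keeps Lisa's stance toward Jake independent of Jake's stance toward Lisa, which mirrors the unilateral nature of the relationships described above.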
It is possible to change the relationships that agents have with one another, and in this way affect the fortunes-of-others emotions that will arise in the system (sorry-for, happy-for, gloating, resentment). In story one above Lisa felt sorry for Jake because of her friendship relationship with him. In story two she gloated over him because of her adversarial relationship (here modeling competition).

² Based on unpublished work of Ortony, Elliott and Gilboa.

Emotion Intensity Variables: Next we can change the emotion-intensity-variables which contribute to the particular (intensity of) emotion that is generated within any one of the emotion categories, and also, thus, subsequently any change in action-responses that are dependent on emotion intensity. In the Affective Reasoner we can manipulate up to twenty-four different variables that affect how strongly situations elicit emotional responses in the agents, divided into three categories (Elliott and Siegle 1993).

The first category of such variables, the simulation-event variables, are those that are contained in the (simulation of the) situation itself. For story-morphing there are constraints on the usefulness of this set of variables because they are bound to the external plot and description of the story itself, which changes we always want to keep
to a minimum. That is, these are always actual plot changes, albeit possibly representing purely local changes that do not affect the remainder of the plot. For example, if we change the amount of money a patron leaves as a tip, the waiter might also appraise the event of getting the money differently (which is what we want with story-morphing), but we have to be very careful that such a change does not alter the plot in ways requiring real-world knowledge to control, which is explicitly beyond the scope of the AR's capabilities. Nonetheless, within the story-morphing context, some changes are possible within these constraints. A special non-theoretical subset of simulation-event variables is the manifestations of emotions that agents create, and which are themselves events within the system to which other agents may respond. To the extent that they do not change the plot, they are allowed, but with constraints. The values in the simulation-event variables change independently of an agent's interpretation of them, and one change in a single place (the simulation event) can conceivably affect all of the agents in the system at once. Among these variables are how much a goal is realized or blocked, how blameworthy or praiseworthy an action is as it is performed (e.g., how drunk the driver was), how certain the situation is, how real it seems, how surprising it is, how deserving the agent is of the situation, and so on. For example, if an adversary is particularly deserving of her bad fortune, an adversary agent observing that bad fortune may gloat in a particularly strong way.

The second category, the stable disposition variables, are those variable values that are internal to each agent. Changes in one of these will not affect any other agent's interpretation of situations. Unlike the simulation-event variables these values can generally be changed at will, and thus are easy to use with story-morphing.
That is, how an agent is disposed to see situations that arise is, in essence, up to them. Among these variables are how important it is to achieve a goal or keep it from getting blocked, or to uphold a principle or not have it violated. For example, after losing a game three times to a rival, the goal of winning might become increasingly important. Included in this category would be the emotional interrelatedness of two agents. The more (unilaterally) interrelated they are seen to be by one of the agents, the stronger the emotions (of that agent) generated in the context of this relationship. There is room for some finesse here, as well. For example, we have independent variables for how an agent sees the importance of upholding a principle, and for the importance of not violating it. In this way one version of an agent’s disposition might have strong emotions over the admirability of hearing romantic passion in music, but not care much at all if someone does not. An alternate version of the agent’s disposition might find the agent greatly disdainful of those that cannot hear romantic passion in music, but not feel much admiration for those that do, taking it to be expected of them. Or, the agent might feel strongly in both instances, or only mildly experience emotions in both cases.
Next, we have mood variables, which are intended to vary over time and affect both agents’ dispositions and their temperaments. Non-relationship mood variables, which include changes in values like arousal and physical well-being, can make emotions stronger or weaker. Also included (among others) are a bias toward negative or positive emotions across a strength spectrum (a good mood, a bad mood), and anxiety-invincibility, which affects the strength of prospect-based emotions. Relationship mood
variables affect how an agent is disposed toward judging another agent, or, differently, other agents as a whole. This affects how an agent is biased toward praising the actions of another, or toward condemning them.
Concerns of Others: Lastly, because emotions are sometimes generated on behalf of agents according to how they believe others to interpret situations (stored in what we call “concerns of others” structures, or COOs), we can alter these beliefs and thus change the appearance and strength of the fortunes-of-others emotions.
External Constraints on Story Morphs: When a character’s emotion changes using the story-morphing techniques, it may alter, reduce, or in the degenerate case be incompatible with the motivation for further steps in the plot: if a story character morphs to feel sorry for another, instead of being angry at them, it doesn’t make sense for them to attack the other agent. In practice, we allow such constraints to be externally recorded as part of the morph instructions. Typically, these are not the burden they appear to be: (a) A simple control on the valence of the emotion an agent feels at a particular plot point is generally enough to reasonably avoid the problems. (b) Humans are very forgiving of plots that involve the emotional inconsistency we perceive in others. In real life we are often mystified by the emotions of others and have libraries full of explanations of unexpected behavior. But, and this is important, the behavior must be consistently explainable (and emotions in the Affective Reasoner paradigm always have explanations based on the appraisals leading to them) and the personalities must remain reasonably consistent throughout the narrative (though we can manipulate the moods of agents in theoretically consistent ways as part of the story-morph). And, (c) as a last resort we can simply mark critical plot points as locked, and not allow story morphing of that particular situation within the larger plot.
To generate new stories using the story-morphing system, in any context, we must give some thought to the nature of each part of the plot. In particular, we have to decide which parts of the plot (“what happened”) fall into each of three general categories: (1) those parts that are independent of how agents feel about them (e.g., the invaders overrun the castle walls), (2) those parts that allow for emotional freedom, but within constraints (e.g., an agent has to feel positively (for some reason) about the prospect of meeting a stranger because they are next going to pursue a meeting in the plot), and (3) those parts of the plot that require specific emotions such that those emotions cannot be changed in any significant way. To meet these constraints we have to give some thought to possible appraisals that agents might make for each of the situations that arise during the unfolding of the plot. In general, we currently keep these local to a particular plot step, but it is possible to imagine a global collection of emotion constraints that allows for, say, a change in the valence of emotions at plot point 7 as long as some matching change in the emotions under specific constraints occurs at plot point 12. Within this context we typically still find a wide range of concerns that we can build for our agents, and a wide range of ways in which they express their resulting emotions, yielding dozens, and sometimes hundreds, of valid variations of the narrative. One serendipitous reason that story-morphing works is that we do not require a perfect product. Humans consider the reading of the emotions of others (and even themselves) an inexact science: we are often willing to jump to conclusions and provide our
own (possibly incorrect) abductive explanations about what someone is feeling, and why they are feeling it. This is a normal part of the human condition.
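The three plot-part categories described above, together with the "locked" last resort, could be recorded alongside the morph instructions roughly as follows. This is only a minimal sketch: the names `Constraint`, `PlotPoint`, `morph_allowed` and the small valence table are hypothetical and are not taken from the Affective Reasoner itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Constraint(Enum):
    FREE = "free"        # category 1: plot step is independent of how agents feel
    VALENCE = "valence"  # category 2: emotion may change, but its valence is fixed
    LOCKED = "locked"    # category 3: a specific emotion is required


@dataclass
class PlotPoint:
    description: str
    constraint: Constraint
    required_valence: Optional[int] = None  # +1 or -1, used by VALENCE
    required_emotion: Optional[str] = None  # used by LOCKED


# Hypothetical valence table for a handful of emotion categories.
VALENCE = {"hope": +1, "happy-for": +1, "fear": -1, "anger": -1, "sorry-for": -1}


def morph_allowed(point: PlotPoint, proposed_emotion: str) -> bool:
    """Return True if a morph may give the agent this emotion at this plot point."""
    if point.constraint is Constraint.FREE:
        return True
    if point.constraint is Constraint.VALENCE:
        return VALENCE[proposed_emotion] == point.required_valence
    return proposed_emotion == point.required_emotion
```

A morph generator could call `morph_allowed` for each candidate emotion at each plot step and discard any personality configuration that violates a constraint.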
3 How Development Proceeds
The story-morphing system is a structure-based system that can be applied in many application modes, and development for each will be different. The handling of text is an important consideration. The Affective Reasoner has no built-in text-generation or language-understanding capabilities. However, it is a system designed to drive such processes by providing computer-controlled manipulation and relatively sophisticated understanding of the underlying emotion fabric of stories, which we claim is one of the most critical elements of most narratives. Specifically-detailed themes arise from the emotion fabric, and can thus be transferred from one story context to another and from one story to another within the same context. Memorable scenes in stories are often dependent on the underlying emotion structure. Identification with characters often comes from identifying with the emotions they are experiencing, independent of the context of the characters’ lives.
To illustrate how development might proceed, let us suppose that we wish to manipulate characters within a computer-controlled presentation of our stories in some chosen mode. For the purposes of this position paper we need not be more specific. Such underlying work would apply to a number of applications: We might, for example, wish to generate teaching stories as part of an automated tutoring system [e.g., (Elliott et al. 1999)]. We might wish to play out our stories through virtual actors, using emotion-appropriate background music. We might wish to address the computer game content bottleneck problem by generating real-time characters as part of a game that act in novel (surprising, internally consistent...) ways according to the current configuration of their personalities. To build a platform that supports such systems we encode the plot as a series of events unfolding in a simulation.
These sim-events (simulation events) contain ground instances of “what happens” in our modeled world, played out as a discrete series of states within the system. Included in this series of sim-events is what the characters themselves (our agents) do. Actions of agents generated by the emotion system are inserted into the plot as additional sim-events. Our agents contain specialized internal, potentially matching, versions of all those sim-events that are important to them for one reason or another (based on their appraisal of those events as being relevant to their goals, standards and preferences). These internal structures, which we call appraisal-frames, form the basis of each agent’s disposition. The appraisal-frames support complex unification matching algorithms containing variables and functions, and themselves form the left-hand side of rules such that when a match succeeds between an appraisal-frame and the current sim-event the rule fires and an emotion is generated. And, notably, the emotion is generated along with all the variable bindings that were created during the match process. In this way, for example, an agent doesn’t just get angry, she gets angry at the other agent that was bound to the “agent-that-violated-my-principle” variable during the match of the offending sim-event.
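The idea of an appraisal-frame acting as the left-hand side of a rule, with the emotion carrying the variable bindings from the match, can be illustrated with a deliberately simplified sketch. It performs one-way pattern matching over flat slot/value pairs rather than the full unification with functions described above, and the slot names, the `?` variable convention, and the function names are assumptions made for illustration only.

```python
from typing import Dict, Optional, Tuple

SimEvent = Dict[str, str]   # ground facts about "what happens"
Bindings = Dict[str, str]   # variable name -> value bound during the match


def match_frame(frame: Dict[str, str], event: SimEvent) -> Optional[Bindings]:
    """Match a frame against a ground sim-event.

    Slot values starting with '?' are variables; any other value must be equal
    to the event's value. Returns the bindings on success, None on failure.
    """
    bindings: Bindings = {}
    for slot, pattern in frame.items():
        if slot not in event:
            return None
        value = event[slot]
        if pattern.startswith("?"):
            # A variable used more than once must bind to the same value.
            if bindings.setdefault(pattern, value) != value:
                return None
        elif pattern != value:
            return None
    return bindings


def appraise(frame: Dict[str, str], emotion: str,
             event: SimEvent) -> Optional[Tuple[str, Bindings]]:
    """Fire the rule: return (emotion, bindings) when the frame matches."""
    bindings = match_frame(frame, event)
    return None if bindings is None else (emotion, bindings)
```

With a frame such as `{"act": "break-promise", "actor": "?agent-that-violated-my-principle", "victim": "lisa"}`, a matching sim-event in which Jake is the actor yields anger together with the binding of that variable to Jake, so the emotion has a target.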
Once an emotion is generated, it is expressed through the particular temperament channels that are activated as part of the agent’s current personality. The resulting expression of the emotion is formed as a new sim-event, and placed in the simulator’s event queue. In this way all of the events in the plot are simulated, along with all of the emotion events that the plot generates. To play the original story we configure our agent personalities with (our interpretation of) the concerns and temperaments of the original characters. To create a story-morph we variously alter the appraisal-frames (comprising an agent’s disposition) to embody different concerns of the characters, different moods, different interpretations of the concerns of others, and different relationships. Then we also alter the activation of the emotion-expression channels (comprising an agent’s temperament) to embody different ways of the agents expressing themselves. Then, we re-run the simulation to generate a new story. An important feature of this system is that novel, internally-consistent stories can be generated automatically by the Affective Reasoner system without intervention by human authors when it chooses which appraisal-frames, temperament channel activations, mood values and relationship values it prefers to build into a novel personality on behalf of any particular agent.
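The play-then-morph cycle just described might be sketched as a small event loop. This is a toy illustration under stated assumptions, not the Affective Reasoner's actual simulator: dispositions are reduced to plain functions from an event to an emotion name, temperaments to a lookup from emotion to an expressive act, and expression events are not themselves re-appraised (a simplification that keeps the loop finite). All names are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional

Event = Dict[str, str]
# A disposition maps an incoming event to an emotion name, or None.
Disposition = Callable[[Event], Optional[str]]


def run_story(plot: List[Event],
              dispositions: Dict[str, Disposition],
              temperaments: Dict[str, Dict[str, str]]) -> List[Event]:
    """Play plot events in order; emotion expressions become new sim-events."""
    queue: Deque[Event] = deque(plot)
    story: List[Event] = []
    while queue:
        event = queue.popleft()
        story.append(event)
        if event.get("kind") == "expression":
            continue  # simplification: expressions are not re-appraised here
        for agent, appraise in dispositions.items():
            emotion = appraise(event)
            channel = temperaments[agent].get(emotion) if emotion else None
            if channel is not None:
                # Queue the expression so it plays right after its cause.
                queue.appendleft({"kind": "expression", "agent": agent,
                                  "emotion": emotion, "act": channel})
    return story
```

Re-running `run_story` on the same plot with a different disposition for one agent (for example, one that yields gloating instead of sorry-for when a rival loses) produces a different event sequence from identical plot steps, which is the essence of a story-morph.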
4 Additional Aspects of the Affective Reasoner

4.1 Humor
Certain types of stories are humorous because of their particular emotion structure (Elliott 1999). When this structure can be fully captured by our emotion theory, or minor extensions of it involving specific roles, we can also generate humorous stories.

4.2 Case-Based Reasoning
Some versions of the emotionally intelligent agents have case-based reasoning capabilities that allow them to dynamically choose different internal Concerns-of-Others representations (COOs) for how they believe others see the world. These are based on the features of eliciting situations that arise, and the responses of agents to those situations, used as remindings for how others they have known (or they themselves) see the world. In the design of a separate Affective Reasoner project, to build a compassionate computer, emotionally intelligent agents use collections of emotion-story templates to ask increasingly appropriate questions when users are attempting to convey narratives about important events in their lives. These new stories are then indexed under the templates as new, retrievable, cases.

4.3 Applications
A computational emotion-based theory of stories is widely applicable. One obvious application is in addressing what is known as the content bottleneck in computer games.
In the gaming industry there is a delicate balance between (a) making a computer game too easy to “figure out,” so that interest is not maintained for long, and (b) making it too complex so that it is too opaque to understand initially, and interest is never piqued. Appropriately complex content is expensive to develop, and varied plots are both difficult to generate and burdensome to make cohesive and interesting. Using story-morphing techniques we claim that highly complex and novel game-stories can be automatically generated based on how the characters feel about what is unfolding, by controlling their dispositional behavior, rather than on the external complexities of the plot. Other applications include the building of a compassionate computer, role-playing therapy, story-telling and story-understanding systems, applications for children, and as a component part of emotionally-intelligent agents.

4.4 Users as Agents
Lastly, and importantly, users are treated the same as agents within the real-time simulation. Input from users creates sim-events, agents reason about users’ emotions and motivations in the same way they reason about other agents, and they may have relationships with users in the same way they have relationships with other agents. This allows for the possibility of rich, novel interaction with users in general, and specifically with players of computer games, students in tutoring systems, and users in role-playing therapy games.
5 Morphing the Monster
Let us now look at a more extended example which will help to illustrate the richness of the story-morphing palette. We borrow a short passage from Chapter Five of Mary Shelley’s Frankenstein, Or, The Modern Prometheus (Shelley 1998):

5.1 A Paraphrase of the Original Narrative: Snippet One
• Background: Victor Frankenstein has created an artificial human being which is just now coming to life.
• Victor is in a high state of arousal because the monster is finally coming to life after years of work.
• Physical beauty is important to Victor. Victor likes correct design proportions in the human body. Victor likes clear beautiful skin, and dislikes yellow pasty skin and yellow eyes.
• Victor has worked very hard to make a beautiful artificial human.
• Victor was hoping to celebrate creating new life, but is now disappointed that his creation is ugly.
• Victor leaves to get some sleep, and dreams of his wife.
• The Monster gets up, goes to see Victor, and loves him. He wants approval. He smiles at Victor and reaches out his hand.
• Victor wakes, is overcome with disgust, and rushes downstairs.
• Victor is bitterly disappointed that the new life he has created is not beautiful.
Using only those story-morphing techniques that manipulate computable aspects of characters’ personalities (disposition and temperament) within the context of the narrative we can create different meaning, using the same plot steps that take place in the original narrative. For the purposes of illustration we present only snippets of what would be significantly more extensive passages in a full story-morph treatment.

5.2 Story Morph Snippet Two
• Victor loves his creation (admires the struggle for life; feels the fact of newly created life is itself beautiful).
• However, with conflicting emotions Victor is also quite repulsed by the ugliness of the monster’s actual appearance.
• Victor feels that parents should love their children and see them as attractive no matter what (a principle), and sees himself as the parent of the monster.
• Victor is ashamed that he does not see his creation as beautiful.
• The monster is afraid that Victor will not like him.
• The monster’s fears are confirmed when Victor runs away from him.
• The monster is angry at Victor for hurting his feelings, and for not taking care of him.

5.3 Story Morph Snippet Three
• Victor makes an assessment of the monster. He feels nothing but scientific curiosity. He is tired and leaves to get some sleep.
• The monster comes to life and is desperate for the affection of his creator.
• When Victor does not respond the monster is sad and thrown into depression.
• Victor feels reproach for the monster for being so emotional and leaves.
• The monster feels abandoned.

5.4 Story Morph Four
• Victor loves his monster.
• He fears that others will harm his monster because they will see the monster as ugly.
• Victor feels guilty he did not make a beautiful creation and it is his fault that others will harm his creation.
• Victor admires the monster’s strength.
• Victor leaves to sleep.
• The monster is curious and goes to see Victor.
• Victor is hoping to see signs of love in the monster’s eyes, but sees none.
• The monster feels nothing for Victor.
• Victor feels rejected by the monster and this leads to bitter disappointment because Victor has been hoping for two years to build someone that will love him. He has invested a great deal of effort in this project.
• Victor feels shame that he has not provided his monster with a family where members love one another. He can’t bear his shame and leaves.
5.5 Story Morph Five
• Victor is in a strong adversarial relationship with (toward) the monster he has created. His goal is to create life that he can mistreat with impunity. The monster has a strong friendship relationship with (toward) Victor.
• Victor is gloating because his monster has ugly skin and he will be able to use this against the monster. He is sad that the monster has good proportions. He is afraid that because the monster is so strong he will not be able to mistreat him very extensively.
• Victor looks forward to the moment when the monster realizes that Victor despises him.
• Victor leaves to get some rest.
• The monster comes to life and has a strong desire for human contact. He is lonely, but is hopeful of being loved by others.
• He finds Victor and is satisfied to find human company. He feels love for Victor and reaches out to him.
• Victor expresses his disgust at the monster. He is very excited about the impending feelings of rejection the monster will feel.
• The monster is now terribly sad to be rejected by Victor.
• Victor gloats over the monster’s distress.
• The monster gets angry at Victor for behaving so badly.
• Victor fears that the monster will hurt him and runs away.

5.6 Story-Morph Snippet Six
• Victor loves his monster very much. He believes that parents should love their children and also that they should always find their children beautiful despite their faults. He is proud of loving his creation despite his ugliness, but he is remorseful that he finds the monster repulsive. He does not express his love strongly because such emotions are repressed in his temperament, but his temperament is also such that disgust is shown in a communicative-verbal way. He calls out at the monster saying, “You are disgusting!”
• The monster, who desperately wants love, does not see that Victor loves him, but only that Victor is repulsed by him.

5.7 Story-Morph Snippet Seven
As above in story-morph five, but...
• The monster is very happy to find himself alive. This puts him in a very good mood, and he is predisposed to appraising the world in a positive light.
• The monster ignores Victor’s comment that he, the monster, is disgusting, but notices Victor’s obsessive attentional focus on him as an object of love. He is satisfied to feel Victor’s love.
5.8 Story-Morph Snippet Eight
[Using the case-based reasoning capabilities of agents]
The monster knows that he has pasty skin and yellow eyes. He sees that Victor is disgusted. Taking these features together he changes his internal representation of how he believes Victor sees him, the monster. He feels sorry for Victor because he, the monster, is so ugly, and now believes this makes Victor unhappy.

5.9 Story-Morph Snippet Nine
[Using extensions for Affective Reasoner-based computations of humor]
• The monster sees Victor as an authority figure. He believes Victor holds everyone to high standards of behavior. He believes Victor will hold him, the monster, to high standards. Victor has created a monster that is ugly, thus violating one of his own standards. The monster observes that Victor knows the monster knows that Victor has violated his own standards. The monster finds Victor’s chagrin funny, and laughs.
• Victor is embarrassed. He resents the monster laughing at him.

5.10 Story-Morph Snippet Ten
[Using extensions for altering Concerns-of-Others structures]
• Victor looks at his sleeping monster and feels pity for him because he assumes the monster will feel very badly about being so ugly.
• Victor leaves to sleep.
• The monster wakes and is happy to be alive. This happiness trumps all other feelings.
• Later when the monster comes, Victor realizes that the monster is happy, and changes how he believes the monster perceives himself. He stops feeling pity and now feels happy for the monster that he enjoys being alive.

5.11 Some Finer-Grained Variations
The monster is now terribly sad to be rejected by Victor:
• The intensity of this rejection is increased because this is unexpected by the monster, and the surprisingness contributes to intensity.
• The importance of not being rejected is very high for the monster, and this increases the intensity.
• Feeling rejected causes the monster to express this by reappraising [himself] as being ugly and unlovable.
• The disgust on Victor’s face is taken by the monster to be a very strong indication of intense dislike, and this [sim-event variable] contributes to the intensity.
• Victor’s temperament is manipulative and he expresses his disgust through Other-directed emotion modulation channels to make the monster feel as badly as possible.
Each of these snippets from different story-morphs is based exclusively on components of emotion that the Affective Reasoner system can manipulate. In addition, because they are based entirely on the logical structure of how emotions arise, and are expressed, they come with sophisticated explanations, embodied as explicit values in what might be hundreds of details for each emotion generated.
6 Implementation
Current technical development is focused on putting the agents on the web using AWS Linux, Perl, Python 3, PHP, the AI engine in ABCL/SBCL Common Lisp, Java networking, the Google speech engine, Google speech recognition, the Chrome browser (for the speech interface), WebSockets, JavaScript, MIDI-to-MP3 conversion for computer-selected music expression, and browser-based SVG for morphing 72 facial expressions. The current design focus is on building a corpus of modifiable, common, emotion-story schemas as a basis for constructing compassionate software agents.
7 Conclusions and Summary
Complex, but precise, emotion structure can be teased out of all stories, based on how the characters appraise the events that arise in the narrative. This emotion structure is essential to what makes a story a story. The emotion structure is both portable (it can be repeated in an entirely different context) and subject to manipulation. The Affective Reasoner, which does manipulate such emotion structures, can be used to automatically generate novel stories which are nonetheless internally consistent because of the consistent nature of the artificial personalities that are dynamically constructed by the computer. We have discussed the nature of what makes a story a story, and not simply a collection of plot events, claiming that how characters care about unfolding events is critical, and also, necessarily, gives rise to emotions. We introduced the underlying emotion theory that drives the Affective Reasoner. We worked through several snippets of stories to see how story-morphing works, and we concluded with a more extended example from Chapter Five of Mary Shelley’s Frankenstein, Or, The Modern Prometheus.
References
Bruner, J.: Culture and human development: a new look. Hum. Dev. 33(6), 344–355 (1990)
Bruner, J.S.: On perceptual readiness. Psychol. Rev. 64, 123 (1957)
Castelfranchi, C.: Felt emotion. In: Emotion and Sentiment in Social and Expressive Media @AAMAS, pp. 67–76 (2015)
De Carolis, B.N., Ferilli, S., Palestra, G., Carofiglio, V.: Towards an empathic social robot for ambient assisted living. In: Emotion and Sentiment in Social and Expressive Media @AAMAS, pp. 19–34 (2015)
De Keersmaecker, J., et al.: Investigating the robustness of the illusory truth effect across individual differences in cognitive ability, need for cognitive closure, and cognitive style. Pers. Soc. Psychol. Bull. 46, 204–215 (2020)
Ekman, P., Oster, H.: Facial expressions of emotion. Ann. Rev. Psychol. 30(1), 527–554 (1979)
Elliott, C.: The affective reasoner: a process model of emotions in a multi-agent system, Technical report #32. Ph.D. dissertation, Northwestern University, The Institute for the Learning Sciences (1991)
Elliott, C.: I picked up catapia and other stories: a multimodal approach to expressivity for emotionally intelligent agents. In: Proceedings of the First International Conference on Autonomous Agents, pp. 451–457 (1997)
Elliott, C.: Why boys like motorcycles: using emotion theory to find structure in humorous stories. Unpublished paper, School of Computer Science, DePaul University, Chicago (1999)
Elliott, C., Ortony, A.: Point of view: modeling the emotions of others. In: Proceedings 14th Annual Conference of the Cognitive Science Society, pp. 809–814 (1992)
Elliott, C., Siegle, G.: Variables influencing the intensity of simulated affective states. In: AAAI Spring Symposium on Reasoning about Mental States: Formal Theories and Applications, pp. 58–67 (1993)
Elliott, C., Brzezinski, J., Sheth, S., Salvatoriello, R.: Story-morphing in the affective reasoning paradigm: generating stories semi-automatically for use with “emotionally intelligent” multimedia agents. In: Proceedings of the Second International Conference on Autonomous Agents, pp. 181–188 (1998)
Elliott, C., Rickel, J., Lester, J.: Lifelike pedagogical agents and affective computing: an exploratory synthesis. In: Wooldridge, M.J., Veloso, M. (eds.) Artificial Intelligence Today. LNCS (LNAI), vol. 1600, pp. 195–211. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48317-9_8
Fodor, J.A.: The Language of Thought, vol. 5. Harvard University Press, Cambridge (1975)
Gray, R.M., Bourke, F.: Remediation of intrusive symptoms of PTSD in fewer than five sessions: a 30-person pre-pilot study of the RTM Protocol. J. Milit. Veteran Family Health 1, 13–20 (2015)
Kim, E., Padó, S., Klinger, R.: Investigating the relationship between literary genres and emotional plot development. In: Proceedings of the Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 17–26 (2017)
Lakoff, G.: The contemporary theory of metaphor (1993)
MasterClass: Complete Guide to Literary Themes: Definition, Examples, and How to Create Literary Themes in Your Writing, 15 February 2021. From MasterClass: https://www.masterclass.com/articles/the-complete-guide-to-narrative-theme-in-literature-definition-examples-and-writing-how-to#what-is-a-literary-theme. Accessed 15 Feb 2021
McClelland, J.L., McNaughton, B.L., O’Reilly, R.C.: Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychol. Rev. 102, 419 (1995)
Ortony, A., Clore, G.L., Collins, A.: The Cognitive Structure of Emotions. Cambridge University Press, Cambridge (1988)
Pylyshyn, Z.: Is vision continuous with cognition?: the case for cognitive impenetrability of visual perception. Behav. Brain Sci. 22, 341–365 (1999)
Reagan, A.J.: Towards a science of human stories: using sentiment analysis and emotional arcs to understand the building blocks of complex social systems (2017)
Rescorla, M.: The Language of Thought Hypothesis (E.N. Zalta ed.) (2019). From the Stanford Encyclopedia of Philosophy (Summer 2019 Edition). https://plato.stanford.edu/archives/sum2019/entries/language-thought/. Accessed 14 Apr 2021
Schank, R.C.: Tell Me a Story: Narrative and Intelligence. Northwestern University Press, Evanston (1990)
Shelley, M.W.: Frankenstein, Or, The Modern Prometheus: The 1818 Text. Oxford University Press, New York (1998)
Wikipedia: Love of Chair, 15 February 2021. From Wikipedia: https://en.wikipedia.org/wiki/Love_of_Chair. Accessed 15 Feb 2021
Dynamic Strategies and Opponent Hands Estimation for Reinforcement Learning in Gin Rummy Game
Yuexing Hao1(B) and Mark Vaysiberg2
1 Tufts University, Medford, MA 02155, USA
2 Rutgers University, New Brunswick 08901, USA
Abstract. Gin Rummy is an old and popular card game, created by Elwood T. Baker and his son C. Graham Baker in the 20th century. This imperfect-information card game allows players to use strategies and mathematical calculations to maximize their win rate. In this paper, we present an AI Gin Rummy player written in Java. Based on the game’s rules, we developed both a discard algorithm and a draw algorithm that maximize the win rate by switching between different strategies. Moreover, to win more points, we developed dynamic strategies, which are more responsive and intelligent: the strategies’ algorithms adapt based on the results of previous games. Implementing the discard and draw algorithms increases the win rate from 50% to 57.85%. Including our dynamic strategies further increases the win rate to 67.735% (over 100,000 games).
Keywords: Reinforcement learning · Dynamic strategies · Opponent hands estimation model
1 Introduction
Gin Rummy is a card game for two players, each of whom is dealt a ten-card hand. The players take turns drawing a card from the face-down pile or the visible discard pile and then discarding a card. The objective is to create melds, which are sets (3 or more cards of the same rank) or runs (3 or more consecutive cards of the same suit), in order to minimize deadwood (the point sum of the cards that do not belong to melds), and to knock (end the round) when holding less deadwood than the opponent. A player can knock only with 10 or fewer points of deadwood, and the opponent then has the opportunity to lay off, that is, to remove deadwood from their hand by adding it to the knocking player’s melds. To achieve the objective of minimizing deadwood, we created discard and draw strategies. These algorithms maximize the player’s chances of making melds and remove large deadwood by prioritizing easy potential melds over hard potential melds, and hard potential melds over dead deadwood, while estimating the opponent’s hand to minimize the likelihood that the opponent benefits. We also built an opponent hands estimation model (OHEM) to reduce the possibility of choosing the wrong cards.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 543–555, 2022. https://doi.org/10.1007/978-3-030-82193-7_36
To find the correct time to knock (or choose not to knock at all), we modeled the Simple Gin Rummy Player’s deadwood as a function of round number using an exponential regression. To make the knock strategy dynamic, we subtracted from the deadwood estimate a variable threshold that changes depending on the opponent.
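One way such a knock rule could be put together is sketched below. It is an illustration under assumptions, not the authors' Java implementation: the exponential model is fitted by ordinary least squares on the logarithm of the deadwood values, and the decision rule (knock when legal and when our deadwood is below the threshold-adjusted estimate) is a guess at how the estimate and threshold might be combined. The function names and parameters are hypothetical.

```python
import math
from typing import List, Tuple


def fit_exponential(rounds: List[float], deadwood: List[float]) -> Tuple[float, float]:
    """Fit deadwood ~ a * exp(b * round) by a linear least-squares fit on ln(deadwood).

    All deadwood values must be positive for the logarithm to be defined.
    """
    n = len(rounds)
    logs = [math.log(d) for d in deadwood]
    mean_t = sum(rounds) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in rounds)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(rounds, logs))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_t)
    return a, b


def should_knock(own_deadwood: int, round_number: int,
                 a: float, b: float, threshold: float) -> bool:
    """Knock only if legal (deadwood <= 10) and below the adjusted opponent estimate."""
    estimate = a * math.exp(b * round_number) - threshold
    return own_deadwood <= 10 and own_deadwood < estimate
```

Under this reading, raising `threshold` against a particular opponent makes the player more cautious about knocking, while lowering it makes the player knock earlier.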
2 Gin Rummy Rules
Gin Rummy is played by two players with a standard 52-card deck without Jokers. Each player is dealt 10 cards at random by the computer. The first player to reach or exceed 100 points wins the game. Each player tries to form melds in their hand. In Gin Rummy, there are two kinds of melds: sets and runs. A set is 3 or 4 cards of the same rank; a run is 3 or more cards of the same suit in sequence. Each card counts its rank in points, except the face cards, which count as 10 points each. On each turn, a player first draws a card, either the top card of the discard pile, which is public to both players, or a card from the random pile, and then discards one card to the discard pile. A card drawn from the random pile may be discarded in the same turn. The round ends when a player chooses to “knock” or when there are no cards left in the random pile. A player may knock only when their total deadwood is 10 points or less. The total deadwood is the sum of the points of the cards that do not belong to any meld. When a player knocks, the opponent has the opportunity to “lay off” cards that can be added to the knocking player’s melds. The player with the smaller deadwood wins the round and scores the difference between the two deadwood totals. If the player knocks with 0 deadwood, they get a 25-point “gin” bonus. Additionally, if the player who did not knock wins the round, they get a 25-point “undercut” bonus. Table 1 below defines the new concepts that appear in our paper; they are explained in detail later.

Table 1. Definitions of new concepts
New concept name | Meaning
♦, ♥, ♣, ♠ | Diamond (D), Heart (H), Club (C), Spade (S)
Potential Set | Has a high probability of forming a set; already has 2 cards of the same rank
Potential Run | Has a high probability of forming a run; already has 2 cards of the same suit
Potential Meld | The combination of potential set and potential run
Dead deadwood | Cards that can no longer be used to form a meld or potential meld
Easy (Potential) Run | The cards are consecutive, like 6♥ and 7♥
Hard (Potential) Run | The cards are not consecutive, like 6♥ and 9♥
Random pile | The rest of the cards, which are not in either player’s hand
Discard pile | The cards discarded by the players (which are public)
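The scoring rules of this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `card_points` and `score_round` are hypothetical, layoffs are assumed to have been applied before the deadwood totals are passed in, and a tie is assumed to count as an undercut, which matches the ">=" condition used for category y in Table 2.

```python
GIN_BONUS = 25       # bonus for knocking with 0 deadwood
UNDERCUT_BONUS = 25  # bonus for the non-knocking player who wins the round


def card_points(rank: int) -> int:
    """Points for a card rank, with 1 = Ace and 11-13 = face cards."""
    return min(rank, 10)


def score_round(knocker_deadwood: int, opponent_deadwood: int) -> tuple[int, int]:
    """Return (knocker_points, opponent_points) for one round.

    Assumption: opponent_deadwood is already reduced by any layoffs.
    """
    if knocker_deadwood > 10:
        raise ValueError("a player may knock only with 10 deadwood or less")
    diff = abs(opponent_deadwood - knocker_deadwood)
    if knocker_deadwood == 0:
        # Gin: difference plus the gin bonus.
        return diff + GIN_BONUS, 0
    if knocker_deadwood < opponent_deadwood:
        # Ordinary knock win: the difference only.
        return diff, 0
    # Undercut (assumed to include ties): difference plus the undercut bonus.
    return 0, diff + UNDERCUT_BONUS
```

For example, under these assumptions a knock with 8 deadwood against an opponent holding 3 gives the opponent 5 + 25 = 30 points, which is why the timing of the knock matters so much in the later sections.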
Dynamic Strategies and Opponent Hands Estimation for Reinforcement Learning
3 Related Work
Much prior research focuses on imperfect-information card games. Because a player in this kind of game has only limited information about the opponent, it is important to use AI algorithms or strategies to predict the opponent’s hand or to estimate the missing information. For example, earlier researchers built the AI player “DeepStack” for Texas Hold’em [1], and Google’s DeepMind team built the software “AlphaGo” for the board game Go. These works laid a solid foundation for solving incomplete-information games [3]. Previous studies also developed strategies and models to increase the win rate. They used reinforcement learning to train models and tracked the dynamics of human behavior in poker [5]. Learning human behavior in these imperfect-information card games is an important aspect of AI, because it lets strategies and models adapt better to each situation [4, 6].
4 Static Strategies
The number of possible information sets in Gin Rummy is very large. Since there are 52 cards in the game and each player holds 10 cards, the number of possible hands for one player is C(52, 10) ≈ 15.8 billion. The initial number of possibilities for the random pile is C(52, 32), and this number decreases as players draw cards from the pile. Therefore, it is important to build strategies rather than perform huge calculations [2].
4.1 Discard Strategy
The discard strategy is always connected with the opponent’s cards. The goal of this strategy is to get rid of dead deadwood, which lowers the total deadwood points. To decide which card to discard, we need to predict which cards the opponent holds in order to minimize the chance that the discarded card will help them. When the opponent picks up a face-up card, we can form a matrix like Fig. 2. A card drawn from the discard pile is very likely to be part of a meld. For instance, if the opponent picks up 6♥, they are likely to hold 6♣, 6♦, 6♠, 5♥, or 7♥. Estimating the other cards of a run is more complicated, however, since a run may have 3 or more cards. Therefore, we also include 4♥ and 8♥, but with a lower probability. When we choose a discard, we match our intended discard against the opponent’s cards in the matrix. For example, in Fig. 1, we want to discard 8♥ because it is the maximum dead deadwood (mdd) in our hand. The mdd is the highest-valued card that cannot be used to form a meld or potential meld; discarding it lowers our total deadwood. In the situation of Fig. 2, the opponent picks up 6♥ from the discard pile. They probably hold some of 4♥, 5♥, 7♥, 8♥, 6♣, 6♠, 6♦. We then match 8♥ against the orange boxes and find that it has some probability of completing the opponent’s run. Therefore, to avoid helping the opponent form a meld, we do not discard 8♥ in this round.
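The matching step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' code: the function names are hypothetical, cards are represented as (rank, suit) pairs with rank 1 to 13, and the weights 0.5 and 0.3 are illustrative placeholders for "likely" and "less likely" cards.

```python
SUITS = ("D", "H", "C", "S")

# Illustrative weights (assumptions, chosen only to rank cards).
PICKED = 1.0    # the card the opponent was seen picking up
LIKELY = 0.5    # same rank in another suit, or adjacent rank in the same suit
POSSIBLE = 0.3  # two ranks away in the same suit (longer run)


def estimate_after_pickup(rank: int, suit: str) -> dict:
    """Map each card of interest to a rough weight after one pickup."""
    estimate = {(rank, suit): PICKED}
    # Set candidates: the same rank in the other three suits.
    for other in SUITS:
        if other != suit:
            estimate[(rank, other)] = LIKELY
    # Run candidates: neighbours in the same suit, weaker when further away.
    for offset, weight in ((1, LIKELY), (2, POSSIBLE)):
        for neighbour in (rank - offset, rank + offset):
            if 1 <= neighbour <= 13:
                estimate[(neighbour, suit)] = weight
    return estimate


def is_risky_discard(card: tuple, estimate: dict) -> bool:
    """A discard is risky if it matches any card of interest."""
    return card in estimate
```

With the example from the text, after the opponent picks up 6♥ the intended discard 8♥ matches the lower-weight run candidates, so it would be flagged as risky, whereas 9♥ would not.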
Y. Hao and M. Vaysiberg
Fig. 1. Opponent picks up one card
However, sometimes the matched card has a high value and may become a burden on our total deadwood. We therefore compare the maximum matched card value with the maximum unmatched card value. After trying different thresholds, we found that if the difference between them is greater than 7, we can choose to discard the maximum matched card. In the example above, 8♥ is the maximum matched card. It may help the opponent form a meld, but to lower our total deadwood, the algorithm still decides to discard this dead deadwood. We then compare the value of the second-largest mdd with that of the maximum potential meld’s card and discard whichever has the larger value. When discarding unmatched cards, we likewise compare the value of the mdd with that of the biggest meld’s card. We discard the mdd when its value is greater than or equal to 6; otherwise, we discard the biggest meld’s card. By implementing the discard strategy, our win rate increases from 50.034% to 53.412% (over 100,000 games). This shows that choosing the right card to discard is an important strategy in this game.
4.2 Draw Strategy
In Gin Rummy, we also draw a new card in each round, either the top card of the discard pile or a card from the random pile. Since we have no information about the random card, choosing it is sometimes riskier than choosing the discard pile card. Therefore, we developed a draw strategy to increase the probability of choosing the right card. If the card from the discard pile can be used to form a meld or be added to an existing meld, it is automatically picked up. In all other cases, we consider the card we expect to discard according to the discard algorithm, our current deadwood total, whether the card will be classified as dead deadwood, a potential hard meld, or a potential easy meld, and the change in our deadwood after we pick it up. The strategy is split into two scenarios: the player has at most 26 deadwood, or the player has more than 26 deadwood.
In the former case, the player is expected to knock in the following rounds. Then, the change in deadwood is considered as a priority over forming potential easy melds, potential hard melds, and dead deadwood. Thus, the card will be picked up only if the change in deadwood is greater than or equal to 7.
Algorithm 1. Draw Strategy
Input: our total deadwood, the discard pile card’s value, the discard card from the discard algorithm
IF the discard pile card can match with one of our sets/runs THEN take the discard pile card
ELSE IF (total deadwood <= 26)
    IF (discard card value - face-up value >= 7) THEN take the discard pile card
    ELSE take the random pile card
ELSE IF (discard card value - face-up value >= 7 AND the face-up card becomes a potential HARD meld) THEN take the discard pile card
ELSE IF (discard card value - face-up value >= 5 AND the face-up card becomes a potential EASY meld) THEN take the discard pile card
ELSE take the random pile card
Output: random/discard pile card
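The draw rule can be written as a short Python function. The sketch below follows the two scenarios described in the text (deadwood of at most 26 versus more than 26) and the constants 7 and 5; the function name, argument names, and string labels are hypothetical, and the meld classification of the face-up card is assumed to be computed elsewhere.

```python
from typing import Optional


def choose_draw(total_deadwood: int,
                faceup_value: int,
                planned_discard_value: int,
                completes_meld: bool,
                potential_meld: Optional[str] = None) -> str:
    """Return "discard" to take the face-up card, or "random" otherwise.

    potential_meld is "hard", "easy", or None (dead deadwood); this
    classification is an input here, not derived by the sketch.
    """
    # A card that forms or extends a meld is always picked up.
    if completes_meld:
        return "discard"
    # Change in deadwood if we take the face-up card and drop the planned discard.
    gain = planned_discard_value - faceup_value
    if total_deadwood <= 26:
        # Close to knocking: only the deadwood reduction matters.
        return "discard" if gain >= 7 else "random"
    # Far from knocking: the card must also build a potential meld.
    if gain >= 7 and potential_meld == "hard":
        return "discard"
    if gain >= 5 and potential_meld == "easy":
        return "discard"
    return "random"
```

Keeping the meld classification as an argument separates the threshold logic, which the paper tunes experimentally, from hand analysis.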
In the latter case, the classification of the card is considered. First, if the change in deadwood is greater than or equal to 7 and the card becomes part of a potential hard meld, it is picked up. Next, if the change in deadwood is greater than or equal to 5 and the card becomes part of a potential easy meld, it is picked up. In all other scenarios, the card is taken from the random pile. These constants were all determined experimentally to maximize the win rate against the SimpleGinRummyPlayer over 100,000 games.
4.3 Opponent Hand’s Estimation Strategy
To build the opponent hand’s estimation strategy, it is important to know the source of their cards. In this game, we see the opponent’s cards only when they draw from the discard pile or discard a card; we do not see the cards they draw from the random pile. Therefore, based on the cards they draw and discard, we can infer which melds they may or may not have. Below are some situations where the opponent picks up cards from the discard pile.
Fig. 2. Opponent picks up two face-up cards
For Fig. 2, the opponent picks up 6♥ and 9♦, which seem to be unrelated. We may not know the connection between these two draws; we can only estimate that the opponent may have melds around 6 or 9.
Fig. 3. Opponent picks up two face-up cards and sets up an obvious run
For Fig. 3, the opponent picks up 6♥ and 8♥. In this situation, the opponent very likely holds 7♥, since 6♥, 7♥, 8♥ is an obvious run. Because a run may have more than 3 cards, it is also possible that the opponent holds 5♥ or 9♥.
Fig. 4. Opponent picks up two face-up cards and forms an obvious set
In the situation of Fig. 4, the opponent very likely has a set of 6s. If we have 6♣ or 6♦ in our hand, we should not discard it, because if the opponent knocks early, we can lay off 6♣ or 6♦ on the set.
Opponent Hand’s Estimation Model (OHEM)
To estimate the opponent’s cards more accurately, we set up an opponent hand’s estimation model (OHEM), as Fig. 5 shows. At the beginning of the game, we create a 13 × 4 matrix for the opponent’s hand. Since we do not know any of their cards, we set every card’s probability to unknown. Then, in the first round, the opponent draws 6♥ and discards 3♦. The blue boxes mean that the opponent may not have the set of 3s or a run around 3. The orange boxes mean that they may have the set of 6s or a run around 6. At the end of the first round, we estimate that the opponent has a meld around 6 and does not have a meld around 3. Therefore, we marked the possibility of cards in melds of 3 as 0.5, and the further cards in the run (4♥ and 8♥) as 0.3. In the second round, after the opponent draws 8♥ and discards 9♣, we can estimate that they have the run 6♥, 7♥, 8♥ and may not have melds around 3 and 9. After each round, the matrix is updated, and we can match our cards against the estimate of their hand. By building the OHEM, we can estimate each card’s probability and reconsider whether we should discard or draw a specific card. We can also predict the melds they have and the next card they would like to draw. If the opponent has a high probability of owning a card, it is riskier to discard that card. OHEM helps us foresee the consequences of discarding and drawing a card and reduces the chance of choosing the wrong cards.

Fig. 5. OHEM. The dark orange boxes represent the cards that the opponent picks. The lighter orange boxes represent the cards the opponent may have. The dark blue boxes represent the cards the opponent discards. The lighter blue boxes represent the cards they may not have.

4.4 Knock Strategy
We found that there is a relationship between the time to knock and the win rate. From Fig. 6, we obtain a model for total deadwood as a function of round number. The figure shows an exponential regression of the Simple Gin Rummy Player’s deadwood obtained from 5,163 games between two Simple Gin Rummy Players. Because the number of rounds per game is not constant, in order to have an equal sample size for each round, the rounds remaining after the player knocks are filled in with the knocking deadwood. This results in a slight over-approximation of the player’s deadwood. The black dotted line marks a total deadwood of 10. The purple solid line is an exponential regression of the average total deadwood per round. The red dashed line is a linear regression of the average total deadwood per round. The orange dots are the mean total deadwood in each round. The function for the purple line is $y = Ae^{-Bx} + C$, where A = 53.88331799, B = 0.22199779, and C = 4.45358599. To avoid being undercut, it is better to knock when the player’s deadwood is under the purple line. To account for layoffs, a threshold is introduced that is subtracted from the exponential. In the Advanced strategy, the optimal threshold is 9, which was obtained experimentally by playing matches consisting of 100,000 games. In the Dynamic strategy, this threshold becomes a variable and changes based on the opponent.

Fig. 6. Relationship between total deadwood and rounds
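One possible reading of the knock rule is sketched below in Python, using the fitted constants A, B, and C reported in this section. The function names are hypothetical, and the exact comparison (strictly below the curve minus the threshold) is an assumption, since the text only says that the threshold is subtracted from the exponential.

```python
import math

# Regression constants reported for the Simple Gin Rummy Player's deadwood curve.
A, B, C = 53.88331799, 0.22199779, 4.45358599
MAX_KNOCK_DEADWOOD = 10  # rule of the game: knocking requires 10 deadwood or less


def expected_deadwood(round_number: float) -> float:
    """Modelled opponent deadwood y = A * exp(-B * x) + C at round x."""
    return A * math.exp(-B * round_number) + C


def should_knock(deadwood: int, round_number: int, threshold: float = 9.0) -> bool:
    """Knock only if legal and our deadwood is below the shifted curve.

    Assumption: the threshold shifts the whole curve down by a constant.
    """
    if deadwood > MAX_KNOCK_DEADWOOD:
        return False
    return deadwood < expected_deadwood(round_number) - threshold
```

Under this reading, the shifted curve falls below zero in late rounds (the curve tends to C ≈ 4.45, which is less than a threshold of 9), so the player stops knocking late in a game and instead waits for the opponent to knock.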
5 Dynamic Strategies
A dynamic strategy is an active and responsive strategy [8]. We create a scoreboard to keep track of the win rate. Based on the scoreboard, we decide whether to keep the current algorithms or to switch to more aggressive ones. At the beginning of the game, we choose the safest algorithm to minimize our total deadwood and maximize our chance of getting an undercut bonus. After playing more games, our algorithm decides whether to keep the current algorithm or change to another one.
5.1 Dynamic Knock Strategy
As we played more Gin Rummy games, we discovered that knocking as early as possible is not a good strategy [5, 9]. Even if the total deadwood is less than or equal to 10, we can wait more rounds to win more points. To determine when to change the threshold, we grouped games into batches of 10 to test each threshold. If the threshold works well, we keep it for the next 10 games; if it performs badly, we recalculate and switch to a new threshold. Over 100,000 games, the win rate was better when we grouped 30 games together.
Table 2. Four categories of points
Points | Explanation
x | Number of points we get from an undercut (the opponent knocks and our deadwood < their deadwood)
y | Number of points we lose to an opponent undercut (we knock and our deadwood >= their deadwood)
z | Number of points we get when we knock and have less deadwood
w | Number of points the opponent gets when they knock and have less deadwood
To determine whether a threshold performs well, we divide the total points into four categories (x, y, z, w). From Table 2, our total net points = x + z, while the opponent’s total net points = y + w. The total net points for InitialThreshold = x + z - y - w. Of the four categories, we prefer more points from x, which means that the opponent knocks earlier than we do while our deadwood is lower than theirs, so we earn the undercut bonus in each such round. In the first 10 games, we start with the safest threshold, 10. Based on the results of the previous 10 games, we recalculate the points in the four categories and the average net points for that particular threshold. If the average for the current threshold is not the maximum, we switch to the threshold with the greatest average net points; in all other cases, we perform the following algorithm. If x or y has the most points among x, y, and z, we set a larger threshold value. If z has the most points among x, y, and z, we set a smaller threshold value. w is not considered when changing the threshold, as we cannot make any decisions about knocking if the opponent is the one who knocks first. The threshold formula (F1) is newThreshold (nT) = oldThreshold (oT) ± 1, while nT = 117 nT -= 1 ENDIF ENDIF ENDFor Output: nT
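The update rule described in the prose above can be sketched in Python. This sketch follows only the prose rule and formula F1 (a step of ±1), not the full algorithm listing; the function names, the dictionary of per-threshold averages, and the handling of ties (z must be strictly the largest to lower the threshold) are assumptions, and no bounds on the threshold are applied.

```python
def net_points(x: int, y: int, z: int, w: int) -> int:
    """Total net points for a batch: our points minus the opponent's."""
    return x + z - y - w


def next_threshold(current: int,
                   avg_net_by_threshold: dict,
                   x: int, y: int, z: int,
                   step: int = 1) -> int:
    """Pick the threshold for the next batch of games.

    avg_net_by_threshold maps each threshold tried so far (including
    `current`) to its average net points per batch.
    """
    # If another threshold has a better average so far, switch to it.
    best = max(avg_net_by_threshold, key=avg_net_by_threshold.get)
    if avg_net_by_threshold[best] > avg_net_by_threshold[current]:
        return best
    # Otherwise apply F1: lower the threshold when our own knocks (z)
    # dominate, raise it when undercut points (x or y) dominate.
    if z > max(x, y):
        return current - step
    return current + step
```

For instance, starting from the initial threshold 10 with only that threshold recorded, a batch dominated by undercut points x would move the threshold to 11 under these assumptions.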
Fig. 7. The change of thresholds in dynamic strategy.
Fig. 8. Strategies vs. win rate. The blue line represents the Advanced strategy, which uses static thresholds. The threshold of 10 gets the highest win rate, 61.376% over 100,000 games. The red line represents the win rate of the Dynamic strategy. The dynamic threshold that starts at 10 also gets the highest win rate, 67.735%. The grey area is the range of each threshold’s win rate over 5 runs.
5.2 Dynamic Draw/Discard Strategy
After employing our dynamic knock strategy and obtaining a large improvement in win rate, we decided to try our dynamic algorithms in the draw and discard strategies [7]. However, after experimenting with different thresholds in both strategies, we found that the win rate stayed nearly the same. The thresholds in the discard strategy determine which cards to discard. Because the opponent hand’s estimation is not always accurate, it sometimes does not matter whether we discard the mdd or the biggest matched card. Thus, we continued to use the same thresholds for both strategies and did not use dynamic versions of the draw and discard strategies.
6 Experimental Results
Table 3. Best win rate (among 100,000 games)
         | Simple  | Advanced | Dynamic
Simple   | 50.034% |          |
Advanced | 60.419% | 49.847%  |
Dynamic  | 67.735% | 49.732%  | 50.013%
* The win rate values belong to the left column’s strategies
From Table 3, it is clear that both the Advanced and Dynamic strategies have superior win rates against the Simple strategy. Because the Simple strategy knocks as early as possible and has very basic draw and discard algorithms, we concluded that the optimal knock strategy against it is to abstain from knocking entirely, because of the very high likelihood of getting the undercut bonus. In contrast, when playing against the Advanced strategy, which has more advanced draw, discard, and knock strategies, abstaining from knocking was not as effective, because the Advanced strategy knocks only when it has below-average deadwood. This is why the Dynamic strategy was introduced: it includes a dynamically changing threshold that is subtracted from the knock equation introduced in the Knock Strategy and Dynamic Knock Strategy sections. This allows the Dynamic strategy to resemble an abstain-from-knocking strategy against the Simple strategy, which explains the increased win rate compared to the Advanced strategy. On the other hand, because of the constant fluctuations of the knock threshold in the Dynamic strategy, the threshold deviates from the optimal value of 10, which explains why the Dynamic strategy performs slightly worse than the Advanced strategy when they play against each other.
7 Conclusion and Future Work
After implementing the strategies, our win rate against the Simple strategy improved significantly (from around 50% to over 67%). This was accomplished through draw, discard, hand estimation, and knock strategies. In particular, we created a dynamic knock strategy that attempts to find the optimal time to knock depending on how passively or aggressively the opponent is playing; it contributed the greatest increase in win rate of all the strategies mentioned. We learned that choosing when to knock is the most important decision that a player needs to make, because of the 25-point undercut bonus (one quarter of the total points needed to win a game). In future studies, it may be worthwhile to consider dynamic versions of the other strategies and a card prediction algorithm for choosing from the random pile.
References
1. DeepStack: Expert-Level Artificial Intelligence in Heads-up No-Limit Poker. Science 356(6337), 508–513 (2017)
2. Lucas, S.M., Kendall, G.: Evolutionary computation and games. IEEE Comput. Intell. Mag. 1(1), 10–18 (2006)
3. Kotnik, M.C., Kalita, J.: The significance of temporal-difference learning in self-play training TD-rummy versus EVO-rummy. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003) (2003)
4. Arrington, R., Langley, C., Bogaerts, S.: Using domain knowledge to improve Monte-Carlo tree search performance in parameterized poker squares. In: AAAI Conference on Artificial Intelligence (2016)
5. Ponsen, M., Tuyls, K.P., Jong de, S., Ramon, J., Croonenborghs, T., Driessens, K.: The dynamics of human behaviour in poker. In: Nijholt, A., Pantic, M. (eds.) Proceedings of the 20th Belgian-Netherlands Conference on Artificial Intelligence (BNAIC 2008), pp. 225–232. Universiteit Twente (2008)
6. Francisco, M.: AI Learning Gin Rummy. Towards Data Science (2017)
7. Davidson, A., Billings, D., Schaeffer, J., Szafron, D.: Improved opponent modeling in poker. In: Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI 2000), pp. 1467–1473 (2000)
8. Gibson, R., Szafron, D.: On strategy stitching in large extensive form multiplayer games. In: Proceedings of the 24th International Conference on Neural Information Processing Systems (NIPS 2011) (2011)
9. Sandholm, T.: The state of solving large incomplete information games, and application to poker. AI Magazine, pp. 13–32. Special issue on Algorithmic Game Theory (2010)
Wireless Sensor Network Smart Environment for Precision Agriculture: An Agent-Based Architecture
AbdulMutalib Wahaishi1(B) and Raafat Aburukba2
1 Rochester Institute of Technology University, Rochester, NY 14623, USA
2 Department of Computer Science and Engineering, American University of Sharjah, Sharjah, UAE
Abstract. Wireless sensor networks (WSN) are becoming a prominent technology of great importance in many agricultural applications. Precision agriculture (PA) is one of the areas that can benefit significantly from this technology, as it enables high-resolution data collection and thus efficient and effective decision-making. Features such as cost-effective monitoring, collection, and transmission of real-time data to facilitate decision-making directives make the WSN a perfect candidate for precision agriculture. In this paper, an agent-based architecture for precision agriculture is presented. Unlike traditional approaches, entities are modeled as autonomous agents that collaborate to monitor soil, crops, and climate in an agricultural field and hence provide timely decisions to facilitate agricultural activities such as irrigation and fertilizer and pesticide application for specific parts of a field, proactively and in real time. An important issue within the PA environment is the heterogeneity and diversity of the data. The data is collected from distributed heterogeneous devices and sensors that are interconnected through the Internet. These issues lead to a lack of interoperability and thus make processing, analyzing, and interpreting PA data a challenging task. The proposed architecture was implemented using JADE (Java Agent Development platform) to support monitoring and control capabilities in precision agriculture environments.
Keywords: Precision Agriculture · Wireless Sensor Network · Agent Technology
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 556–572, 2022. https://doi.org/10.1007/978-3-030-82193-7_37

1 Introduction
Precision agriculture (PA) utilizes modern technologies to provide valuable insights, treatment, operation management, and decisions for quality crop production. The combination of the Internet of Things (IoT) with wireless sensor networks (WSN) is the enabling technology that contributes to making precision agriculture a reality. Features such as cost-effective monitoring, collection, and wireless transmission of real-time data to facilitate decision-making directives make IoT and WSN perfect candidates for precision agriculture. PA is characterized by spatial variation and incessant dynamic changes in environmental aspects, which vary from one field to another. Aspects such as soil, temperature, drainage, humidity, water, and vegetation strongly affect crop growth and production. Although farmers and agriculturists have always been aware of these factors, they were not able to quantitatively measure these environmental variations, predict outcomes, and proactively take the appropriate actions. Various measurement readings provide useful guidelines to avoid water stress by projecting irrigation schedules, improve production efficiency and product quality, and reduce the environmental impact. It is noteworthy that in such environments, context-sensitive assistance and support that can autonomously plan timely directive actions and pre-emptive guidelines becomes a vital necessity. Within the PA domain, the approach makes the most of WSN technology to provide services, expert guidelines, and farming-relevant information, and hence will radically transform the way agriculture-related services are conceived and delivered. This paper presents a novel architecture that provides efficient means and decision-making mechanisms geared towards maximizing production and profits while taking into consideration the optimal utilization of various natural and artificial farming resources in precision agriculture. The architecture models PA entities as autonomous agents that collaborate to monitor soil, crops, and climate in a farm field and hence provide timely decisions to facilitate agricultural activities such as irrigation and fertilizer and pesticide application for specific parts of a field, proactively and in real time.
Given the open, non-deterministic, and dynamic nature of PA, the entities participating in PA activities require the ability to change configuration according to their roles, such as agronomists, inspectors, aquatic ecologists, foresters, and arborists. Moreover, farmers and agricultural personnel need spontaneous answers to various questions that are usually susceptible to conjecture and speculation and prone to very intuitive analysis. It is clear that the design and implementation of solutions within distributed, open, and heterogeneous milieus such as PA environments require a new modeling paradigm, integration architectures, and reliable, uninterrupted, self-configurable services. Such a paradigm must describe the organization of the entities within the PA environment and the interconnections among them. Moreover, it must implement real-time smart monitoring to facilitate the various agricultural operations by managing the activities, enabling coordination among the various PA entities, and supporting ad hoc and automated configurations. Given these characteristics of the PA environment, agent-orientation is a suitable design paradigm, as it provides the ability to monitor and coordinate PA tasks among services through adequate interaction mechanisms. Such a paradigm enables modeling the PA environment as an open, distributed, and heterogeneous environment in which an agent operates as part of a community of systems as well as human users. In this work, the notion of agent-hood is defined as a metaphorical conceptualization tool that captures, supports, and implements features at a high level of abstraction (the knowledge level), which is useful for the PA environment. These features are classified as primary features, such as rationality and coordination, and secondary features, such as intelligence and learning. The paper is organized as follows: Sect. 2 presents the evolution of the agriculture domain with the evolution of technology, Sect. 3 presents the Agriculture 4.0 conceptual model, Sect. 4 provides a review of enabling technologies and existing related approaches, Sect. 5 presents details of a multi-agent architecture for PA, Sect. 6 presents the architecture and the necessary means and decision-making mechanisms that are geared towards the optimal utilization of natural and artificial farming resources in precision agriculture, Sect. 7 presents a prototype implementation using JADE (Java Agent Development platform) to support monitoring and control capabilities in precision agriculture environments, and Sect. 8 provides a conclusion and discussion.
1.1 Agriculture Evolution
The practice of agriculture has existed since ancient times, when humans cultivated land and raised animals to survive. As technology evolved, the agriculture domain adopted the technology of the day and moved from manual processes to automated ones. Figure 1 presents the evolution from Agriculture 1.0 to the current Agriculture 4.0, following the industrial evolution. Traditional Agriculture 1.0 relied on simple tools, manpower, and animal forces, with low productivity. In the 19th century, with the industrial evolution of steam engines, Agriculture 2.0 introduced machinery operated by farmers, as well as the use of chemicals.
Fig. 1. Evolution of agriculture with the industrial evolution. Timeline along a Time axis up to 2020: Agriculture 1.0 (use of man and animal forces, simple tools such as shovels); Agriculture 2.0 (use of machinery and chemicals); Agriculture 3.0 (use of computer programs and robots); Agriculture 4.0 (smart systems and devices).
This increased the productivity of farm operations; however, it also brought other issues to the agriculture domain, such as field contamination, destruction of the eco-environment, and waste of natural resources. In the 20th century, the evolution of computing, with computer programs and robotics, enabled agricultural machines to perform farming operations efficiently and intelligently, leading to the third agricultural evolution (Agriculture 3.0). This reduced chemical usage by optimizing the use of machinery and improved the precision of irrigation. Current technologies and techniques of the fourth industrial revolution, such as Cloud Computing, the Internet of Things (IoT), Big Data, Artificial Intelligence, and Wireless Sensor Networks (WSN), allowed further improvements in the agriculture domain and gave rise to Agriculture 4.0.
Such technologies and techniques provide significant improvements in operation and production within the agriculture domain. The work in [1] utilized IoT and WSN to optimize production efficiency through efficient use of energy and water, while maximizing agricultural quality and minimizing the impact on the environment. Moreover, machine learning techniques were used in [19] to provide rich recommendations and insights that help farmers make adequate decisions within the agriculture domain. Other works looked into the efficient use of water for irrigation by utilizing IoT, WSN, and Cloud Computing [2, 20]. This paper explores Agriculture 4.0 and the development of smart systems and devices using the multi-agent system paradigm with the support of current technologies such as IoT, WSN, and Cloud Computing.
1.2 Agriculture 4.0 Conceptual Model
The conceptual model presents the layers required to deliver Agriculture 4.0, which enables precision agriculture. Figure 2 shows the architecture with the essential layers. The following subsections elaborate on the main functionalities of each layer and its contribution to enabling Agriculture 4.0.
Fig. 2. Agriculture 4.0 conceptual model
Physical Layer. The current Industry 4.0 requires the definition of any object into the physical space and the digital space. The physical layer consists of domain related sensors (such as moisture sensor), transducers, and micro controllers/servers. The sensors provide the electrical signal readings within the agriculture environment. The electrical signals with the transducers are then shifted, scaled, and/or amplified with the micro controller/server digital, analog, and serial communication ports. Some decisions could be made with the aid of the micro controller/server device at the edge. such decision could be related to the operation of some of the resources related to the agriculture environment such as the irrigation system. The collected data from the physical layer is transmitted through the agriculture communication network layer through the supported protocols for the purpose of data storage and processing. Communication Layer. This layer is to make use of the adequate communication protocols and technologies in the transmission of the collected data from the sensor devices
560
A. Wahaishi and R. Aburukba
within the agriculture domain. This is an essential layer that provides the ability to connect the physical resources (in the physical layer) into the digital space. The digital space usually does not exist within the same network as the physical location of the farm, and hence, this requires access to the Wide Area Network (WAN). This can be either wired or wireless. Wireless communication network has been around for decades and has evolved from the first-generation to the current fifth generation of mobile communication network. Data Collection and Storage Layer. This layer provides the ability to collect, aggregate, and model the data captured from the physical layer. This data provides the ability for the control layer to provide the adequate decisions based on the collected agriculture data from the farmlands. The aggregated data may involve current collected data, as well as, historical data, depending on the model generated. Data from the agriculture domain can be either structured or semi-structured. Control Layer. This layer provides the ability to manage and decide upon specific criteria within the agriculture domain such as increasing productivity, allocating resources effectively, adapting to climate changes, and avoid food wastage. the layer contains Decision Support Systems (DSS) that provides the ability to provide the right decision as the right time depending on the implemented criteria. The DSS may utilize specific design paradigm and techniques for the agriculture domain such as Agent Oriented (AO) design paradigm and machine learning approaches that utilized current and historical data from the data collection and storage layer. Application Layer. This Layer Provides the Ability for the End User Within the Precision Agriculture to Interact with the System. The Layer Provides the Web Portal with the Exposed Services and Functionalities. Security Layer. 
This is an essential layer, since each of the aforementioned layers can be vulnerable to various malicious attacks. Agriculture 4.0 solutions must implement security practices to protect the full environment from information disclosure, as well as from unauthorized access, improper usage, modification, or destruction of resources and services. The goal of the security layer is to safeguard the PA assets from any vulnerability risks, provide confidentiality, and ensure the integrity and the overall intrinsic service availability. These goals are supported through the use of authentication, authorization, and auditing processes.
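The edge-level processing described for the physical layer (scaling a raw signal and taking a local irrigation decision) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 10-bit ADC range, the linear gain/offset calibration, and the 30% moisture threshold are all assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCalibration:
    """Hypothetical linear mapping from raw ADC counts to a physical unit."""
    gain: float    # scale factor applied to the raw count
    offset: float  # shift applied after scaling

    def convert(self, raw_count: int) -> float:
        """Scale and shift a raw reading into engineering units."""
        return self.gain * raw_count + self.offset


# Assumed: a 10-bit ADC (0..1023) mapped linearly onto 0..100 % soil moisture.
MOISTURE_CAL = LinearCalibration(gain=100.0 / 1023.0, offset=0.0)

# Assumed threshold below which the edge device would start irrigation.
IRRIGATION_THRESHOLD_PCT = 30.0


def should_irrigate(raw_count: int,
                    calibration: LinearCalibration = MOISTURE_CAL,
                    threshold_pct: float = IRRIGATION_THRESHOLD_PCT) -> bool:
    """Edge-level decision: irrigate when the calibrated reading is too dry."""
    return calibration.convert(raw_count) < threshold_pct
```

For example, `should_irrigate(200)` evaluates a reading of roughly 19.6% moisture and would request irrigation, while a reading of 800 counts (about 78%) would not. Keeping the calibration separate from the decision rule lets either be replaced without touching the other.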
2 Enabling Technologies for Precision Agriculture

With the rapid advancement in information technology, tremendous effort has been devoted to developing new technological venues and design paradigms, such as cloud computing, ubiquitous computing, and context-aware computing, in many application domains. Traditional agricultural management and processing methods need to be complemented by innovative sensing and driving technologies and state-of-the-art technologies. Current trends of Agriculture 4.0 have adopted innovative design and modelling paradigms as
Wireless Sensor Network Smart Environment for Precision Agriculture
well as AI-based technologies. Precision agriculture is one of the promising application domains for which several technologies have been proposed, such as adaptive planning and scheduling, pattern recognition, neural networks, machine learning models and methods, big data, and predictive analytics [3, 9, 14, 25, 35]. Although more complex definitions exist, it has been claimed that the term PA was first coined in 1997 by the US House of Representatives, which defined PA as “an integrated information- and production-based farming system that is designed to increase long term, site-specific and whole farm production efficiency, productivity and profitability while minimizing unintended impacts on wildlife and the environment”. Within this context, Precision Agriculture is viewed as a farming management concept based upon monitoring, computing, and responding to variability either in crops and produce or in aspects of animal rearing and cultivation [6]. Some definitions were shaped by modern technical approaches to climate change, such as climate-smart agriculture, in which PA aims to develop the structural policies and techniques needed to achieve sustainable agricultural development and food security under climate change. Other definitions envisioned that PA is not restricted to cropping systems per se but includes the entire agricultural production system, such as animal industries, fisheries, and forestry. Other approaches were proposed in the context of context-aware computing and grid-computing-based architectures [7, 17]. However, the main focus of these paradigms was the automation of collaboration activities and computation on a global scale. Several solutions were proposed to support agricultural activities and domain services that use real-time sensing networks for site-specific sprinkler irrigation and control of watering via Bluetooth wireless technology [5, 11, 12, 18, 23].
Other approaches proposed several ways of controlling infertility through sensing capabilities and hence increasing soil fertility [13, 16]. Some of the existing approaches provide decision-based monitoring architectures for pesticide control and disease prediction that address climate change and hence predict rampant pest outbreaks and diseases [6, 29, 30]. The use of wireless sensors to observe climate- and soil-related parameters such as humidity, temperature, and plant moisture as vital indicators of disease development in potato fields has also been demonstrated. Phenomena such as the Internet of Things [31] and Big Data, as well as the utilization of Cloud Computing resources and Artificial Intelligence techniques, are gaining prominent acceptance and are expected to leverage the development of various precision agriculture applications [32]. Massive volumes of varied data can be captured, analyzed, and thus used for better automated control and decision-making. Wolfert provides a comprehensive review of the state of the art of big data applications in smart farming and identifies the related socio-economic challenges [34]. In spite of the rapid development of computer technology, field measurements of agricultural environment parameters and spatial data collection still depend heavily on human intervention, stationary sensors, and very traditional data repositories such as data loggers, paper-based records, and static storage sources, which are labor-intensive and prone to serious errors. In this work, Precision Agriculture is viewed
as: an abstraction level at which the agricultural environment is viewed as a coherent universe that provides coordination solutions to a variety of agricultural entities in open distributed and heterogeneous settings.
3 Multi-agent Architecture for Precision Agriculture

Farmers and agricultural personnel such as agronomists, inspectors, aquatic ecologists, foresters, and arborists within a PA environment need prompt answers to various questions that are usually susceptible to a lot of conjecture and speculation and prone to very intuitive analysis. The main challenge in PA is how to adopt an adequate technology that provides means and mechanisms to improve field-level management with regard to crop monitoring, farming practices, and crop needs. The multi-agent design paradigm provides improved integration architectures and services for monitoring and control of the PA environment. Moreover, the multi-agent design paradigm makes it possible to implement real-time smart monitoring to facilitate various agriculture operations, and thus enables the interaction amongst various PA entities and supports ad hoc and automated configurations. In the traditional point-to-point interaction patterns, entities engage directly with each other to satisfy controlled and directed coordination. Such a configuration is both inflexible and computationally expensive. For instance, there is no separation of concerns and responsibilities between the computation elements and the required interactions and coordination. The lack of a distinct medium that deals solely with the coordination activities imposes a considerable overhead on the entities, which have to carry out the “interaction work” themselves to satisfy a common or local task, in addition to their fundamental computational responsibilities. As an alternative, a capability-based coordination approach for interaction is a favorable and effective choice in which entities do not need to be concerned with how the interaction is performed.
In this work, we consider agent technology as the applicable medium that supports a coordination approach based on the capabilities of the agents within the PA environment. The notion of agent-hood is defined as a metaphorical conceptualization tool that implements features within the PA environment at a high level of abstraction, at the knowledge level. These features are classified as primary features, such as rationality and coordination, and secondary features, such as intelligence and learning. The multi-agent architecture for PA takes into consideration the creation of a smart environment with the ability to combine distributed information sources that may have different schemas. Furthermore, it provides a level of assistance to agriculturists through the ability to integrate distributed multi-modal sensors and actuators and to further analyze the integrated information. Such heterogeneity of the PA environment makes it harder to achieve interoperability among the different PA participants. Many semantic solutions have been proposed in the literature to solve the problem of data heterogeneity and provide interoperability between the devices, sensors, and relevant entities in PA, such as semantic annotations, linked sensor data, and ontologies. In addition, different information models might use different ontology languages for their representations, which makes semantic interoperability a complicated activity.
Thus, our presented agent-based architecture defines an ontological view and representation that acts as an explicit semantic ontology, which consequently provides meaningful integration processes such as matching, mapping, and aligning, as well as enabling interoperability among the PA entities.

3.1 Modeling the Precision Agriculture Smart Environment

The architecture for a smart PA environment can be viewed as a multi-layered system. The nodes of a WSN are composed of sensors that collect data from the agriculture field. Sensors may also get data from other nodes by exchanging messages. The data obtained from the WSN describe how properties vary within agriculture fields. This information is then used to analyze and predict values that provide efficient control of important farming operations. The results of the interventions are monitored, recorded, and used as a starting point to initiate the next crop cycle. From the collection and intelligent analysis of these parameters, interventions such as fertilization, irrigation, and plant protection treatments can be planned, tailored to the actual needs of the crop, and implemented only when necessary. The positive effects of site-specific management of agricultural practices include not only improvements in the quantity and quality of products, but also a significant reduction in production costs and more effective protection of the environment through reduced use of pesticides. This approach also allows identifying and evaluating the risk indices of plant pathology in order to implement appropriate strategies of “integrated pest management”. Figure 3 shows the proposed PA environment as a collection of software agents that interact autonomously and cooperatively analyze information, generate valuable decisions, and accordingly provide agricultural directives and guidelines.
Fig. 3. Smart environment monitoring and control architecture for PA
Agent-orientation provides the ability to develop entities that can detect and react to the dynamic stimuli of the PA environment. Agent-hood features such as autonomy, reactivity, and mobility share certain desirable similarities with WSNs. These similarities can be summarized as follows:
• Reactivity: Agents are able to detect and react to events and external stimuli.
• Ability to work asynchronously: The agents come into operation upon receipt of a stimulus. In PA, this feature favors the implementation of techniques to ensure that messages are asynchronously delivered and managed over the network.
• Autonomy: Agents have control over their actions and their various internal states without human intervention. In a PA environment, this condition is very useful, since if any node in the network fails, the network must be able to meet its target objectives without impacting the overall common goals such as sustainability, availability, and robustness.
• Communication capability: Agents need to interact and communicate to achieve their goals. The communication component defines the language and the set of protocols for exchanging messages. This ability is reflected in the distribution of information obtained from sensors or from other network nodes.
• Coordination and cooperation: The agents must be able to perform tasks and coordinate their interactions through collaboration with other agents. Coordination defines both the system structure and the interaction mechanism, and cooperation defines the policies that govern the interaction.
• Mobility: This is the ability of an agent to move around a specific network. The deployed node is mobile so that it can either track or get the data from a corresponding PA field.

Agent Architecture.
The agents of the PA environment are built on the foundation of the Collaborative Intelligence Rational (CIR) agent architecture, with the focus on using the model to capture each participating agent’s behavior towards achieving the desired goal. Each agent has two main components: 1) the knowledge component and 2) the capability component. Each component is designed according to the agent’s specific role [21]. The knowledge component consists of the following:
• The agent’s self-knowledge based on its role within the PA environment.
• The knowledge of the other agents within the PA environment.
• The specific goals to be satisfied within the PA environment.
• Possible solutions to satisfy each goal within the PA environment.
• The local history of the PA environment, which consists of all possible local solutions for an agent at any given time.
• The agent’s desires, commitments, and intentions toward achieving each goal within the PA environment.
The capability component consists of the following:
• The domain actions, which contain the possible set of actions that impact the state of the PA environment when executed.
• The communication, through which each agent within the PA environment sends and receives messages.
• The problem solver, which consists of reasoning algorithms that are executed based on the role of the agent to satisfy its goal. The solution generated by the problem solver depends on the aforementioned knowledge component.
From Fig. 3, we can observe the interdependency challenge among the Ontology Agent, the Personal Assistant Agent, and the Gateway Agent. Interdependency is defined as a goal-relevant interrelationship between actions performed by various participating agents [22]. The coordination mechanisms reduce and resolve the issues associated with interdependencies. Furthermore, the agent’s interaction module identifies the type of interdependencies that exist within the PA environment and accordingly utilizes the suitable interaction device. The interaction devices are categorized as contract-based, which includes the assignment device, and negotiation-based, which includes resource scheduling, conflict resolution, synchronization, and redundancy avoidance devices [10]. The interdependency within the PA domain is classified as capability interdependency, and the interaction device is the “assignment” device. The main function of the assignment device is to resolve problems associated with capability interdependencies. With reducing the complexity of achieving a goal as the agent’s main objective, a solution can be selected based on the goal quality. The basic characteristics of the assignment device are problem specifications, evaluation parameters, and the sub-processes. The problem specifications might include, for example, the request, the desired-satisfying time, and the expiration time. A collection of basic components comprises the structure of the agent model and represents its capabilities.
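The assignment device described above can be sketched in Python as a simple selection over offers. This is only an illustration under stated assumptions: the `Offer` type, the single numeric `quality` score standing in for the evaluation parameters, and the tie-breaking rule are hypothetical and are not taken from the original architecture.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ProblemSpecification:
    """Problem specification fields named in the text."""
    request: str
    desired_satisfying_time: float  # preferred completion time (seconds)
    expiration_time: float          # hard deadline (seconds)


@dataclass(frozen=True)
class Offer:
    """Hypothetical offer returned by an agent that can serve the request."""
    agent_name: str
    completion_time: float  # estimated completion time (seconds)
    quality: float          # assumed goal-quality score, higher is better


def assign(spec: ProblemSpecification, offers: List[Offer]) -> Optional[Offer]:
    """Select one offer: drop expired ones, prefer on-time ones, then quality."""
    valid = [o for o in offers if o.completion_time <= spec.expiration_time]
    if not valid:
        return None  # no agent can satisfy the request before it expires
    return max(
        valid,
        key=lambda o: (o.completion_time <= spec.desired_satisfying_time, o.quality),
    )
```

With a desired-satisfying time of 10 s and an expiration time of 20 s, an offer that completes in 5 s is preferred over a higher-quality offer that needs 12 s, and an offer that needs 30 s is discarded outright.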
The proposed architecture defines three types of rational intelligent agents, namely the Personal Assistant Agent (PAA), the Ontology Agent (OA), and the Gateway Agent (GA).

Personal Assistant Agent (PAA): possesses the capability of three major components: 1) the communication component, 2) the data analysis component, and 3) the decision-making component. The role of the PAA is to formulate the agriculturists’ requests and collect the various climate-relevant readings obtained from the PA environment. Upon receiving these readings, the PAA analyzes the information and determines the ideal settings and the recommended set of needed actions (for example, if the humidity or temperature exceeds or falls below a predefined value, an action alert will be generated). In order for the PAA to capture the other agents’ models, it has to communicate only about facts that can be represented and expressed in the defined ontology view and representation. In other words, agents need to adhere to the ontological semantics and structure. Moreover, in order for the PAA to fulfill this capability, it asks the Ontology Agent (OA) for the definitions and possibly requests translation of agricultural terms and concepts inherent in a specific request. Ontologies provide support for semantic data and semantic integration. However, there are many cases where organizational, cultural, or infrastructural constraints hinder or even disallow the adoption of such semantic artifacts; in other words, there is a lack of explicit ontologies.
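The PAA's alert rule (a reading that exceeds or falls below a predefined value triggers an action alert) can be sketched as follows. The parameter names and the numeric ranges are assumptions made for illustration; the original work does not state concrete thresholds.

```python
from typing import Dict, List, Tuple

# Assumed acceptable ranges as (minimum, maximum); values are illustrative only.
IDEAL_RANGES: Dict[str, Tuple[float, float]] = {
    "temperature_c": (10.0, 30.0),
    "humidity_pct": (40.0, 85.0),
}


def evaluate_readings(readings: Dict[str, float],
                      ranges: Dict[str, Tuple[float, float]] = IDEAL_RANGES) -> List[str]:
    """Return one alert string per reading that lies outside its range."""
    alerts: List[str] = []
    for name, value in sorted(readings.items()):
        if name not in ranges:
            continue  # no rule is defined for this parameter
        low, high = ranges[name]
        if value < low:
            alerts.append(f"{name} below minimum: {value} < {low}")
        elif value > high:
            alerts.append(f"{name} above maximum: {value} > {high}")
    return alerts
```

Calling `evaluate_readings({"temperature_c": 35.0, "humidity_pct": 60.0})` yields a single alert for the temperature, while readings inside both ranges yield an empty list.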
Ontology Agent (OA): captures and defines the set of agriculture concepts, the set of activities, and the domain events. It provides various abstraction levels of domain knowledge within PA-related applications and hence drives the decision-making process. The corresponding ontology specifies a formal representation of the different agricultural concepts (semantics and syntax) and the different relationships amongst these concepts. This ontology must be agreed upon and understood by the agent community that constitutes the PA environment (or at least by part of it) in order to enable each agent to understand messages from other participants. The formal precision agriculture ontology presented in this work adopts a structure similar to that defined by Goumopoulos [11], which can be formally represented as follows:

$$PAOnto = \langle OntDes, PAConcept, PARelation, Axioms \rangle$$

The OntDes is a tuple that consists of the following elements:

$$OntDes \equiv \langle OntID, DevId, ver, md \rangle$$

where OntID is the ontology name, DevId defines the device name, ver specifies the time of the ontology’s creation, and md represents the source and purpose metadata of the ontology. PAConcept represents the set of PA concepts and terminologies; PARelation defines the set of both hierarchical and non-hierarchical relationships between the PA concepts; and Axioms are the assertions (including rules) in a logical form. The representation of the PA ontology captures the knowledge of various agricultural resources in terms of their classifications and descriptions.
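One way to make the PAOnto and OntDes tuples concrete is as immutable records. The Python sketch below is a hypothetical encoding: the field types, the relation triple layout, the sample values, and the `dangling_relations` consistency check are assumptions for illustration and are not part of the original formalism.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple

# Assumed encoding of a relation: (relation name, source concept, target concept).
Relation = Tuple[str, str, str]


@dataclass(frozen=True)
class OntDes:
    ont_id: str  # OntID: ontology name
    dev_id: str  # DevId: device name
    ver: str     # ver: time of ontology creation
    md: str      # md: source and purpose metadata


@dataclass(frozen=True)
class PAOnto:
    descriptor: OntDes              # OntDes
    concepts: FrozenSet[str]        # PAConcept
    relations: FrozenSet[Relation]  # PARelation
    axioms: Tuple[str, ...] = ()    # Axioms, kept here as plain strings

    def dangling_relations(self) -> Set[Relation]:
        """Relations that mention a concept missing from the concept set."""
        return {
            rel for rel in self.relations
            if rel[1] not in self.concepts or rel[2] not in self.concepts
        }


# Illustrative instance; every value below is made up for the example.
example = PAOnto(
    descriptor=OntDes("PotatoPestControl", "gateway-01", "2021-01-01", "illustrative example"),
    concepts=frozenset({"Insecticide", "Aphid", "Potato"}),
    relations=frozenset({("kills", "Insecticide", "Aphid")}),
)
```

A check such as `example.dangling_relations()` returning an empty set is one simple way for an agent to confirm that every relation refers only to declared concepts before the ontology is shared.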
In this work, the knowledge representation of the ontology adheres to the five-element ontology approach, which can be formally represented as follows:

$$Ont \equiv \langle C, Attr^{c}, Rel^{c}, Attr^{Rel}, H^{c} \rangle$$

Here $C$ defines the set of agricultural concepts and terminologies; $Attr^{c}$ is the set of all the possible states defined by the attributes of each concept; $Rel^{c}$ is the set of relationships among the ontological agricultural concepts; $Attr^{Rel}$ defines the various attributes of a specific relationship along with the applicable multiplicities (all dependencies can be further viewed in terms of generalization (is-a) or aggregation (part-whole) relationships); and $H^{c}$ specifies the concept hierarchy and its sub-elements. For example, the potato’s insecticide control ontology shall adhere to the aforementioned representation, in which the elements will be defined as follows:
• $C$ = {Plant, Insects, Pests, Insecticide, potato, Blister beetle, Aphid, Grubs, Russell Potato, White Potato, moisture, hydrometer}
• $Attr^{c}$ = {(Name, Plant), (Reading, Moisture), (Poisoning percentage, Insecticide), (Color, Grubs)}
• $Rel^{c}$ = {Kill (Insecticide, Aphid)}
• $Attr^{Rel}$ = {((Name, (is-a), Plant)), ((Nutrients, (is-composed-of), ingredients))}
• $H^{c}$ = {(Potato, Solanaceae), (Blister beetle, Insect), (Aphid, pests), (Herbicides, Insecticide)}.

The Gateway Agent (GA): has the role of information gathering and receives sensors’ readings based on pre-set sampling intervals. For each sensor set, the GA is assigned the
task of retrieving readings from the underlying scattered sensors based on the defined ontology. The GA monitors the status of the sensors, performs checking operations, and sets the data collection frequency. It is noteworthy that the PA environment encompasses different kinds of sensors with various specifications. The GA supports on-demand reading delivery as well as the full pre-defined measurement lifecycle. The knowledge component of the GA includes the models of the sensors available in the PA environment, starting from the registration of physical sensor nodes and covering the addition or removal of a physical sensor. The knowledge component also contains a description of the sensors’ measuring capabilities, the agricultural terms understood, and the relationships between the concepts. The domain knowledge provides mapping models for transforming a domain query into a set of queries to the appropriate target sensors. All these models need to be stated in an expressive common language that allows capturing all the relevant distinctions amongst various terms and concepts as per the aforementioned ontology view. The gateway agent’s capability component identifies and dynamically selects the appropriate sensor and accordingly retrieves the relevant readings. Query processing requires developing a plan for obtaining the requested data. It is noteworthy that the gateway agent also acts as a firewall and a proxy server to prevent unauthorized access to the private agricultural field network.
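The Gateway Agent's mapping from a domain query to the appropriate target sensors can be sketched as a lookup over a sensor registry combined with an is-a hierarchy. The Python sketch below is a hypothetical illustration: the class name, the registry layout, and the concept names used in the usage note are assumptions and do not reproduce the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GatewayAgentSketch:
    # Assumed registry: sensor id -> ontology concept the sensor measures.
    registry: Dict[str, str] = field(default_factory=dict)
    # Assumed is-a hierarchy: child concept -> parent concept.
    is_a: Dict[str, str] = field(default_factory=dict)

    def register(self, sensor_id: str, concept: str) -> None:
        """Add a physical sensor node to the agent's sensor model."""
        self.registry[sensor_id] = concept

    def remove(self, sensor_id: str) -> None:
        """Remove a physical sensor node; unknown ids are ignored."""
        self.registry.pop(sensor_id, None)

    def _matches(self, concept: Optional[str], query: str) -> bool:
        """True if the concept equals the query or is a descendant of it."""
        seen = set()
        while concept is not None and concept not in seen:
            if concept == query:
                return True
            seen.add(concept)  # guards against cycles in the hierarchy
            concept = self.is_a.get(concept)
        return False

    def plan_query(self, query_concept: str) -> List[str]:
        """Translate a domain query into the list of sensors to be read."""
        return sorted(
            sensor_id
            for sensor_id, concept in self.registry.items()
            if self._matches(concept, query_concept)
        )
```

As a usage example, with `SoilMoisture` and `LeafWetness` both declared as kinds of `Moisture`, a query for `Moisture` resolves to every sensor registered under either concept, while a query for `Temperature` resolves only to temperature sensors. Removing a node from the registry immediately removes it from later query plans.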
4 Agent-Based PA Implementation Directives

The proposed architecture is validated by implementing the PA system in a real environment. The implementation of the system is viewed as a collection of tiers and layers with various autonomous agents that can act independently, cooperate in providing services, and synergize agricultural, environmental, and spatial data according to mutual interests. The model provides querying ability and coordination activities that enhance the overall connectivity of distributed, autonomous, and heterogeneous WSN information sources in PA. In this work, the Mote-View WSN package [4] was used as the “client tier” application interface for control and monitoring supervision of a specific topology of wireless sensor deployment [33].

4.1 Hardware Specifications

Two types of sensors were used, namely the MEP410 for microclimate and ambient light monitoring and the MEP510 sensor, in addition to a gateway (MIP520) [33]. The sensors and the gateway were installed with a distance of 3.5 m between the nodes. Each WSN package consists of the following:
• Radio Platform (PR2400CA): the platform is based on a low-power microcontroller (Atmel ATmega128L) and runs MoteWorks [24], an end-to-end platform for the creation of wireless sensor network applications and the processing of simultaneous radio communications.
• Data Acquisition Board (MDA300CA): a compliant data acquisition board that can accommodate various types of sensors. The board contains standard environmental measurement sensors, such as ambient temperature, humidity, soil moisture, leaf wetness, and atmospheric pressure.
• Sensor Node: Each sensor node consists of a 2.4 GHz MICAz mote, an MDA300CA data acquisition board, an Irrometer soil moisture sensor, an MPX4115A atmospheric pressure sensor, and a leaf wetness sensor. The board is managed and controlled by an operating system (TinyOS). TinyOS is an open-source operating system designed for low-power wireless devices [28] and is programmed in nesC, an event-driven programming language [8].
• Sink Node: The sensor-system interface is supported by the MIB510, which provides data transfer and allows aggregation of sensor network data on different computer platforms via an RS-232 serial programming interface. Additionally, the sensor node (MICAz) acts as a base station when assembled with the MIB510 interface board [8].

4.2 Software Specifications

The Mote-View software framework is used for managing, monitoring, and visualizing sensor network deployments. The modules within the Mote-View client are conceptually viewed as a multi-tier architecture in which four layers provide dedicated services (see Fig. 4).
Fig. 4. MOTE-VIEW client-tier abstraction of sensor network data
The Data Access Abstraction Layer (data layer) comprises the PostgreSQL relational database [26] and captures various system data, such as sensor readings and information about the corresponding surroundings, as well as real-time measurement readings. The Node Status and Configuration Layer (node layer) manages all the relevant sensor configuration and receives updates upon successful events sent from the data layer. The Calibration and Conversion Layer provides calibration and data conversion services for the measured parameters by applying calibration coefficients to raw data readings. The Presentation Layer provides a Graphical User Interface (GUI) that gives users of wireless sensor networks end-to-end management and supervision of a specific sensor deployment. Each layer includes a plug-in capability for providing modular extensions. The Mote-View package consists of the following modules:
• MoteViews: software used for collecting data from the nodes in different fields.
• XSniffer: monitors the 802.15.4 wireless network packets.
• MoteConfig: a Windows-based GUI utility for programming Motes. This utility provides an interface for configuring and downloading pre-compiled XMesh/TinyOS firmware applications onto Motes.
• Cygwin: a command-line interface.

4.3 Experiment Environment Setting

The proposed architecture was tested under experimental conditions and was meant as a proof of concept. Two rows of shallow trenches were dug one meter apart, and seed potatoes were planted half a meter (½ m) apart; each seed location was equipped with a soil moisture sensor. The temperature sensor was installed halfway between the two rows, and soil was mounded against the plants, burying the stems halfway. A total of seven sensors (Mica2 motes) were organized in a parcel for monitoring the crop and recording the temperature and relative humidity. In order to maintain effective network connectivity and reliable communication, the sensors were installed at a height of 50 cm. It is noteworthy that during the potatoes’ flowering, the radio range was dramatically affected, and thus manual adjustments of the sensors’ heights were required. All the system settings, readings, and corresponding sensor data were captured in multiple databases using the PostgreSQL DBMS. The implementation utilizes the JADE platform [15]. JADE is a software framework geared towards developing agent applications in compliance with the FIPA specifications for multi-agent systems. JADE supports a distributed environment of agent containers and allows several agents to execute concurrently. The role of the PAA agent is to formulate the service requests, send them out to all gateway agents, collect the required information, evaluate the received data, determine the applicable decision accordingly, and notify the corresponding agent about the outcomes.
The knowledge component includes the agent’s self-model, the models of the gateway and ontology agents, and the local history. The main capabilities of the PAA agent include the communication, reasoning, and domain action components. The PAA problem-solver component incorporates the JADE behaviour classes (SimpleBehaviour and CyclicBehaviour) that fulfill tasks such as registering with the directory facilitator (DF) service and handling the various incoming messages and requests. The cyclic behaviour class equips the PAA agent with the ability to check for service results that have been sent by the gateway agents. The communication component builds on the existing jade.core.Agent and jade.lang.acl.ACLMessage classes of the JADE platform. These classes enable constructing ACL messages by utilizing the various FIPA performatives such as REQUEST, INFORM, INFORM-DONE, and QUERY-IF. The messages exchanged in the interaction contain a set of one or more message elements. The elements vary according to the coordination scenario; however, the only mandatory element in all ACL messages is the performative. A message might additionally include the sender name or signature, the receiver, and possibly a content. Agents are able to set and get the content of a particular message through the setContent and getContent methods.
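The message structure described above (a mandatory performative with optional sender, receivers, and content) can be illustrated with a small Python model. This is a language-neutral sketch and deliberately not JADE code: `AclMessageSketch` and `reply_to` are hypothetical names, and the allowed set lists only the performatives named in the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Performatives named in the text; the FIPA specifications define more.
PERFORMATIVES = {"REQUEST", "INFORM", "INFORM-DONE", "QUERY-IF"}


@dataclass(frozen=True)
class AclMessageSketch:
    performative: str                 # the only mandatory element
    sender: Optional[str] = None
    receivers: Tuple[str, ...] = ()
    content: Optional[str] = None

    def __post_init__(self) -> None:
        if self.performative not in PERFORMATIVES:
            raise ValueError(f"unknown performative: {self.performative}")


def reply_to(request: AclMessageSketch, content: str) -> AclMessageSketch:
    """Build an INFORM reply addressed to the sender of the request."""
    if request.sender is None:
        raise ValueError("cannot reply to a message without a sender")
    responder = request.receivers[0] if request.receivers else None
    return AclMessageSketch(
        performative="INFORM",
        sender=responder,
        receivers=(request.sender,),
        content=content,
    )
```

In this sketch, a PAA request such as `AclMessageSketch("REQUEST", sender="PAA", receivers=("GA1",), content="soil-moisture")` is answered by a gateway agent with `reply_to(request, "23.5")`, which swaps the sender and receiver roles and carries the reading as content.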
The coordination component uses the FipaContractNetBehaviour for the interaction and negotiation patterns. It follows the soliciting approach of the contract-net protocol, which depends on (1) the modeling of other agents’ capabilities and (2) the solicitation of the local schedule and workload of other agents. To capture the PA ontology and to represent the domain knowledge, the Protégé [27] extensible framework is used in this work. The use of Protégé permits exporting the ontology in various formats such as OWL (Web Ontology Language) and RDF (Resource Description Framework). The JADE abstract ontology is implemented using the SimpleJADEAbstractOntology. For frame-based ontologies, the SimpleJADEAbstractOntology is inherited, and for OWL ontologies, the OWLSimpleJADEAbstractOntology package is imported.
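A contract-net round that relies on modeled capabilities and solicited workload can be sketched in Python as below. This is a simplified illustration and not JADE's contract-net implementation: the `Contractor` type, the bid formula (current workload plus one), and the alphabetical tie-break are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class Contractor:
    """Hypothetical gateway agent taking part in a contract-net round."""
    name: str
    capabilities: FrozenSet[str]  # tasks this agent is able to serve
    workload: int = 0             # number of tasks already accepted

    def propose(self, task: str) -> Optional[int]:
        """Return a bid (lower is better), or None to refuse the task."""
        if task not in self.capabilities:
            return None
        return self.workload + 1


def contract_net_round(task: str, contractors: List[Contractor]) -> Optional[str]:
    """Call for proposals, pick the cheapest bid, and award the task."""
    proposals = []
    for contractor in contractors:
        bid = contractor.propose(task)
        if bid is not None:
            proposals.append((bid, contractor.name, contractor))
    if not proposals:
        return None  # every contractor refused
    _, _, winner = min(proposals, key=lambda p: (p[0], p[1]))
    winner.workload += 1  # the awarded task now counts towards its workload
    return winner.name
```

Because each award raises the winner's workload, repeated rounds for the same task tend to spread requests across the capable agents rather than always selecting the same one.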
5 Conclusions

Precision agriculture is a new paradigm geared to accommodate the increasing demand for food and nutrients using state-of-the-art technologies. These technologies have emerged as a proper choice not only to make agricultural activities simpler, but also to make it cheaper to collect and apply data and to adapt to changing environmental conditions while using resources most efficiently. The proposed agent-based monitoring and control architecture promotes an innovative model that can protect field crops and manage and control real-time agricultural activities. The timely decisions exhibited by the model are very desirable fundamental directives for optimizing the production activities of precision agriculture users. The architecture depicts a promising trend in moving from labor-intensive and time-consuming work to an automated, real-time, and cost-effective monitoring, control, and decision-making architecture that can enhance the PA prospects and objectives. By utilizing the Agent-Oriented paradigm, the Precision Agriculture domain environment is modelled at a high level of abstraction and viewed collectively as a coherent universe of interacting and collaborative agents that provide a high degree of decentralization of capabilities, which is the key to system scalability and extensibility. The ontology model specifies formal schemata and an intelligent view for developing and efficiently representing agricultural knowledge across various agriculture information sources. It is noteworthy that the physical parameters that need to be observed and monitored can be captured by introducing relevant sensor types, as long as the attached sensors are registered and initiated with the Gateway Agent (GA). Furthermore, the model can accommodate control-based sensors for actuators and motors, which can be remotely switched on and off to perform automatic activities such as irrigation and pesticide spraying.
The utilization of the WSN technology and the proposed model was realized and tested under lab conditions where physical parameters (such as soil temperature and soil moisture) can be easily captured and transmitted for further decision making.
References
1. Blackmore, S., Stout, B., Wang, M., Runov, B.: Robotic agriculture—the future of agricultural mechanisation. In: Stafford, J.V. (ed.) Proceedings of the 5th European Conference on Precision Agriculture, pp. 621–628 (2005)
2. Bonfante, A., et al.: LCIS DSS–an irrigation supporting system for water use efficiency improvement in precision agriculture: a maize case study. Agric. Syst. 176 (2019). https://doi.org/10.1016/j.agsy.2019.102646
3. Cross, P.: Some very high precision applications of Global Navigation Satellite Systems. In: IEE Seminar on New Developments and Opportunities in Global Navigation Satellite Systems (2005). https://doi.org/10.1049/ic:20050561
4. Crossbows. https://www.cabelas.com/category/Crossbows/103854780.uts. Accessed 10 Mar 2018
5. Damas, M., Prados, A., Gómez, F., Olivares, G.: HidroBus system: fieldbus for integrated management of extensive areas of irrigated land. Microprocess. Microsyst. 25(3), 177–184 (2001). https://doi.org/10.1016/s0141-9331(01)00110-7
6. Dammer, K.H.: Variable rate application of fungicides. In: Oerke, E.C., Gerhards, R., Menz, G., Sikora, R. (eds.) Precision Crop Protection - The Challenge and Use of Heterogeneity, pp. 351–361. Springer, Dordrecht (2010). https://doi.org/10.1007/978-90-481-9277-9_22
7. Ferrández-Pastor, F.J., García-Chamizo, J.M., Nieto-Hidalgo, M., Mora-Martínez, J.: Precision agriculture design method using a distributed computing architecture on internet of things context. Sensors (Basel, Switzerland) 18(6), 1731 (2018). https://doi.org/10.3390/s18061731
8. Gay, D., Levis, P., Behren, R.V., Welsh, M., Brewer, E., Culler, D.: The nesC language. ACM SIGPLAN Not. 49(4), 41–51 (2014). https://doi.org/10.1145/2641638.2641652
9. Gersmehl, P.: GIS applications in agriculture (GIS applications in agriculture series) - Edited by Francis J Pierce and David Clay. Trans. GIS 12(2), 285–286 (2008). https://doi.org/10.1111/j.1467-9671.2008.01099.x
10. Ghenniwa, H., Mohamed, K.: Interaction devices for coordinating cooperative distributed systems. Intell. Autom. Soft Comput. 6(3), 173–184 (2000). https://doi.org/10.1080/10798587.2000.10642786
11. Goumopoulos, C.: An autonomous wireless sensor/actuator network for precision irrigation in greenhouses. In: Mukhopadhyay, S. (eds.) Smart Sensing Technology for Agriculture and Environmental Monitoring. LNEE, vol. 146, pp. 1–20. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-27638-5_1
12. Goumopoulos, C., Kameas, A., Cassells, A.: An ontology-driven system architecture for precision agriculture applications. Int. J. Metadata Semant. Ontol. 4(1/2), 72 (2009). https://doi.org/10.1504/ijmso.2009.026256
13. He, J., Wang, J., He, D., Dong, J., Wang, Y.: The design and implementation of an integrated optimal fertilization decision support system. Math. Comput. Modell. 54(3–4), 1167–1174 (2011). https://doi.org/10.1016/j.mcm.2010.11.050
14. https://www.nap.edu/catalog/11759/contributions-of-land-remote-sensing-for-decisions-about-food-security-and-human-health
15. Jade Site | Java Agent DEvelopment Framework (n.d.). http://jade.tilab.com/
16. Kassim, M.R., Mat, I., Harun, A.N.: Wireless sensor network in precision agriculture application. In: 2014 International Conference on Computer, Information and Telecommunication Systems (CITS) (2014). https://doi.org/10.1109/cits.2014.6878963
17. Kim, Y., Evans, R.G.: Software design for wireless sensor-based site-specific irrigation. Comput. Electron. Agric. 66(2), 159–165 (2009). https://doi.org/10.1016/j.compag.2009.01.007
572
A. Wahaishi and R. Aburukba
18. Koshy, S., Nagaraju, Y., Palli, S., Prasad, Y.G., Pola, N.: Wireless sensor network based forewarning models for pests and diseases in agriculture – a case study on groundnut. Int. J. Adv. Res. Technol. 3(1) (2014). http://www.ijoart.org/docs/Wireless-Sensor-Network-basedForewarning-Models.pdf 19. Liakos, K.G., Busato, P., Moshou, D., Pearson, S., Bochtis, D.: Machine learning in agriculture: a review. Sensors 18(8) (2018). https://doi.org/10.3390/s18082674 20. Lopez-Riquelme, J.A., Pavon-Pulido, N., Navarro-Hellin, H., Soto-Valles, F., TorresSanchez, R.: A software architecture based on FIWARE cloud for precision agriculture. Agric. Water Manag. 183, 123–135 (2017). https://doi.org/10.1016/j.agwat.2016.10.020 21. Masaud-Wahaishi, A., Ghenniwa, H.: Privacy based information brokering for cooperative distributed e-health systems. J. Emerg. Technol. Web Intell. 1(2) (2009). https://doi.org/10. 4304/jetwi.1.2.161-171 22. Masaud-Wahaishi, A., Ghenniwa, H.: Privacy-based multiagent brokering architecture for ubiquitous healthcare systems. Ubiquit. Health Med. Inform. 296–328 (2010). https://doi. org/10.4018/978-1-61520-777-0.ch015. 23. Morais, R., Valente, A., Serôdio, C.: A Wireless Sensor Network for Smart Irrigation and Environmental Monitoring: A Position Article (2005). https://www.semanticscholar.org/ paper/A-Wireless-Sensor-Network-for-Smart-Irrigation-and-Morais-Valente/2dda57ec30e0 faa0e2b7d9e02b9fe03830a603a6 24. MoteWork Software Platform. http://www.memsic.com/userfiles/files/User-Manuals/mot eworks-getting-started-guide.pdf. Accessed 22 Jan 2018 25. National Research Council: Contributions of Land Remote Sensing for Decisions About Food Security and Human Health: Workshop Report (2007). https://doi.org/10.17226/11759 26. PostgreSQL Administration: Beginning Databases with PostgreSQL, pp. 309–356. https:// doi.org/10.1007/978-1-4302-0018-5_11. http://www.postgresql.org. Accessed July 2018 27. Protégé Ontology Editor. http://protege.stanford.edu. 
Accessed July 2018 28. Sigg, B.: YETI 2 - TinyOS 2.x eclipse plugin [Computer software] (2018). http://webs.cs.ber keley.edu/tos/ 29. Tewari, S., Leskey, T.C., Nielsen, A.L., Piñero, J.C., Rodriguez-Saona, C.R.: Use of pheromones in insect pest management, with special attention to Weevil pheromones. Integr. Pest Manage., 141–168 (2014). https://doi.org/10.1016/b978-0-12-398529-3.00010-5 30. Tripathy, A.K., Adinarayana, J., Merchant, S.N., Desai, U.B., Ninomiya, S., Hirafuji, M., Kiura, T.: Data mining and wireless sensor network for groundnut pest/disease precision protection. In: 2013 National Conference on Parallel Computing Technologies (PARCOMPTECH) (2013). https://doi.org/10.1109/parcomptech.2013.6621399 31. Vermesan, O., Friess, P.: Digitising the industry - internet of things connecting the physical, digital and virtual worlds. Digit. Ind. Internet Things Connect. Phys. Digit. Virt. Worlds, 1–364 (2016). https://doi.org/10.13052/rp-9788793379824 32. Wigboldus, S., Klerkx, L., Leeuwis, C., Schut, M., Muilerman, S., Jochemsen, H.: Systemic perspectives on scaling agricultural innovations. a review. Agron. Sustain. Dev. 36(3), 1–20 (2016). https://doi.org/10.1007/s13593-016-0380-z 33. Willow Technologies. https://www.willow.co.uk/MEP410CA_Datasheet.pdf. Accessed 15 Jan 2018 34. Wolfert, S., Ge, L., Verdouw, C., Bogaardt, M.: Big data in smart farming – a review. Agric. Syst. 153, 69–80 (2017). 01.023 35. Xu, G.: GPS: Theory, Algorithms and Applications, 2nd edn., p. 340. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-72715-6
Autonomy Reconsidered: Towards Developing Multi-agent Systems

Michael A. Goodrich1, Julie A. Adams2, and Matthias Scheutz3

1 Brigham Young University, Provo, UT 84602, USA
2 Oregon State University, Corvallis, OR 97331, USA
3 Tufts University, Medford, MA 02155, USA
Abstract. An agent’s autonomy can be viewed as the set of physically and computationally grounded algorithms that can be performed by the agent. This view leads to two useful notions related to autonomy: behavior potential and success potential, which can be used to measure how well an agent fulfills its potential, called fulfillment. Fulfillment and success potential induce partial and total orderings of possible agent algorithms, leading to algorithm-based, capability-centered definitions of levels of autonomy that complement common uses of this phrase. Because the success potential of a multi-agent system can exceed the success potentials of individual agents through synergy effects, the fulfillment of an individual can be augmented through interactions with others, though it can possibly also interfere with the fulfillment of the other agents. Interaction algorithms thus enable multiple agents to coordinate, communicate, or exchange information; these algorithms enable and constrain tradeoffs between augmenting and diminishing other agents. Short case studies are presented to illustrate how the algorithm-based definitions can be used to understand existing systems.
Keywords: Levels of autonomy · Interaction algorithms · Multi agent

1 Introduction
Rapid developments in perception, control, planning, manipulation and navigation enable increasingly advanced robotic systems capable of accomplishing complex tasks autonomously, such as urban driving, traversing rough terrain, or assembling non-trivial products. What does it mean exactly for a system to be autonomous and how may that help us to develop increasingly effective and robust systems? Over the last thirty years many definitions of “autonomy” have explicated what autonomy may mean when applied to artificial systems. Some of these definitions are more detailed and emphasize formally precise conditions, while others provide psychologically and philosophically motivated schemas related to self-governance.

One frequently encountered dichotomy is between “automation” and “autonomy”, with “automation” roughly referring to fixed action patterns machines execute without human intervention, regardless of whether the actions achieve the desired effects: “Automation refers to the full or partial replacement of a function previously carried out by a human operator” [38]. A toaster, for example, may ignore how the required toasting time depends on bread type, thus applying the same duration to all bread types, regardless of how easily they may burn. “Autonomy” is viewed as a system’s ability to consider environmental state changes and act upon them (e.g., sensing the correct toasting level, rather than using a fixed time period). For some, an autonomous robot can follow orders, but those orders may leave open exactly what steps are necessary to achieve the task (e.g., [15]). For others, autonomy represents “[an] agent’s active use of its capabilities to pursue its goals, without intervention by any other agent in the decision-making processes used to determine how those goals should be pursued” [4].

Other approaches view autonomy on a scale (e.g., “sliding autonomy”, “levels of autonomy”, “adjustable autonomy”), not as a binary notion. Systems can have degrees of autonomy based on the current context. For example, a clothes dryer with a moisture sensor can adapt the heat levels by sensing dryness, but can be forced by the human to apply a fixed heat level, thus reducing the machine’s ability to control the heat adaptively. Similarly, an airplane’s autopilot attempts to maintain a designated glide path until it can no longer guarantee the path due to, say, bad weather, and then disengages.

Finally, other definitions stress an agent’s sensing and actions in an environment and the agent’s ability to realize its goals. For example: “Autonomous agents are computational systems that inhabit some complex dynamic environment, sense and act autonomously in this environment, and by doing so realize a set of goals or tasks for which they are designed” [24].

While these approaches contain essential elements, particularly the notions of sensing and acting in a dynamic environment in the interest of goals, they lack the precision to capture the important interactions among goals, algorithms, and an agent’s physical aspects. Most importantly, notions of task, goals, and success require definition in order to evaluate an agent’s performance.

The paper’s primary contributions are (a) algorithm-based definitions of behavior potential, success potential, and fulfillment for an individual agent, (b) an argument that interacting multi-agent systems are potentially more powerful than a single autonomous agent, with precise definitions of how interaction algorithms determine synergy, interference, and augmented capability, and (c) short examples that illustrate the utility of the definitions.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 573–592, 2022. https://doi.org/10.1007/978-3-030-82193-7_38
2 Related Literature
Beer et al. provide an overview of the notion of autonomy from multiple fields, including philosophy, psychology, and robotics [5]. A common theme is defining a robot’s capability in the context of a team’s capability, namely a human-robot team. For example, Harbers, Peeters, and Neerincx use an operational definition that implies three specific qualities associated with autonomy: “the time interval of interaction, the obedience of the robot and the informativeness of the robot” [23]. These qualities include (1) the robot’s capability and (2) the degree of the robot’s reliance on the relationship between the robot and a human partner. The capability-relationship pair for human-robot and human-agent teams is a pattern in many autonomy definitions. Hexmoor emphasizes the pattern by suggesting that autonomy is “a social notion”; therefore, a robot’s autonomy is best defined by the interactions between the robot and some other entity [24]. He writes:

[A]utonomy concerns are predominantly for the agent to acquire and to adapt to human preferences and guidance ... The word ‘autonomous’ connotes ... a sense of the agent’s autonomy from the human. A device is autonomous when [it] faithfully carries preferences and performs actions accordingly.
Naturally, others have written about independence from and interdependence between agents. Newell writes [37, p. 20]:

One aspect of autonomy is greater capability to be free of dependencies on the environment ... [but] much that we have learned ... speaks to the dependence of individuals upon the communities in which they are raised and reside.
For example, Dorais says that an autonomous robot can follow orders, but those orders may leave open exactly what steps are necessary to achieve the task [15]. Hexmoor’s social notion provides insight into Sheridan’s levels of autonomy [38,47]. Specifically, Sheridan’s levels are not explicitly based on a robot’s capabilities in the way that Harbers, Peeters, and Neerincx define autonomy. Sheridan’s levels implicitly assume a level of capability and explicitly specify properties of the relationship: who has the responsibility for initiating, terminating, or intervening in the behavior induced by the algorithm(s).

Other approaches view autonomy on a scale (e.g., “sliding autonomy”, “levels of autonomy”, or “adjustable autonomy”), not as a binary notion [29,47]. There are many autonomy scale variations, and most imply that autonomy is primarily a social notion [7,14,16,18,21,27,28,31,35]. Most variants require “social contract” algorithms that enable a human, a robot, or both to (re)assign responsibility/authority for initiating, executing, and terminating functions, information exchanges, and tasks [21,33]. Dialogues, safeguarding, and shared control are means of designing algorithmic social contracts so that team capability is maximized [17,18]. For example, shared control seeks to design algorithms that directly support the human-robot team [13,36,44]. Naturally, the scope of interaction algorithms can be very large, especially for large multi-agent systems [9,11,19,30].

Social contract algorithms may augment some agents and interfere with others. Shell and Matarić [46] identify one interference type: “Traditional homogeneous foraging has each robot searching for pucks and independently transporting them to the home region ... [A]round the home region; many robots will attempt to enter the same space ... [so] additional robots may hamper the collective effort.” Algorithms have been written to mitigate spatio-temporal interference [20]. Sensing interference can also occur [9].

Johnson is critical of contemporary thinking on autonomy [26] and proposes “coactive design,” which develops capabilities and algorithms enabling humans and robots to interact well by supporting a form of mutual interdependence. Coactive design includes a specific approach for constructing the social contract so that it explicitly maximizes team capacity. Beer, Fisk, and Rogers propose an approach grounded in function allocation: identify tasks to be performed, determine what task components a robot will perform and is capable of performing, and create a means for a human to influence the robot [5]. Riley proposes a function allocation method that specifies general categories for the types of tasks, information exchanges, and required human-automation interactions [43]. Johnson’s, Beer et al.’s, and Riley’s approaches directly support systematic design of the social contract algorithm.

Others disavow the social notion of autonomy, emphasizing that autonomy represents “[an] agent’s active use of its capabilities to pursue its goals, without intervention by any other agent in the decision-making processes used to determine how those goals should be pursued” [4]. Such definitions emphasize an independence from human input, stressing a robot’s sensing and actions in an environment subject to the robot’s ability to realize its goals. For example: “Autonomous agents are computational systems that inhabit some complex dynamic environment, sense and act autonomously in this environment, and by doing so realize a set of goals or tasks for which they are designed” [24]. Bradshaw et al. [6] emphasize capability, independent of a social context.
They posit two properties essential for an autonomous system: “self-sufficiency, the capability of an entity to take care of itself” and “self-directedness, or freedom from outside control”. Similar properties of autonomy appear in non-robotics research (e.g., levels of autonomy for nurse practitioners [10]). Bradshaw’s two elements imply that a robot must be able to perform some set of tasks, while also initiating, terminating, and modifying what tasks it performs and how those tasks are performed. Huang et al. [25] similarly state that goals (not a social contract) will govern how capabilities are used: “[A robot’s] autonomy [is defined] as its own capability to achieve its mission goals.” Beer, Fisk, and Rogers also emphasize self-directedness.

Robot capability, self-sufficiency, and self-directedness must ultimately be implemented as algorithms. Maes [34] presented autonomy as a computational system. Hexmoor’s characterization of Maes’ work makes the computational system explicit [24]:

Autonomous agents are computational systems that inhabit some complex dynamic environment, sense and act autonomously ... realize a set of goals or tasks for which they are designed.
3 Behavior, Success, and Autonomy
A general definition for task environment grounds the discussion. Let $E = \langle S, I, G, F, \tau \rangle$ be a (task) environment specification where S, an environment, is a set of possible states (e.g., a manifold), I ⊂ S is a set of initial states, G ⊂ S is a set of goal states, and F is the evolution function defined on S over τ, where τ is a time bound. Environments are defined as sets of states, to remain as general as possible, while not committing to a particular notion of state or formalism in order to capture many possible environmental states (e.g., a set of differential equations or a Markov decision process) and their relations (e.g., which state is accessible from a given state or whether state transitions are deterministic or stochastic). When needed, one can specify the meaning of “state” (e.g., a six-dimensional kinematic vector, or a set of true propositions at a given point in time) and how exactly states evolve over time (e.g., differential equations, maps, transition functions), including whether the set of time points is discrete or continuous.

The Thermostat as an Example. Consider an example of maintaining a room’s temperature at or near a desired temperature, denoted by θ. There are two relevant states: S = {(T < θ), (T ≥ θ)}, where T is the room’s temperature. Initial states I can be any room temperature, say I = [−30, 30] °C, and goal states are determined by, for example, the goal to “keep the room cool”, G = {T ≤ θ}. An evolution function depends on temperatures outside of the building, the presence of a heating unit, and the presence of an air conditioning (cooling) unit,

\[
F:\quad T_{t+\Delta t} =
\begin{cases}
T_t + \varepsilon & \text{if heater on},\\
T_t - \varepsilon & \text{if air conditioner on},\\
T_t + \delta\,(T_{\mathrm{outside}} - T_t) & \text{otherwise},
\end{cases}
\]

where Δt denotes a small time step, ε and δ are small positive constants, and $T_{\mathrm{outside}}$ denotes the outside air temperature. Finally, τ is some deadline to reach the desired temperature, say τ = 20 min.
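The evolution function for the thermostat environment can be written down directly. The following is a minimal sketch, not the authors' implementation: the class name `ThermostatEnv` and every numeric constant except the 20 min time bound are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThermostatEnv:
    """Hypothetical container for the thermostat task environment."""
    theta: float = 22.0      # desired temperature in deg C (assumed value)
    eps: float = 0.5         # epsilon: change per step while a unit runs (assumed)
    delta: float = 0.1       # delta: coupling to the outside temperature (assumed)
    t_outside: float = 30.0  # outside air temperature in deg C (assumed)
    tau_min: float = 20.0    # time bound tau in minutes, as in the text

    def step(self, temp: float, heater_on: bool = False, ac_on: bool = False) -> float:
        """Apply the evolution function F once: T_t -> T_{t + dt}."""
        if heater_on:
            return temp + self.eps
        if ac_on:
            return temp - self.eps
        # Neither unit is running: drift toward the outside temperature.
        return temp + self.delta * (self.t_outside - temp)

    def is_goal(self, temp: float) -> bool:
        """Goal set G = {T <= theta} ("keep the room cool")."""
        return temp <= self.theta
```

With these assumed constants, `ThermostatEnv().step(25.0, ac_on=True)` moves the room from 25.0 to 24.5, and with both units off a 20.0 degree room drifts to 21.0.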
Let $R = \langle P, A \rangle$ be a robot specification, where P, the hardware platform, includes all sensing, actuating, and computing equipment, and A is an algorithm (plus data) on P that is possibly self-modifying. The sets of sensors Sen, effectors Eff, and computational systems Comp for P are used to define an algorithm as a mapping from sensor/computational states to effector/computational states:

\[
A : S_{\mathrm{Sen}} \times S_{\mathrm{Comp}} \to S_{\mathrm{Eff}} \times S_{\mathrm{Comp}}, \tag{1}
\]

where the sensor and effector states are the transduced and non-transduced computational interface states, respectively. Computational states, $S_{\mathrm{Comp}}$, include memory, processing, databases, knowledge representations, world models, etc. This formulation permits discussion of the same algorithm on platforms with different sensors, actuators, and representation systems. Computational, sensor, and effector states are part of the environment state,

\[
S \supseteq S_{\mathrm{Sen}} \cup S_{\mathrm{Comp}} \cup S_{\mathrm{Eff}}. \tag{2}
\]
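One way to read the mapping in Eq. (1) is as a function that takes a sensor reading and the current computational state and returns an effector command together with the next computational state. The sketch below is only an illustration of that reading; the names `Algorithm`, `run`, and `count_and_switch` are hypothetical and do not come from the paper.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

Sen = TypeVar("Sen")    # a sensor state
Comp = TypeVar("Comp")  # a computational state (memory, world model, ...)
Eff = TypeVar("Eff")    # an effector state

# Eq. (1) as a type: (sensor state, comp state) -> (effector state, comp state)
Algorithm = Callable[[Sen, Comp], Tuple[Eff, Comp]]


def run(algorithm: Algorithm, readings: Iterable[Sen], comp0: Comp) -> Tuple[List[Eff], Comp]:
    """Feed a sequence of sensor readings through the algorithm,
    threading the computational state from one step to the next."""
    comp = comp0
    commands = []
    for reading in readings:
        command, comp = algorithm(reading, comp)
        commands.append(command)
    return commands, comp


def count_and_switch(too_warm: bool, times_on: int) -> Tuple[str, int]:
    """Hypothetical instance: the effector state is "on"/"off" and the
    computational state counts how many "on" commands were issued."""
    if too_warm:
        return "on", times_on + 1
    return "off", times_on
```

Running `run(count_and_switch, [True, True, False], 0)` returns the commands `["on", "on", "off"]` and a final computational state of `2`. A memoryless algorithm is the special case in which the computational state is ignored and passed through unchanged.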
We differentiate between the instance of the robot’s algorithm and the class of algorithms from which the instance is drawn. For example, the class of RRT* algorithms asymptotically approaches the optimal solution, but an instance of the RRT* algorithm requires specific parameters (i.e., neighborhood range and cost function) to generate the robot’s behaviors. Similarly, value iteration can find optimal solutions for a Markov Decision Problem (MDP), but a particular MDP solver must be instantiated on the robot. The robot’s algorithm is an instance of the algorithm class.

There must be a relationship between the robot’s algorithm, A, and the evolution function F. If no relationship exists, then the robot has no influence on the environment and autonomy does not matter. The evolution function F includes A as well as other things that influence how world states change: physics, other robots, etc. When discussing autonomy, we are interested in the trajectories in the states of environment S induced by the robot’s algorithm A.

Thermostat Example Continued. Consider a room that has only an air conditioner and no heater. If the thermostat senses that the current temperature exceeds the desired temperature, it turns on the air conditioner. The thermostat has no memory, so $S_{\mathrm{Sen}} = \{(T < \theta), (T \ge \theta)\}$, $S_{\mathrm{Comp}} = \emptyset$, and $S_{\mathrm{Eff}} = \{\mathit{on}, \mathit{off}\}$. The thermostat’s algorithm is simply

    if T ≥ θ turn air conditioner on
    else turn air conditioner off.

For temperatures above the desired value, $T_0 > \theta$, a trajectory is a trace of temperatures falling to the threshold.

3.1 Absolute Autonomy: Behavior, Success, Fulfillment
A robot’s autonomy is determined by the algorithm¹, A, implemented on platform, P. It is conceptually possible to quantify the “amount” of autonomy a robot R possesses.

¹ For simplicity of exposition, the set of programs running on a single robot is treated, collectively, as a single algorithm.

Behavior Potential. The behavior potential BP(R) of R in E is the set of all trajectories in S induced by algorithm A for some starting state s ∈ E within time bound τ. A “trajectory” is any time-ordered set of states in S determined by how the robot’s algorithm A affects the evolution function F, for any given initial state in S (e.g., flows in a dynamical system, state sequences in an MDP). The behavior potential captures all possible behaviors R can exhibit before reaching the time bound τ in any environmental state.

Success Potential. Behavior potential, BP(R), includes two important subsets, SP(R) and $SP^{I}(R)$. Let SP(R) denote the robot’s success potential, defined as the set of trajectories induced by algorithm A, starting from any s ∈ E leading through a goal state in G within τ. SP(R) captures all ways for R to succeed at its task. The size of the success potential indicates the robot’s capability, and is thus an indicator of potential robot autonomy. Let $SP^{I}(R)$ denote the robot’s initialized success potential if the initial state of the environment can be specified, defined as the set of trajectories induced by algorithm A, starting from any i ∈ I leading through a goal state in G within time bound τ. The difference between SP(R) and $SP^{I}(R)$ is important because it is easy to create initial environmental states where the algorithm will always fail. For example, start a ground robot in an environmental state where it is dropped from an airplane and the robot will fail.

Thermostat Example Continued. Recall the thermostat algorithm’s goal is to cause room temperature to be at or below a desired value starting from any initial value within a time bound τ of 20 min. Initial temperatures below the threshold will yield temperatures that are at or below threshold for all time τ (barring some unusual behavior of the evolution function, like a fire in the room). Thus, for states $S_{\mathrm{low}} = \{T \le \theta\}$ the trajectories are within SP(R). Whether initial states $S_{\mathrm{high}} = \{T > \theta\}$ produce trajectories that are within SP(R) depends on the initial temperature and the laws of thermodynamics. For (a) a time bound τ of sufficient duration, (b) T within the set of feasible states (recall that environment states S included temperatures in the range [−30, 30] °C), and (c) an air conditioner of high enough capacity, then (d) all trajectories induced by the thermostat yield success, that is, they are within SP(R).

The notion of goal state G in E can be extended when (a) multiple goal states need to be reached by composing multiple tasks or (b) where particular states need to be maintained throughout the task by modeling a subset of S that R has to maintain.
The notion of goal achievement can also be extended for stochastic environments to a probabilistic notion that requires that R end in some goal state with probability p within τ.

Fulfillment. Fulfillment is defined as

\[
\mathrm{Fulfill} = \frac{|SP(R)|}{|BP(R)|},
\]

where | · | indicates a set measure, such as cardinality. Fulfillment is a measure of a robot’s need to rely on others. Fulfillment measures the proportion of possible initial states for which R will succeed at its task for a given algorithm A in the absence of help. Suppose that fulfillment equals one. Then |SP(R)| = |BP(R)|, which means that the robot always succeeds, no matter the robot’s initial state. A high fulfillment ratio means that the robot does not need to rely on human intervention. The size of the set difference |BP(R) \ SP(R)| is a measure of how often a robot will fail if there is no control over initial conditions; the size of this set measures how much help a robot needs to accomplish its goal.

Thermostat Example Continued. When the goal is simply to keep temperature at or below a threshold, the fulfillment ratio for the thermostat equals one, since success potential equals behavior potential. The thermostat’s high fulfillment ratio provides insight into the noted paradox of the thermostat: “A thermostat exercises ... self-sufficiency and self-directedness with respect to the limited tasks it is designed to perform through the use of [a] very simple form of automation” [26]. Using this paper’s language, the thermostat is autonomous in that it does not need human input (its fulfillment is one), but not in the sense that it is capable of producing many interesting behaviors (its behavior potential is small).
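The thermostat's fulfillment ratio can be checked numerically under stated assumptions. The sketch below simulates the air-conditioner-only thermostat from one start per degree in [−30, 30] °C and counts the trajectories that pass through the goal set within the time bound. The step size (one step per minute, so 20 steps for τ = 20 min), θ = 22, the outside temperature, and both values of ε are assumptions chosen for illustration; the function names are hypothetical.

```python
def thermostat(temp, theta):
    """Memoryless rule from the text: air conditioner on iff T >= theta."""
    return temp >= theta


def trajectory(t0, theta, eps, delta, t_outside, steps):
    """Temperatures visited when the thermostat drives an AC-only room."""
    temps = [t0]
    for _ in range(steps):
        t = temps[-1]
        if thermostat(t, theta):
            t = t - eps                      # air conditioner running
        else:
            t = t + delta * (t_outside - t)  # drift toward outside temperature
        temps.append(t)
    return temps


def fulfillment(initial_temps, theta, eps, delta=0.05, t_outside=30.0, steps=20):
    """|SP(R)| / |BP(R)| with one deterministic trajectory per initial state."""
    behavior = [trajectory(t0, theta, eps, delta, t_outside, steps)
                for t0 in initial_temps]
    # A trajectory succeeds if it passes through a goal state (T <= theta).
    success = [tr for tr in behavior if any(t <= theta for t in tr)]
    return len(success) / len(behavior)


initial_temps = list(range(-30, 31))  # one start per degree in [-30, 30]
strong = fulfillment(initial_temps, theta=22.0, eps=0.5)    # assumed high-capacity unit
weak = fulfillment(initial_temps, theta=22.0, eps=0.125)    # assumed low-capacity unit
```

Under these assumed numbers the high-capacity unit gives a fulfillment of 1.0, matching the claim in the text, while the low-capacity unit cannot cool the hottest six starting temperatures in time and gives 55/61, about 0.90. This illustrates why the text conditions fulfillment of one on sufficient time and sufficient air conditioner capacity.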
Fig. 1. Low fulfillment. (ρ = 0.4, γ = 0.9, τ = 40)
A notion similar to fulfillment appears in the literature on reliability and human error in systems². “Operator error probability is defined as the number of errors made ... divided by the number of opportunities for such errors” [40]. Fulfillment emphasizes successful goal-achievement instead of errors.

² Thanks to Karina Roundtree for pointing out the connection between operator error and fulfillment.

Fulfillment in a Markov Decision Process. Behavior potential, success potential, and fulfillment can be applied to a simple MDP. Consider the grid world shown in Figs. 1 and 2. The world states are locations on the grid, S = {(x, y) : x, y ∈ {0, 1, . . . , 16}}, the initial state (lower left) is I = {(0, 0)}, and the goal state (upper right) is G = {(16, 16)}. The evolution function is a transition probability $p(s' \mid s, a)$ where $s'$ is the next state, $s$ is the current state, and $a$ is the action specified by the algorithm. The algorithm is a policy designed to optimize expected discounted reward for some reward structure R(s, a) and some discount factor γ. The policy maps a sensed state to an action. Thus, the policy implements the definition of an algorithm $A : S_{\mathrm{Sen}} \times S_{\mathrm{Comp}} \to S_{\mathrm{Eff}} \times S_{\mathrm{Comp}}$ as π : S → A. States are $S_{\mathrm{Sen}} = S$, that is, the robot can perfectly sense the world; effectors are $S_{\mathrm{Eff}} = A$, that is, the effector states are the sets of actions that the robot can take; and computation resources, $S_{\mathrm{Comp}}$, is the data structure in which the policy is stored.

For concreteness, the following hold: (a) The robot’s actions are the cardinal directions, A = {N, S, E, W}. (b) The agent moves in the direction it intends (it goes N when a = N) with probability parameter ρ and moves in each of the other three directions with probability $(1-\rho)/3$. (c) The agent remains in the same position and receives a reward of r = −1 when it moves toward a wall. (d) The agent receives a reward of R = 2 when it reaches the goal. Instances of the optimal policy, π, were computed using value iteration. Given a policy, 50 trajectories were computed from the initial condition, generating a sample of the behavior potential.

Figure 1 shows the behavior potential for a challenging set of conditions, ρ = 0.4, meaning that the robot goes in an unintended direction (1 − ρ) = 60% of the time. Each dot in a cell represents a visit from the robot in one of the trials. The discount factor was set to γ = 0.9. The optimal policy for the cells around cell (2, 2) points back to that cell. Essentially, the robot has learned that going through the narrow passageways to the left and below the irregular wall risks a likely collision with a wall, so the “pull” of the goal reward is insufficient to draw the robot through the passageways. For this example, no trajectories reach the goal within τ = 40 time steps, so the success potential is empty. Thus, $\mathrm{Fulfill} = \frac{0}{50} = 0$.

Fig. 2. High fulfillment. (ρ = 0.9, γ = 0.999, τ = 60)

Figure 2 shows the behavior potential for a policy instance generated from value iteration using the parameters ρ = 0.9, γ = 0.999, and τ = 60 time steps. The robot moves in the intended direction often, more time is given to complete the task, and the discount factor is high enough to draw the agent to the goal through the narrow passageways. For this world, $\mathrm{Fulfill} = \frac{50}{50} = 1$.
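The sampling procedure described above (compute a policy by value iteration, roll out 50 trajectories, and take the fraction that reach the goal within τ) can be sketched as follows. This is a hedged illustration rather than a reproduction of the paper's experiment: the interior wall layout of Figs. 1 and 2 is not given in the text, so the sketch assumes an open grid whose only walls are its boundary, treats the goal as absorbing, and uses hypothetical function names throughout.

```python
import random

ACTIONS = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


def outcomes(state, action, rho, size):
    """Yield (probability, next_state, reward) for one action in one state."""
    goal = (size - 1, size - 1)
    for name, (dx, dy) in ACTIONS.items():
        # Intended direction with probability rho, each other one (1 - rho) / 3.
        prob = rho if name == action else (1.0 - rho) / 3.0
        nxt = (state[0] + dx, state[1] + dy)
        if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
            yield prob, state, -1.0  # moved toward the boundary wall: stay put
        elif nxt == goal:
            yield prob, nxt, 2.0     # goal reward from the text
        else:
            yield prob, nxt, 0.0


def value_iteration(rho, gamma, size, tol=1e-9):
    """Return a greedy policy {state: action}; requires gamma < 1."""
    goal = (size - 1, size - 1)
    states = [(x, y) for x in range(size) for y in range(size)]
    value = {s: 0.0 for s in states}

    def q(s, a):
        return sum(p * (r + gamma * value[n]) for p, n, r in outcomes(s, a, rho, size))

    while True:
        change = 0.0
        for s in states:
            if s == goal:
                continue  # goal treated as absorbing (assumption)
            best = max(q(s, a) for a in ACTIONS)
            change = max(change, abs(best - value[s]))
            value[s] = best
        if change < tol:
            break
    return {s: max(ACTIONS, key=lambda a: q(s, a)) for s in states if s != goal}


def estimate_fulfillment(policy, rho, size, tau, trials=50, seed=0):
    """Fraction of sampled trajectories from (0, 0) that reach the goal within tau steps."""
    rng = random.Random(seed)
    goal = (size - 1, size - 1)
    successes = 0
    for _trial in range(trials):
        state = (0, 0)
        for _step in range(tau):
            options = list(outcomes(state, policy[state], rho, size))
            weights = [p for p, _, _ in options]
            state = rng.choices(options, weights=weights)[0][1]
            if state == goal:
                successes += 1
                break
    return successes / trials
```

For instance, `estimate_fulfillment(value_iteration(0.9, 0.9, 17), 0.9, 17, tau=60)` gives a sampled estimate for an open 17×17 grid. The result depends on the seed and will not match the values reported for Figs. 1 and 2, because the paper's wall layout is not reproduced here. Discount factors close to one, such as γ = 0.999, make this plain-Python value iteration converge slowly.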
Figure 3 illustrates fulfillment for various policy instances created from various parameters, yielding three observations. First, when more time is allowed for finding a goal, fulfillment increases. Comparing the left and right figures reveals that, for the same values of ρ and γ, fulfillment is higher when τ is greater. Second, even though each policy is optimal for the given discount factor and reward structure, not all algorithm instances have the same fulfillment. In particular, when both γ and ρ are low, the optimal policy tries to stay in the relatively open area in the bottom left, rather than pass through the narrow passageways. Third, fulfillment depends on both the algorithm through γ and the environment through ρ.

Fig. 3. Fulfillment for τ = 40 and τ = 60, respectively.

Infinite Sets. These examples count the size of finite sets. Infinite sets need a set-theoretic measure of set size. Probability measures can be used even if one does not adopt the common perspective that probability represents precisely the frequency of an event, because probability measures are special cases of more general set-theoretic measures. Future work will demonstrate this claim.

Self-directedness. Some argue that self-directedness is essential for autonomy. The Church-Turing thesis implies that a self-directed agent needs an algorithm or algorithms to select goals, to process knowledge, and to select actions. If self-directedness must be encoded as an algorithm, then success potentials, behavior potentials, and fulfillment apply to that algorithm.

Rationality – An Aside. Behavior potential, success potential, and fulfillment are agnostic about whether the algorithm is optimal or rational with respect to some standard. The process by which the algorithm was created is not specified. Because the definitions are agnostic, they complement frameworks that identify optimal algorithms for specific problems. Gerkey and Matarić’s taxonomy of independent tasks that can be solved by multi-robot teams [19] is grounded in optimization.
The known time and space complexities of algorithms that compute optimal solutions can be used to bound minimum required time budgets τ and what memory resources are required, respectively. Furthermore, knowing the
payoff of an optimal solution is useful in trading off the utility of approximate solutions against their fulfillment. Similarly, the definitions allow for algorithms that are rational with respect to Newell’s standard, where he argues that rationality requires that an agent pursue a course of action compatible with its goals using knowledge available to the agent [37]. Newell’s notion is related to self-directedness, in that a self-directed agent must select goals and pursue those goals using available knowledge. Being agnostic about how the algorithm is computed may seem to allow irrational agents, and indeed it does. But measuring the fulfillment of irrational agents and comparing it against the fulfillment of rational agents allows a comparison of their relative autonomy.
3.2 Relative Autonomy: Levels, Asymmetries, Deficiencies
Given two robots, R1 and R2, there are multiple partial or complete orders that can be identified by comparing SP(R1) to SP(R2) and SP I(R1) to SP I(R2). Intuitively, systems with lower autonomy (in terms of the subset relation) will be able to reach goal states in fewer cases (i.e., from useful initial states) and vice versa. Recall that capability and non-reliance on others are attributes of autonomy. For the capability attribute, a reasonable definition for levels of autonomy (LOA) is: $LOA(R_1) > LOA(R_2)$ iff $SP(R_1) \supset SP(R_2)$. The LOA is not defined by comparing the fulfillment ratios because fulfillment indicates the potential need for external input or intervention when behaviors cannot be guaranteed to reach the goal. LOAs indicate the relative capability of reaching a goal. We discuss fulfillment in the next section.
MDP Example Continued. Multiple optimal policies were computed for the MDP example. Consider three policies:
– R1’s policy is computed for (γ = 0.999, ρ = 0.9),
– R2’s policy is computed for (γ = 0.9, ρ = 0.7), and
– R3’s policy is computed for (γ = 0.9, ρ = 0.4).
We ran 50 trials with those policies in a challenging world (ρ = 0.4). For each trial, each optimal policy was run using the same seed for the random number generator, with different seeds across trials, which approximates running the algorithms under the same conditions. Figure 4 illustrates the results for τ = 50 time steps. The red ×’s indicate trials where all algorithms failed to reach the goal. The blue markers indicate the two trials where R3 reached the goal; one success occurred in a trial where both R1 and R2 succeeded, and one occurred where both R1 and R2 failed. The green ◦’s represent trials where both R1 and R2 reached the goal. The cyan *’s represent trials where R1 reached the goal and R2 did not. Consider the pair R1 and R2. Because SP(R2), enclosed in the green circle, is a proper subset of SP(R1), R1 has a higher LOA than R2. Now, consider
Fig. 4. Success potentials for the MDP problem with different algorithms. The clustering is notional, meaning that it does not represent any environment condition, but is organized to make the sets easy to visualize.
the pair R1 and R3 . What is the relationship between their LOAs? Fulfillment for R1 is much greater than fulfillment for R3 , but the success potential for R1 is not a superset of the success potential for R3 . This means that R1 does not have a higher level of autonomy than R3 , which may seem counter-intuitive. Fortunately, differences in success potentials for different robots can be exploited to maximize group potential.
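The subset-based ordering can be made concrete with a minimal sketch. It assumes each success potential is represented as a set of trial identifiers; the trial numbers below are hypothetical and are not the data behind Fig. 4, and the names `compare_loa` and `success_fraction` are illustrative only. The success fraction is used here as a rough stand-in for fulfillment.

```python
def compare_loa(sp_a, sp_b):
    """Compare two success potentials using the proper-superset rule.

    Returns '>' if sp_a strictly contains sp_b, '<' for the reverse,
    '=' if the sets are equal, and 'incomparable' otherwise.
    """
    if sp_a == sp_b:
        return "="
    if sp_a > sp_b:  # Python set operator: proper superset
        return ">"
    if sp_a < sp_b:  # Python set operator: proper subset
        return "<"
    return "incomparable"


def success_fraction(successes, trials):
    """Fraction of trials that reached the goal (rough proxy for fulfillment)."""
    return len(successes) / len(trials)


# Hypothetical trial identifiers, chosen only to mirror the qualitative
# pattern described in the text (assumption, not the paper's data).
trials = set(range(1, 11))
sp_r1 = {1, 2, 3, 4, 5, 6}
sp_r2 = {1, 2, 3}
sp_r3 = {1, 9}  # succeeds once where R1 fails

print(compare_loa(sp_r1, sp_r2))  # R2's successes are a proper subset of R1's
print(compare_loa(sp_r1, sp_r3))  # neither set contains the other
print(success_fraction(sp_r1, trials), success_fraction(sp_r3, trials))
```

In this toy data R1 succeeds far more often than R3, yet the two are incomparable under the subset relation, which is the situation described for R1 and R3.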
4 Multi-agent Systems
Without interference and in the presence of an effective interaction algorithm, the success potential of a group of robots will be at least as high as the union of the success potentials of the individuals,
$$SP(\{R_1, R_2, \ldots, R_n\}) \supseteq \bigcup_{i=1}^{n} SP(R_i). \tag{3}$$
Similarly, the behavior potential of a group should be at least as high as the union of individuals, again in the absence of interference,
$$BP(\{R_1, R_2, \ldots, R_n\}) \supseteq \bigcup_{i=1}^{n} BP(R_i). \tag{4}$$
4.1 Group Potential: Synergy and Interference
Whether or not the relationships in Eqs. (3)–(4) hold is subtle for various reasons. First, the number of potential platforms in a team is more than the sum of the individuals. Combined team members can form new platforms (e.g., by connecting [39,49]) or virtual platforms (e.g., formations [32,41]). If n robots form the team, then there are $2^n$ robot combinations of new or virtual platforms. Second, additional computing resources allow more algorithms. Increased resources increase the number of problems that can be solved (constrained by communications).
Third, entirely new trajectories can be created. For example, trajectories may be enabled that no single robot can perform (e.g., two robots pushing a large box that cannot be pushed by an individual). Fourth, multiple individual trajectories can be explored simultaneously by a team. For example, ants [22] and honeybees [45] can perform tasks within a time bound that no individual can do by itself within the time bound. Fifth, the nature of the goal determines which trajectories are successful. Steiner’s taxonomy of task types differentiates between unitary tasks and divisible tasks [48]. Divisible tasks can be separated into component subtasks that can each be performed by an individual group member. Unitary tasks must be performed in their entirety, requiring either a coordinated group algorithm or execution by a single team member (or subgroup) with no contributions from others.
Synergy. The synergy potential from adding more agents $H = \{R_k : k \in I\}$ to an existing group of agents $G = \{R_k : k \in J\}$, where $G \cap H = \emptyset$, is the set of “extra” things that can be done given the new agents that cannot be done by the original group:
$$Synergy(H + G) = BP(H \cup G) \setminus \big(BP(G) \cup BP(H)\big).$$
This potential can be extended to the extra things that can be done when agents are added to a set of indexed subsets, but only the definition for two sets is given for simplicity. Of particular interest is what happens when evaluating what can be accomplished by a group, $G = \{R_1, R_2, \ldots, R_n\}$, when the goal is not divisible,
$$Synergy_{div}(G) = BP(G) \setminus \Big(\bigcup_{R_i \in G} BP(R_i)\Big).$$
An analogous definition can be made in terms of success potential, and fulfillment in the presence of synergy can be computed.
Interference. Similarly, interference potential can be defined as the set of trajectories removed from the subset of trajectories for group G when new agents H are added (e.g., R1 blocking the path to R2’s goal location),
$$Interference(G + H) = \overline{BP(G \cup H)} \cap BP(G),$$
where the overline denotes the set complement, so that the result contains the trajectories in $BP(G)$ that are no longer in $BP(G \cup H)$.
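The two set definitions can be sketched with ordinary set operations. This is a minimal illustration that assumes a behavior potential can be represented as a finite set of trajectory labels; the labels are hypothetical. Interference is coded from the prose definition above, namely the trajectories available to G alone that disappear once H is added.

```python
def synergy(bp_combined, bp_g, bp_h):
    """Trajectories of the combined group that neither subgroup has alone."""
    return bp_combined - (bp_g | bp_h)


def interference(bp_combined, bp_g):
    """Trajectories G had on its own that are removed once H is added."""
    return bp_g - bp_combined


# Hypothetical trajectory labels (assumption, for illustration only).
bp_g = {"g_reaches_goal", "g_explores"}
bp_h = {"h_explores"}
bp_combined = {"g_explores", "h_explores", "push_box_together"}

print(synergy(bp_combined, bp_g, bp_h))  # the jointly enabled trajectory
print(interference(bp_combined, bp_g))   # the trajectory blocked by H
```

When interference is non-empty, the superset relations in Eqs. (3)–(4) no longer hold for the affected group.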
An analogous definition can be made in terms of success potential, and fulfillment in the presence of interference can be computed. Synergy and Interference in Swarms. Synergy and interference are illustrated using bio-inspired spatial robot swarms. In this example, spatial swarms are composed of simple agents who only interact with their neighbors based on three rules: repulsion, orientation, and attraction. These rules, based on Reynolds’s rules for boids [42], are representative of biological swarms [2]. Individual agents’ zones of repulsion, orientation, and attraction are centered at an
agent’s position and are parameterized via the radii of repulsion ($r_{rep}$), orientation ($r_{ori}$), and attraction ($r_{att}$), where $r_{rep} < r_{ori} < r_{att}$. The swarm uses the topological communication model [1,3], which assumes an individual can communicate with the $N_T$ nearest agents. Zebrafish have 3–5 topological neighbors [1], while starlings coordinate with the nearest 6–7 birds [3]. The examples in this section present results for a topological number of 7 neighbors. The swarm task is to search for a goal, in which the swarm is to locate and move all agents to a single goal location. The goal area’s size is scaled to ensure the swarm is able to fit within the goal area. The 600 × 600 pixel world is bounded by a wall that exerts a repulsive force. An agent can sense the goal if it is within the radius of attraction of the goal area’s location. Once an agent locates the goal, it communicates the location to its seven neighbors. Agents aware of the goal’s location update their headings by equally weighing the desire to travel to the goal and the desire to follow the interaction rules, as was done by others [8,12]. Synergy and interference are defined using trajectories in the state space. Recall from Eq. 2 that trajectories include computational and effector states. For these spatial swarms, the trajectories include (a) moving from an initial position to new locations, (b) forming topological neighborhoods, (c) communicating goal information, and (d) sensing distances and directions of neighbors. Adding agents to a group can create new trajectories in the form of agent networks that communicate information, shaping where agents move.
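The topological neighborhood and the equal weighting of goal and interaction headings can be sketched as follows. This is a minimal illustration, not the simulator used for the reported results: the function names are hypothetical, positions are 2-D tuples, and the equal weighting is assumed to be the sum of two unit heading vectors.

```python
import math


def topological_neighbors(positions, i, n_t):
    """Indices of the n_t agents nearest to agent i, excluding i itself."""
    xi, yi = positions[i]
    others = [j for j in range(len(positions)) if j != i]
    # Sort the other agents by Euclidean distance to agent i.
    others.sort(key=lambda j: math.hypot(positions[j][0] - xi,
                                         positions[j][1] - yi))
    return others[:n_t]


def blended_heading(rule_heading, goal_heading):
    """Equal-weight blend of two headings given as angles in radians.

    Each heading is treated as a unit vector and the two are summed
    (an assumed form of the equal weighting). Exactly opposite headings
    sum to roughly the zero vector, so the result is not meaningful then.
    """
    x = math.cos(rule_heading) + math.cos(goal_heading)
    y = math.sin(rule_heading) + math.sin(goal_heading)
    return math.atan2(y, x)


# Hypothetical agent positions in the 600 x 600 world.
positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (2.0, 0.0)]
print(topological_neighbors(positions, 0, 2))
print(blended_heading(0.0, math.pi / 2))
```

Because the neighborhood is defined by rank rather than by a metric radius, an agent always has the same number of neighbors regardless of how spread out the swarm is, provided the swarm is large enough.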
Fig. 5. Fulfillment with swarms for N and no obstacles (left) and 20% obstacles (right).
1,800 simulation trials were conducted, where each trial was 1000 iterations, for 50, 100, and 200 agents. Figure 5 left presents results when there are no environmental obstacles. The percent reached represents the number of agents that reached the goal area, expressed as a percentage of the swarm’s size, at the end of the task. This number approximates fulfillment since if 100 robots are in the swarm and 80 reach the goal then 80% of the robots are successful.
In the absence of synergy, if roughly 45% of the 50-agent group reaches the goal then we would expect roughly the same percent to reach the goal for the 100- and 200-agent groups. Synergy increases fulfillment as agents are added because more trajectories are possible and successful trajectories are more likely. Larger groups contribute to success in two ways: (a) more agents explore the world, making it more likely that the goal will be found, and (b) more agents tend to form a large connected component through which goal information propagates, enabling more agents to reach the goal. The world becomes more complex by adding obstacles. Obstacle densities of 10% and 20% of the number of agents were evaluated. Both obstacle density levels (10% and 20%) result in a lower percentage of the swarm robots reaching the goal as the number of robots increases. Figure 5 right illustrates decreased fulfillment for 20% density. With 50 robots, an average of 20% of the agents reached the goal, but with 200 robots the average drops to just a small percentage. The precipitous drop in fulfillment is caused by interference. Obstacles “carve” up the large connected component into several disconnected components, eliminating successful trajectories by preventing goal knowledge from propagating across components.
4.2 Augmentation and Diminishment
When group potential exceeds individual potential, an agent may contribute to the success potential and fulfillment of another agent. Augmentation. Augmented capability represents the increase in an agent’s success potential when partnered with another agent. Augmented capacity represents what R1 gains in terms of achieving its goal, when R1 coordinates with R2. Augmented capability for R1 is the increase in goal-achieving trajectories that arises (a) when R2 induces changes in the evolution function that benefit R1 (e.g., pushing an obstacle out of the way), (b) when sensor information from R2 is used as input to R1’s algorithm (e.g., R2 provides world state information that R1 cannot sense), or (c) when R2’s computational resources are used to solve a problem quickly (within time bound τ) or with a larger amount of memory (e.g., image processing). An augmented robot has a higher level of autonomy, because the augmented capacity is defined as an increase in the success potential, $SP_{aug}(R) \supset SP(R) \Rightarrow LOA_{aug}(R) > LOA(R)$. Relying on another can augment an agent’s capability. Diminishment. Augmenting a robot’s capacity can diminish another’s capability. If R1 requires R2’s computational resources, then R2 may be unable to compute what is needed. Diminished capability can be defined analogously to augmented capacity, and is a form of interference potential.
Verplank and Sheridan’s Levels of Automation. Sheridan’s LOAs can be revisited in the light of augmentation and diminishment. Figure 6 illustrates a robot’s success potential, the green circle surrounding the green ◦’s. The robot can generate many behaviors, but only a fraction of them generate successes; the cyan *’s are robot failures. If the robot receives human input, such as navigation or perceptual support, then all behaviors will reach the goal, illustrated by the larger cyan circle enclosing the green circle. The success potential grows and fulfillment becomes one. With human input, robot R’s LOA increases because $SP_{aug}(R) \supset SP(R)$; the robot, augmented by the human, is strictly more successful.
Fig. 6. Augmenting a robot with human input can diminish the human, assuming the human cannot work on other tasks.
Augmenting the robot can cost the human, because human attention and other computational (i.e., cognitive) resources are used to support the robot. Thus, the human splits resources between two tasks, and the resulting set of human behaviors may no longer lead to success. Figure 6 illustrates human success potential without the robot, the red polygon surrounding the red ×’s. When supporting the robot, the human’s success potential decreases, the blue polygon. The LOA of the human, H, decreases because $SP_{dim}(H) \subset SP(H)$. Whether or not diminishing the human is worth it depends on whether the augmentation benefit is useful. Group potential of non-interacting human and robot is the “size” of the red polygon plus the “size” of the green circle; group potential with interaction is the “size” of the blue polygon plus the “size” of the cyan circle. Size measures can include the probability of encountering one of the trajectories or the utility of the trajectories or some combination. Consider four of Sheridan’s LOAs: autonomous (Auto, level 10), Do-Then-Tell³ (Dtt, levels 6–9), Ask-Then-Do (Atd, levels 2–5), and teleoperation (Tele, level 1). First, assume that Auto means the robot can succeed at all behaviors without human input. Figure 7 left cross-plots ($SP_{dim}(H)$, $SP_{aug}(R)$), the combined success potentials for various Sheridan-based LOAs. The plots assume it is possible for the autonomous robot to accomplish all its goals from any starting
³ Thanks to Lanny Lin for the names of Do-Then-Tell and Ask-Then-Do.
Fig. 7. Success potentials when autonomy is just as capable as a human and robot working together (left) and when autonomy can’t achieve what human and robot together can (right). •’s indicate robot success potentials, ’s indicate human success potentials, and thick +’s indicate group success potentials.
condition – fulfillment is one and success potential is large. What the robot can accomplish sans human help is plotted on the y–axis; the robot wastes time interacting with the human and resorts to autonomous mode. Human-diminishment from reduced computation-budget is plotted on the x-axis, with maximum computational resources available if the human is not obligated to assist the robot. The human-robot team elevates all robot LOAs to that of a fully autonomous robot, but at the cost of what the human can do when not assisting the robot. Group success is plotted as black +’s. For this example, Auto maximizes group fulfillment. Second, assume that the fully autonomous robot does not achieve maximum fulfillment and lacks sufficient capability to achieve maximum success potential. Figure 7 right cross-plots a diminished human and an augmented robot. Sans human input, Dtt and Atd perform the same as Auto, but they are equipped with a human interaction algorithm that allows them to be augmented. Tele must have human input to perform well. With human help, Atd can achieve maximum fulfillment but with human diminishment. Dtt can be augmented with human interaction to achieve high-but-not-maximum fulfillment, with lower human diminishment cost. The shaded rectangles indicate group success potential for Atd and Dtt; larger areas indicate larger group fulfillment. For this example, Dtt maximizes group fulfillment but Atd maximizes robot fulfillment. The ideal robot LOA depends on the success potential and fulfillment for the robot, the human, and the group.
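The trade-off between robot augmentation and human diminishment can be illustrated with a small numerical sketch. All values below are hypothetical and are not read from Fig. 7; they are chosen only to reproduce the qualitative pattern of the second case. The sketch also assumes the additive size measure mentioned earlier (human size plus robot size) for group potential.

```python
# Hypothetical (human_potential, robot_potential) pairs in [0, 1] for the
# second scenario, where Auto alone cannot reach maximum success potential.
# These numbers are illustrative assumptions, not data from the paper.
loa_points = {
    "Auto": (1.0, 0.6),  # human undiminished, robot not augmented
    "Dtt": (0.8, 0.9),   # mild human cost, large robot gain
    "Atd": (0.5, 1.0),   # robot reaches maximum, human heavily diminished
    "Tele": (0.2, 0.9),  # robot depends on continuous human input
}


def group_potential(human, robot):
    """Additive size measure: human size plus robot size (assumption)."""
    return human + robot


best_for_group = max(loa_points, key=lambda k: group_potential(*loa_points[k]))
best_for_robot = max(loa_points, key=lambda k: loa_points[k][1])

print(best_for_group)  # LOA with the largest combined potential
print(best_for_robot)  # LOA with the largest robot potential
```

With these assumed values the LOA that is best for the group differs from the one that is best for the robot, which is the point made about Dtt and Atd.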
5 Summary
This paper provides precise algorithm-based definitions for two attributes of agent autonomy: capability (defined as the size of the success potential set) and non-reliance on another agent (defined using fulfillment). The definitions are extended for multiple agents, leading to notions of synergy and interference. The potential for group capability and fulfillment to be higher than the sum of
individuals in the group makes it possible to estimate tradeoffs in multi-agent teams; specifically, how a contribution from agent A can augment agent B, but at a potential cost in capability and fulfillment for agent A. Case studies were used to illustrate the definitions, emphasizing how the definitions give insight into existing problems.
Acknowledgment. This work has in part been funded by ONR grant #N00014-18-1-2831.
References
1. Abaid, N., Porfiri, M.: Fish in a ring: spatio-temporal pattern formation in one-dimensional animal groups. J. Royal Soc. Interface 7(51), 1441–1453 (2010)
2. Aoki, I.: A simulation study on the schooling mechanism in fish. Bull. Jap. Soc. Sci. Fish. 48(8), 1081–1088 (1982)
3. Ballerini, M., et al.: Interaction ruling animal collective behavior depends on topological rather than metric distance: evidence from a field study. Proc. Natl. Acad. Sci. 105(4), 1232–1237 (2008)
4. Barber, S.K., Goel, A., Martin, C.E.: Dynamic adaptive autonomy in multi-agent systems. J. Exp. Theoret. Artif. Intell. 12(2), 129–147 (2000)
5. Beer, J.M., Fisk, A.D., Rogers, W.A.: Toward a framework for levels of robot autonomy in human-robot interaction. J. Hum.-Robot Interac. 3(2), 74–99 (2014)
6. Bradshaw, J.M., Feltovich, P.J., Jung, H., Kulkarni, S., Taysom, W., Uszok, A.: Dimensions of adjustable autonomy and mixed-initiative interaction. In: Agents and Computational Autonomy, pp. 17–39. Springer (2004)
7. Bradshaw, J.M., et al.: Agent autonomy. In: Hexmoor, H., Falcone, R., Castelfranchi, C. (eds.) Adjustable Autonomy and Human-Agent Teamwork in Practice: An Interim Report on Space Applications. Kluwer (2002)
8. Brown, D.S., Goodrich, M.A., Jung, S.-Y., Kerman, S.C.: Two invariants of human swarm interaction. J. Hum.-Robot Interac. 5(1), 1–31 (2016)
9. Burgard, W., Moors, M., Stachniss, C., Schneider, F.E.: Coordinated multi-robot exploration. IEEE Trans. Robot. 21(3), 376–386 (2005)
10. Cajulis, C.B., Fitzpatrick, J.J.: Levels of autonomy of nurse practitioners in acute care setting. J. Am. Assoc. Nurse Pract. 19(10), 500–507 (2007)
11. Claes, D., Robbel, P., Oliehoek, F.A., Tuyls, K., Hennes, D., van der Hoek, W.: Effective approximations for multi-robot coordination in spatially distributed tasks. In: Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pp. 881–890. International Foundation for Autonomous Agents and Multiagent Systems (2015)
12. Couzin, I.D., Krause, J., Franks, N.R., Levin, S.A.: Effective leadership and decision-making in animal groups on the move. Nature 433, 513–516 (2005)
13. Crandall, J.W., Goodrich, M.A.: Characterizing efficiency of human robot interaction: a case study of shared-control teleoperation. In: Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems, Lausanne, Switzerland (2002)
14. Dias, M.B., et al.: Sliding autonomy for peer-to-peer human-robot teams. In: Proceedings of the Intelligent Conference on Intelligent Autonomous Systems (2008)
15. Dorais, G., Bonasso, R.P., Kortenkamp, D., Pell, B., Schreckenghost, D.: Adjustable autonomy for human-centered autonomous systems. In: Working notes of the Sixteenth International Joint Conference on Artificial Intelligence Workshop on Adjustable Autonomy Systems, pp. 16–35 (1999)
16. Endsley, M.R., Kaber, D.B.: Level of automation effects on performance, situation awareness and workload in a dynamic control task. Ergonomics 42(3), 462–492 (1999)
17. Fong, T., Thorpe, C., Baur, C.: A safeguarded teleoperation controller. In: IEEE International Conference on Advanced Robotics, Budapest, Hungary, August 2001
18. Fong, T., Thorpe, C., Baur, C.: Collaboration, dialogue, human-robot interaction, pp. 255–266. Springer (2003)
19. Gerkey, B.P., Matarić, M.J.: A formal analysis and taxonomy of task allocation in multi-robot systems. Int. J. Robot. Res. 23(9), 939–954 (2004)
20. Goldberg, D., Matarić, M.J.: Interference as a tool for designing and evaluating multi-robot controllers. In: Proceedings, AAAI 1997, Providence, Rhode Island, pp. 637–642, July 1997
21. Goodrich, M.A., Olsen, D.R., Crandall, J.W., Palmer, T.J.: Experiments in adjustable autonomy. In: Proceedings of the IJCAI 2001 Workshop on Autonomy, Delegation, and Control: Interacting with Autonomous Agents (2001)
22. Gordon, D.M.: Ant Encounters: Interaction Networks and Colony Behavior. Princeton University Press, Princeton (2010)
23. Harbers, M., Peeters, M.M.M., Neerincx, M.A.: Perceived autonomy of robots: effects of appearance and context. In: International Conference on Robot Ethics, Lisbon, Portugal (2015)
24. Hexmoor, H., Castelfranchi, C., Falcone, R.: A prospectus on agent autonomy. In: Agent Autonomy, pp. 1–10. Springer (2003)
25. Huang, H.-M., Messina, E., Albus, J.: Toward a generic model for autonomy levels for unmanned systems (ALFUS). Technical report, DTIC Document (2003)
26. Johnson, M.J.: Coactive design: designing support for interdependence in human-robot teamwork. Ph.D. thesis, Technische Universiteit Delft, Delft, The Netherlands (2015)
27. Kaber, D.B., Endsley, M.R.: The effects of level of automation and adaptive automation on human performance, situation awareness and workload in a dynamic control task. Theoret. Issues Ergon. Sci. 5(2), 113–153 (2004)
28. Kaber, D.B., Onal, E., Endsley, M.R.: Design of automation for telerobots and the effect on performance, operator situation awareness and subjective workload. Hum. Factors Ergon. Manuf. 10(4), 409–430 (2000)
29. Kaber, D.B., Riley, J.M.: Adaptive automation of a dynamic control task based on secondary task workload measurement. Int. J. Cogn. Ergon. 3(3), 169–187 (1999)
30. Kaminka, G., Frank, I., Arai, K., Tanaka-Ishii, K.: Performance competitions as research infrastructure: large scale comparative studies of multi-agent teams. Autonom. Agents Multi-Agent Syst. 7(1), 121–144 (2003)
31. Kortenkamp, D., Bonasso, P., Ryan, D., Schreckenghost, D.: Traded control with autonomous robots as mixed initiative interaction. In: AAAI Symposium on Mixed Initiative Interaction, Stanford, CA, USA (1997)
32. Lewis, M.A., Tan, K.-H.: High precision formation control of mobile robots using virtual structures. Autonom. Robots 4(4), 387–403 (1997)
33. Lin, L., Goodrich, M.A.: Sliding autonomy for UAV path-planning: adding new dimensions to autonomy management. In: Proceedings of the International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems (2015)
34. Maes, P.: Modeling adaptive autonomous agents. Artif. Life 1(1–2), 135–162 (1993)
35. Miller, C.A., Funk, H.B., Dorneich, M., Whitlow, S.D.: A playbook interface for mixed initiative control of multiple unmanned vehicle teams. In: Proceedings of the 21st Digital Avionics Systems Conference, vol. 2, pp. 7E4-1–7E4-13, November 2002
36. Mulder, M., Abbink, D.A., Carlson, T.: J. Hum.-Robot Interac. Special issue on Shared Control (2015)
37. Newell, A.: Unified Theories of Cognition. Harvard University Press, Cambridge (1994)
38. Parasuraman, R., Sheridan, T.B., Wickens, C.D.: A model for types and levels of human interaction with automation. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 30(3), 286–297 (2000)
39. Pfeifer, R., Lungarella, M., Iida, F.: Self-organization, embodiment, and biologically inspired robotics. Science 318(5853), 1088–1093 (2007)
40. Proctor, R.W., Van Zandt, T.: Human Factors in Simple and Complex Systems. CRC Press, Boca Raton (2008)
41. Ren, W., Beard, R.W.: Decentralized scheme for spacecraft formation flying via the virtual structure approach. J. Guid. Control Dyn. 27(1), 73–82 (2004)
42. Reynolds, C.: Flocks, herds and schools: a distributed behavioral model. Comput. Graph. 21, 25–34 (1987)
43. Riley, V.: FAIT: a systematic methodology for identifying system design issues and tradeoffs. In: Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (1989)
44. Rofer, T., Lankenau, A.: Ensuring safe obstacle avoidance in a shared-control system. In: Fuertes, J.M. (ed.) Proceedings of the 7th International Conference on Emergent Technologies and Factory Automation, pp. 1405–1414 (1999)
45. Seeley, T.D.: Honeybee Democracy. Princeton University Press, Princeton (2010)
46. Shell, D.A., Matarić, M.J.: On foraging strategies for large-scale multi-robot systems. In: 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2717–2723. IEEE (2006)
47. Sheridan, T.B., Verplank, W.L.: Human and computer control of undersea teleoperators. Technical report, MIT Man-Machine Systems Laboratory (1978)
48. Steiner, I.D.: Group Processes and Group Productivity. Academic, New York (1972)
49. Yim, M., et al.: Modular self-reconfigurable robot systems [grand challenges of robotics]. IEEE Robot. Autom. Mag. 14(1), 43–52 (2007)
A Real-Time Intelligent Intra-vehicular Temperature Control Framework
Daniel Jacuinde-Alvarez, James Dols, and Shahab Tayeb
California State University, Fresno 93740, USA
Abstract. This paper details a real-time vehicle system solution for vehicular heatstroke. In the current implementation, the system uses two Raspberry Pi microcontrollers. One microcontroller is equipped with YOLOv3-Tiny, a “You Only Look Once” (YOLO) real-time object detection model, narrowed to detect pets (dogs and cats) left unattended in a vehicle. It is also responsible for handling the vehicle’s temperature reading, which is used to determine whether corrective action is needed. The second microcontroller’s duty is to transmit an email and a short message service (SMS) message via WiFi to the owner’s email address and cellular phone when an unattended pet is present in the scene. Based on the temperature reading, this microcontroller also alters climate control by sending data through the Controller Area Network (CAN) bus. Both microcontrollers are set up to communicate through the Amazon Web Services Internet of Things Core (AWS IoT Core) service using the Message Queuing Telemetry Transport (MQTT) protocol.
Keywords: CAN bus · Cloud · MQTT · Raspberry Pi
1 Introduction
An average of 37 children and hundreds of pets die from vehicular heatstroke, or pediatric vehicular heatstroke (PVH), each year in the United States of America alone [15,20]. In about 87% of the reported PVH cases, the child was three years old or younger. The majority of PVH cases are accidents, where the parent(s) or owner unknowingly leaves their child or pet unattended in the vehicle [20]. Based on a study done by San Francisco State University, the temperature inside the car rises by an average of 19 °F in the first 10 min, followed by a further 10 °F in the next 10 min [19]. Even on a mild or cloudy day, the temperature inside the vehicle can reach life-threatening levels in minutes [15]. According to the National Highway Traffic Safety Administration (NHTSA), a child’s body temperature can rise three to five times faster than an average adult’s. Pets are at higher risk because, unlike humans, they cannot control their body temperature through sweating [40]. This means that an unattended child or pet in the vehicle has a limited amount of time to be saved from heatstroke or death. Heatstroke starts to occur when the core body temperature of the individual reaches 104 °F. Once the core body temperature reaches about 107 °F, organ failure begins to occur, which may lead to death [18,40]. Earlier research has studied the implementation of microcontrollers to automate various aspects of smart and intelligent transport systems [38,39].
There are several existing products and technologies currently available that attempt to address the issue of PVH. Analogous to our system, these products all aim to remind or notify the driver of the vehicle to check for a child in the back seat. The most basic of these products is a plastic rear-view hang tag that serves as a reminder to the driver to check the back seats of the vehicle before leaving and locking it. Another product is Waze, a mobile navigation application, which has an option that produces a pop-up notification on the mobile device when GPS data indicates that the user has stopped traveling in a vehicle. A more elaborate product is the ChildMinder SoftClip System, which consists of a child seat clip and an associated key fob remote alarm. The clip and key fob are paired together, and if the key fob travels a set distance away from the clip while it is still buckled, the fob sounds an audible alarm notifying the driver to return and check the vehicle. These products all provide a notification to the driver to check the vehicle, but none of them take any action to protect the child from heatstroke events until the driver has returned. As for pets, there is no product currently available to prevent them from being left unattended in a vehicle. The proposed system extends the idea of notifying the driver, but will also react responsively to lower the vehicle’s temperature when appropriate.
The novel contributions of this research include using object recognition to identify pets in a vehicle setting and an automated system able to take corrective actions to keep a vehicle’s temperature regulated. Based on our research, this is the first automated system designed to address the issue of unattended pets in the context of vehicular heatstroke. The rest of this manuscript provides an overview of the background knowledge required for this research. This is followed by a section on the overall system design and functionality. We then present the connection between our proposed system and the previously presented background information, with an in-depth description of the functionality of each major component. Finally, we conclude this manuscript by discussing the results of the research.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 593–612, 2022. https://doi.org/10.1007/978-3-030-82193-7_39
2 Background
2.1 Object Detection
Object detection is a computer vision technique that leverages machine learning or deep learning to predict instances of objects in images or videos. Object detection neural networks can be categorized as either two-stage or single-stage networks. In a two-stage network, one stage identifies region proposals, that is, subsets of the image that might contain an object, and the other stage uses each region proposal to classify objects. In a single-stage network, such as the “You Only Look Once” (YOLO) model used in this project, the convolutional neural network (CNN) produces predictions for regions across the entire image using anchor boxes, and the predictions are decoded to generate the final bounding boxes for the objects in a single stage [27]. Typically, a two-stage network can achieve very accurate results, but at slow prediction rates. A single-stage network can achieve real-time prediction rates but may not reach the same level of accuracy as a two-stage network [35].
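The decoding of overlapping box predictions into final bounding boxes can be illustrated with a short sketch. It shows intersection over union (IoU) and greedy non-maximum suppression, a post-processing step commonly used with single-stage detectors; the text above does not state which decoding procedure this project uses, so this is an assumption, and the boxes, scores, and threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs.

    Keeps the highest-scoring box and drops any later box whose IoU
    with an already kept box exceeds the threshold.
    """
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Hypothetical raw predictions: two overlapping boxes and one separate box.
raw = [
    ((0, 0, 10, 10), 0.9),
    ((1, 1, 10, 10), 0.8),
    ((20, 20, 30, 30), 0.7),
]
final = non_max_suppression(raw)
print(final)
```

In this example the second box overlaps the first with an IoU of 0.81, so only the first and third boxes remain.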
Convolutional Neural Network (CNN)
A CNN is a class of deep neural network widely applied in computer vision because its design was inspired by the human visual system: the neurons of the network are connected in a manner similar to the connectivity of the visual cortex [36]. A CNN is built from hidden layers such as convolutional layers, pooling layers, and fully connected layers, and takes its name from the convolutional layers. Convolutional layers extract edges and other meaningful features from a given input image to help the model determine a prediction. Pooling layers down-sample their input spatially while the number of channels (depth) stays the same. The fully connected layers classify the image into a label using the features produced by the preceding convolutional and pooling layers.
2.3 Controller Area Network (CAN) Bus
Each electrical subsystem in a vehicle is controlled by an electronic control module (ECM). Most subsystems require some degree of communication with other subsystems in order to coordinate activity or trigger actuators. The original vehicle network design used point-to-point wiring to connect each ECM to other ECMs. As the number of ECMs in vehicles continued to grow, this method required increasingly complex wiring. The Controller Area Network (CAN) bus provided a solution to this wiring complexity. Originally developed in the 1980s, the CAN bus became standard in most vehicles by 1993. It offered a much more efficient solution in which all ECMs are connected through a single two-wire bus, making it easier to insert or remove electrical subsystems. The CAN bus is a multi-master broadcast network, meaning that every message broadcast from one node traverses the network and is received by every other node. Figure 1 contrasts the point-to-point and CAN bus networks.
2.4 Message Queuing Telemetry Transport (MQTT)
MQTT is a lightweight messaging protocol designed for resource-constrained devices and frequently used for IoT endpoints [42]. A notable benefit of MQTT is that messages can be queued while a device is offline and retrieved when connectivity is restored. Additionally, MQTT is an asymmetric protocol, which results in decreased overhead. It also supports multiple Quality of Service (QoS) levels: QoS-0 for non-assured transmissions, QoS-1 for assured
D. Jacuinde-Alvarez et al.
Fig. 1. Point-to-point (left), CAN bus (right)
transmissions, and QoS-2 for assured service on applications. These different QoS levels allow overhead to be reduced on less important communications. Yokotani and Sasaki compared the Hypertext Transfer Protocol (HTTP) and MQTT with respect to the network resources consumed and concluded that MQTT is the better performer for IoT devices [43]. MQTT is also the protocol recommended by AWS for communication through IoT Core, and it is supported in the AWS IoT Device Software Development Kit (SDK) for Python.
3 Proposed System
We propose a system that can be integrated into future vehicle manufacturing to reduce the number of vehicular heatstroke related deaths each year. The real-time system is capable of identifying a child or pet left unattended in a vehicle setting and taking a responsive action. The initial focus of the system was on common household pets, specifically dogs and cats. This decision was made to limit the scope of the project, with the idea that after a successful implementation for pets, the project could be expanded in the future to include unattended children as well. To detect occupants in the vehicle, the system uses computer vision, whose main task is to acquire, process, analyze, and understand digital images in order to extract high-level data from the real world, such as numerical or symbolic information [36]. The system uses one of YOLO's real-time object detection models, specifically the YOLOv3–Tiny model, because of its small memory footprint and reasonable accuracy [26,31]. Two datasets were used for the object detection model. The first is the Common Objects in Context (COCO) dataset, which was used to test deployment on the microcontroller. The COCO dataset is widely used to measure the performance of YOLO and similar models because it contains roughly 80 object categories (classes) and over 200,000 labeled images [12]. Once the model was successfully deployed, it was trained on a custom dataset built from Google's Open Images that specifically targeted the objects of our application: person, dog, and cat [11,12]. For the response portion of the system, Qian's method of preventing children from being left in a car is applied [25]. Qian's method uses WiFi or Bluetooth connectivity to send a notification to a registered email address about the pet left inside the vehicle. To add redundancy, the system sends a short message
service (SMS) to the owner as well. In addition to the notification, the vehicle's climate control system is operated through the CAN bus via the On-Board Diagnostics (OBD-II) port to bring the temperature within the set threshold [25]. To monitor the vehicle's ambient temperature, the system is equipped with a high-accuracy DHT22 temperature sensor module. Initially, the system consisted of a single Raspberry Pi 4B microcontroller running a standard Linux OS (Raspbian OS). Because development was already underway in the early stages of the COVID-19 pandemic, the system was adjusted to use two microcontrollers connected through Amazon Web Services Internet of Things Core (AWS IoT Core). The service is used to exchange information between the two devices, such as when an unattended pet is detected and the current temperature reading. One microcontroller (M1) handles the object detection and the reading of the DHT22 temperature sensor, while the other microcontroller (M2) handles the notification transmission and the CAN bus communication that adjusts the temperature using the vehicle's climate control. Figure 2 gives an overview of the system with those adjustments included.
Fig. 2. System overview
When the system is powered on, both microcontrollers are initialized and signed in to AWS IoT Core. They then subscribe to their respective topics: microcontroller M1 to “recheck” and microcontroller M2 to “temp”. Afterwards, M1 is in run mode, while M2 is in standby mode. The system enables object detection on M1 to check for the target objects: person, dog, and cat. To make object detection more reliable, a process is applied to reduce false negative (FN) and false positive (FP) detection errors. An FN detection occurs when an object is present but the model is not able to detect it. An FP detection occurs when the model detects an object but does not label the object correctly. In other words, the process helps determine with high confidence that the objects predicted by the model are truly present. The system is given a time frame in which to perform object detection on the scene. If the objects are detected for more than a specified threshold portion of that time frame, then the system can assume with high confidence that the predicted objects are really
in the scene. With such a process in place, the system first captures the vehicle scene. If the system detects a person in the scene, then the system takes no further action and continues checking by resetting M1. Likewise, if none of the target objects (person, dog, and cat) is detected, the system takes no further action. Once a dog and/or cat is detected without the presence of a person, the system disables object detection to release its resources for other tasks. Afterward, M1 reads the vehicle's temperature and publishes the reading to the topic “temp”. M1 then waits for M2 to publish on the topic “recheck”. When it receives the recheck message, it resets, causing the object detection and temperature reading to occur again as needed. While M1 performs these tasks, M2 continuously checks AWS through its subscribed topic for information sent by M1. Once new data is received through the topic “temp”, M2 executes the notification feature of the system, sending an email and SMS to the registered owner notifying them that an unattended pet is in their vehicle. To avoid spamming the owner, the system sends notifications at most once every five minutes after the first one. The temperature data is then used to check whether the temperature is within the temperature threshold for a pet. If the reading is within the desired threshold, M2 publishes to the topic “recheck”, causing M1 to execute again and M2 to wait. However, if the temperature is not within the threshold, M2 transmits the required CAN frame to turn on the AC of the vehicle and then publishes to the topic “recheck”, so that M1 executes again until the temperature is regulated. M2 then resets and waits for the next temperature update from M1.
3.1 Microcontroller M1
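Before M1 acts on a detection, it confirms that the target objects appear in more than a threshold portion of an observation window, as described in the system overview. The following is a minimal sketch of that confirmation step, not the authors' implementation. The function names, the representation of each frame as a set of labels, and the 0.6 threshold are assumptions made for illustration; the paper does not state the window length or the threshold value.

```python
from typing import Iterable, Set

PETS = {"dog", "cat"}
TARGETS = PETS | {"person"}


def confirmed_labels(frames: Iterable[Set[str]], min_fraction: float = 0.6) -> Set[str]:
    """Return the target labels seen in more than `min_fraction` of the frames.

    `min_fraction` is a hypothetical value; the paper does not give one.
    """
    frame_list = list(frames)
    if not frame_list:
        return set()
    confirmed = set()
    for label in TARGETS:
        hits = sum(1 for labels in frame_list if label in labels)
        if hits / len(frame_list) > min_fraction:
            confirmed.add(label)
    return confirmed


def unattended_pet(frames: Iterable[Set[str]], min_fraction: float = 0.6) -> bool:
    """True when a pet is confirmed and no person is confirmed in the window."""
    present = confirmed_labels(frames, min_fraction)
    return bool(present & PETS) and "person" not in present
```

Requiring a label to persist across most of the window suppresses one-off spurious detections at the cost of a short delay before the system reacts. A stricter variant could treat any sighting of a person as sufficient to cancel the alert.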
Object Detection
YOLO: Since the system must perform in real-time conditions, it requires an object detection model that can meet such specifications [37]. Our literature review showed that the majority of researchers use a family of real-time object detection algorithms called YOLO. YOLO is a CNN that predicts bounding boxes and class probabilities directly from images in a single evaluation. The first version of the algorithm was introduced in 2015 [27]. The latest algorithm developed by Redmon et al., YOLOv3, achieved an average of 54% mean average precision (mAP) on the COCO dataset [28]. mAP is a commonly used metric for the accuracy of an object detector [9]. Precision, recall, and intersection over union (IoU) must be computed on the validation (unbiased) dataset in order to obtain the mAP statistics of the model. Precision measures how accurate the model's predictions are and is calculated by Eq. 1. Recall measures how well the model finds the objects that are present and is calculated by Eq. 2. A true positive (TP) detection occurs when an object is present and the model detects
it correctly, which is the desired detection for a given model.

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{1} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2} \]

IoU measures how well a predicted bounding box overlaps the true bounding box, as the ratio of the area of their intersection to the area of their union, and is calculated by Eq. 3. Here, ground truth refers to the true bounding box of an object, while prediction refers to the model's bounding box prediction for that same object.

\[ \mathrm{IoU} = \frac{\text{Ground Truth} \cap \text{Prediction}}{\text{Ground Truth} \cup \text{Prediction}} \tag{3} \]
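As a worked illustration of Eqs. 1 to 3, the sketch below computes precision, recall, and IoU for axis-aligned boxes. The function names and the (x_min, y_min, x_max, y_max) box convention are assumptions chosen for this example, and returning 0.0 for an empty denominator is a convenience rather than something the paper specifies.

```python
def precision(tp: int, fp: int) -> float:
    """Eq. 1: fraction of the model's detections that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Eq. 2: fraction of the objects present that the model detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def iou(ground_truth, prediction) -> float:
    """Eq. 3: intersection area over union area of two boxes.

    Boxes are (x_min, y_min, x_max, y_max); this layout is an assumption.
    """
    gx1, gy1, gx2, gy2 = ground_truth
    px1, py1, px2, py2 = prediction
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(gx2, px2) - max(gx1, px1))
    inter_h = max(0.0, min(gy2, py2) - max(gy1, py1))
    intersection = inter_w * inter_h
    area_gt = (gx2 - gx1) * (gy2 - gy1)
    area_pred = (px2 - px1) * (py2 - py1)
    union = area_gt + area_pred - intersection
    return intersection / union if union else 0.0
```

For two 2 × 2 boxes offset by one unit in each direction, the intersection is 1 and the union is 4 + 4 − 1 = 7, so the IoU is 1/7.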
Compared with other algorithms, YOLOv3 achieves similar accuracy at a much higher speed. The only observed downside is that the YOLOv3 architecture is too large to run on a memory-constrained embedded device such as a Raspberry Pi. This remains true even with the aid of the Intel Movidius Neural Compute Stick (NCS), a first-generation low-power deep learning accelerator for microcontrollers [7,31]. Redmon et al. developed a smaller YOLO architecture named YOLOv3–Tiny that is suited for embedded computer vision and deep learning devices such as the Raspberry Pi, Google Coral, and NVIDIA Jetson Nano. YOLOv3–Tiny is still a CNN, but compared with YOLOv3, it has a shallower network depth and fewer parameters [31]. The model's architecture consists of a total of 24 layers, of which 2 are YOLO prediction layers, as shown in Table 1. The cost of such a small architecture is that roughly half of the accuracy is lost, with 23% mAP on the COCO dataset. The model is still suitable for small to medium applications [41].
Framework: Numerous frameworks can be used to deploy the chosen model, such as Darknet, TensorFlow, Keras, and PyTorch. Each framework has its own advantages and disadvantages, such as dataset training size and speed. Keras supports rapid prototyping but with a small dataset size. PyTorch supports flexibility but short training duration, and TensorFlow supports a large dataset and high performance [32]. The Darknet framework supports a large dataset, achieves higher performance than TensorFlow, and supports extensive training duration [10]. Additionally, the framework was developed around the YOLO architecture. As a result, the Darknet framework was selected for this system.
Dataset: For object detection, the COCO dataset and a custom dataset built from Google Open Images are used. Since the system only needs to detect certain classes (person, dog, and cat), a custom dataset is beneficial. The custom dataset consists of roughly 30,000 labeled images for the class Person, 20,000 labeled images for the class Dog, and 12,000 labeled images for the class Cat. Open Images was selected because its data has characteristics similar to those of other datasets and it had the most labeled images available per class [1,11].
Table 1. YOLOv3–tiny (custom version) architecture
Layer  Type           Filters  Size   Stride  Input            Output
0      Convolutional  16       3 × 3  1       416 × 416 × 3    416 × 416 × 16
1      Maxpool        –        2 × 2  2       416 × 416 × 16   208 × 208 × 16
2      Convolutional  32       3 × 3  1       208 × 208 × 16   208 × 208 × 32
3      Maxpool        –        2 × 2  2       208 × 208 × 32   104 × 104 × 32
4      Convolutional  64       3 × 3  1       104 × 104 × 32   104 × 104 × 64
5      Maxpool        –        2 × 2  2       104 × 104 × 64   52 × 52 × 64
6      Convolutional  128      3 × 3  1       52 × 52 × 64     52 × 52 × 128
7      Maxpool        –        2 × 2  2       52 × 52 × 128    26 × 26 × 128
8      Convolutional  256      3 × 3  1       26 × 26 × 128    26 × 26 × 256
9      Maxpool        –        2 × 2  2       26 × 26 × 256    13 × 13 × 256
10     Convolutional  512      3 × 3  1       13 × 13 × 256    13 × 13 × 512
11     Maxpool        –        2 × 2  1       13 × 13 × 512    13 × 13 × 512
12     Convolutional  1024     3 × 3  1       13 × 13 × 512    13 × 13 × 1024
13     Convolutional  256      1 × 1  1       13 × 13 × 1024   13 × 13 × 256
14     Convolutional  512      3 × 3  1       13 × 13 × 256    13 × 13 × 512
15     Convolutional  24       1 × 1  1       13 × 13 × 512    13 × 13 × 24
16     YOLO           –        –      –       –                –
17     Route 13       –        –      –       –                13 × 13 × 256
18     Convolutional  128      1 × 1  1       13 × 13 × 256    13 × 13 × 128
19     Up-Sampling    –        2 × 2  1       13 × 13 × 128    26 × 26 × 128
20     Route 19,8     –        –      –       –                26 × 26 × 384
21     Convolutional  256      3 × 3  1       26 × 26 × 384    26 × 26 × 256
22     Convolutional  24       1 × 1  1       26 × 26 × 256    26 × 26 × 24
23     YOLO           –        –      –       –                –
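The shapes in Table 1 can be checked by propagating the input size through layers 0 to 15. The sketch below does this under two stated assumptions: every layer uses "same" padding (so a stride-1 layer keeps the spatial size), and each YOLO prediction layer uses three anchor boxes, so the preceding convolution needs (classes + 5) × 3 filters. The helper names are hypothetical and this is not Darknet code.

```python
import math


def conv_same(shape, filters, stride=1):
    """Output shape of a convolution with 'same' padding (assumed)."""
    h, w, _ = shape
    return (math.ceil(h / stride), math.ceil(w / stride), filters)


def maxpool_same(shape, stride):
    """Output shape of a max-pooling layer with 'same' padding (assumed)."""
    h, w, c = shape
    return (math.ceil(h / stride), math.ceil(w / stride), c)


def yolo_head_filters(num_classes, anchors_per_scale=3):
    """Filters feeding a YOLO layer: 4 box values + 1 objectness + classes, per anchor."""
    return (num_classes + 5) * anchors_per_scale


# Layers 0-14 of Table 1: ("conv", filters) or ("pool", stride).
BACKBONE = [
    ("conv", 16), ("pool", 2), ("conv", 32), ("pool", 2), ("conv", 64),
    ("pool", 2), ("conv", 128), ("pool", 2), ("conv", 256), ("pool", 2),
    ("conv", 512), ("pool", 1), ("conv", 1024), ("conv", 256), ("conv", 512),
]


def trace_shapes(input_shape=(416, 416, 3), num_classes=3):
    """Return [input, output of layer 0, ..., output of layer 15]."""
    shapes = [input_shape]
    for kind, arg in BACKBONE:
        if kind == "conv":
            shapes.append(conv_same(shapes[-1], arg))
        else:
            shapes.append(maxpool_same(shapes[-1], arg))
    # Layer 15: 1 x 1 convolution that feeds the first YOLO layer.
    shapes.append(conv_same(shapes[-1], yolo_head_filters(num_classes)))
    return shapes
```

With the three custom classes (person, dog, cat) the head has (3 + 5) × 3 = 24 filters, which matches layers 15 and 22 in Table 1; with the 80 COCO classes the same formula gives 255. The Route 19,8 layer concatenates the 128 channels of layer 19 with the 256 channels of layer 8, giving the 384 channels listed for layer 20.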
Model Acceleration: Since the system does not use a graphics card, the performance of the object detection model was lower than normal. To improve performance, OpenCV and the Intel Movidius Neural Compute Stick II were used. OpenCV is used to monitor real-time object detection and provide image-processing acceleration [5]. The Intel Movidius Neural Compute Stick II (NCSII), a second-generation low-power deep learning accelerator targeted at microcontrollers, was used with OpenVINO to improve the image-processing performance of the model [14]. According to a benchmark comparison conducted by Adrian Rosebrock, the average speed of a Raspberry Pi 3B+ running MobileNet SSD object detection without NCSII and OpenVINO is 0.635 frames per second (FPS). With NCSII and OpenVINO, its average speed is 8.340 FPS, a roughly 13-fold improvement [30].
Camera: With the object detection model chosen, the camera was selected based on direct compatibility with the model. YOLOv3–Tiny takes an input image of size S × S, where S is a multiple of 32, e.g., a 416 × 416 image. Considering ease of implementation on the Raspberry Pi and the image requirements of the object detection model, the Raspberry Pi Camera V2.1 was found to be a well-rounded camera. The Pi camera attaches directly to the Camera Serial Interface (CSI) provided on the Raspberry Pi. In addition, the camera's image size can be set through its different modes, and other helpful features such as brightness, sharpness, and image rotation can be configured for the application [23]. Table 2 shows the corresponding image size for each mode of the camera. Mode 7 is the only image size that the object detection model can directly use as the input of Pi camera V2.1 without suffering loss of detail on the input image.
Table 2. Pi camera V2.1 modes
Mode  Size         Ratio  FPS     Binning
0     Automatic    –      –       –
1     1920 × 1080  16:9   0.1–30  None
2     3280 × 2464  4:3    0.1–15  None
3     3280 × 2464  4:3    0.1–15  None
4     1640 × 1232  4:3    0.1–40  2 × 2
5     1640 × 922   16:9   0.1–40  2 × 2
6     1280 × 720   16:9   40–90   2 × 2
7     640 × 480    4:3    40–200  2 × 2
Training: Currently, there are three different convolutional weight files associated with the YOLOv3-Tiny model: darknet53.conv.74, yolov3-tiny.conv.11, and yolov3-tiny.conv.15. Each weight file contains the weight values for the neuron connections in the hidden layers of the given network. A weight file can contain from hundreds to billions of weight values, depending on the model's architecture. The weights determine which neurons have the greatest influence on the next layer of the model and thus on the model's output. The training sessions were run on Google Colaboratory, also known as Google Colab, a cloud-based Python service that leverages Google hardware: graphics processing units (GPUs) for irregular computation and tensor processing units (TPUs) for high training throughput for CNNs [3].
DHT22 Sensor
With the system handling life-or-death scenarios, a high-accuracy temperature sensor that can withstand extreme temperatures was necessary. A number of sensors with such capability were found, and the DHT22 temperature and humidity sensor was selected. The DHT22 has a temperature reading range of −40 °C to 125 °C with an accuracy of ±0.5 °C. The sensor has a sampling rate of 0.5 Hz, which translates to one reading every two seconds. The sensor operates on a single 3.3 V to 5.5 V DC supply, making it easy to connect to the Raspberry Pi through its GPIO pins. The sensor comes in 4-pin and 3-pin forms, but both use only 3 pins. The sensor outputs a 40-bit sequence, defined as: 16-bit humidity value + 16-bit temperature value + 8-bit checksum. The relative humidity (in percent) and the temperature (in °C) are obtained from their corresponding 16-bit binary values by converting each to decimal and dividing by 10. The checksum is the low 8 bits of the sum of the four data bytes. For example, if the output sequence retrieved from the sensor is 0000 0010 1000 1100 0000 0001 0101 1111 1110 1110, the resulting values are 652/10 = 65.2% humidity and 351/10 = 35.1 °C, and the checksum is 11101110b (238 in decimal), which matches 2 + 140 + 1 + 95 = 238. A custom library was developed from the original “Adafruit DHT” library from Adafruit to support the Raspberry Pi 4 Model B.
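The 40-bit frame layout above can be decoded in a few lines. This is a minimal sketch of the decoding arithmetic only, not the custom library mentioned in the text; the function name is hypothetical, and the handling of negative temperatures (top bit of the temperature word used as a sign flag) is taken from common DHT22 datasheets rather than from this paper.

```python
def decode_dht22(bits: str):
    """Decode a 40-bit DHT22 frame given as a string of 0s and 1s.

    Returns (relative humidity in %, temperature in degrees C).
    Raises ValueError on a malformed frame or checksum mismatch.
    """
    bits = bits.replace(" ", "")
    if len(bits) != 40 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 40 binary digits")
    data = int(bits, 2).to_bytes(5, "big")
    # Checksum: low 8 bits of the sum of the four data bytes.
    if (sum(data[:4]) & 0xFF) != data[4]:
        raise ValueError("checksum mismatch")
    humidity = int.from_bytes(data[0:2], "big") / 10
    raw_temp = int.from_bytes(data[2:4], "big")
    # Assumption from DHT22 datasheets: the top bit flags a negative value.
    temperature = (raw_temp & 0x7FFF) / 10
    if raw_temp & 0x8000:
        temperature = -temperature
    return humidity, temperature
```

Applied to the example sequence in the text, the function returns 65.2% humidity and 35.1 °C, and a frame with any corrupted bit is rejected by the checksum test instead of producing a wrong temperature.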
3.2 Microcontroller M2
Notification
The Raspberry Pi contains an integrated wireless module that supports 2.4 GHz and 5.0 GHz IEEE 802.11ac wireless, Bluetooth 5.0, and Bluetooth Low Energy (BLE), so the notification feature can be implemented without additional hardware [22]. The microcontrollers are already required to connect to the AWS cloud via an internet connection, so Wi-Fi is used for the notification feature as well. As for the choice of Wi-Fi band, the 2.4 GHz band provides coverage at a longer range than 5 GHz because higher frequencies penetrate solid objects, such as walls and floors, less effectively. In terms of data transmission speed, 5 GHz is about 2 to 3 times faster than 2.4 GHz [17]. Because the system is expected to operate in city conditions and the notification payload is relatively small (about 600 bytes), transmission speed is not a major factor, so 2.4 GHz is used. To send an email, the simple mail transfer protocol (SMTP), a communication protocol for electronic mail transmission, is used [29]. Several network port numbers are associated with SMTP, such as ports 25, 587, 465, and 2525 [16]. As of 2020, the default port for SMTP submission is port 587, which is used when establishing a communication session with the email server. To send an email, information such
as the sender's email address, the recipient's email address, and the message to be sent must be known beforehand. For this system, the sender is the system's email address, the recipient is the owner's email address, and the message concerns the unattended pet. The credentials of the system's email account are encrypted using Fernet, a symmetric encryption scheme available in Python's cryptography package, to add a layer of security. To provide a layer of redundancy, the system sends a short message service (SMS) text message to the owner as well. Nearly all cellular carriers offer free SMS gateways that are associated with a cell phone number. These gateways allow normal emails to be converted to mobile phone messages [8]. To send such an SMS text message, all that is needed is a 10-digit cell phone number and its carrier's gateway domain, both of which are captured through the front-end application of the system. Once this information is obtained, the same approach as for sending an email is taken, except that the recipient's address is now the phone number followed by the carrier's gateway domain. For example, if the phone number is 1234567890 and its carrier is AT&T, the associated address is 1234567890 followed by “@” and AT&T's gateway domain.
CAN Bus Alteration
To communicate with the automobile's climate control system, it is necessary to have access to the intra-vehicle network (IVN). For modern automobiles, this network is typically a combination of several smaller segmented networks, which may include CAN bus, FlexRay, media oriented systems transport (MOST), or local interconnect network (LIN) bus. These networks are all connected through a central gateway, generally terminating at a diagnostic access port [21]. There are several options for connecting to the CAN bus in a modern vehicle, ranging from physical connections to short- and long-range wireless options [4]. Because the proposed system is integrated inside the vehicle, a physical connection to the CAN bus was selected.
The standard physical access point to the CAN bus in a vehicle is the OBD-II port, which many automobile original equipment manufacturers (OEMs) refer to interchangeably as the diagnostic link connector (DLC) [6,13]. The OBD-II port is typically located on the lower left or right side of the steering column. It has been standard on all automobiles since 1996 and has a sixteen-pin layout, as shown in Fig. 3. Although the OBD-II port is an industry-standard connection, only half of the pins are specified by the standard, while the rest are used at the manufacturer's discretion, as described in Table 3.
Fig. 3. OBD-II pins
Table 3. OBD-II pin description
PIN  Description       PIN  Description
1    OEM choice        9    OEM choice
2    J1850 bus+        10   J1850 bus−
3    OEM choice        11   OEM choice
4    Chassis ground    12   OEM choice
5    Signal ground     13   OEM choice
6    CAN high          14   CAN low
7    ISO 9141 high     15   ISO 9141 low
8    OEM choice        16   Battery power (12 V+)
CAN Frame: The CAN protocol was designed for real-time operation and lacks the basic security tenets of confidentiality, integrity, and availability. Data on the CAN bus is not encrypted, and as a result, anyone or any device with access to the network can freely monitor and read the CAN frames without any authorization or authentication taking place. The structure of a typical CAN frame can be logically broken up into several segments, as shown in Fig. 4. The first bit of the CAN frame is the start of frame (SOF) bit, which indicates the beginning of the incoming frame. The next 11 bits specify the CAN ID, which each node or electronic control module (ECM) examines to determine whether the message is intended for it. On the CAN bus, a lower CAN ID has a higher priority. The IDs assigned to a specific node are at the system designer's discretion. General practice is to assign higher-priority IDs to the more critical nodes, such as powertrain, anti-lock braking, or emergency systems. It is also fairly common for nodes with similar functionality to be assigned IDs that are close together in a range. The next bit is the remote transmission request (RTR) bit, which indicates whether the message is being sent to or requested from the node indicated by the CAN ID. The next two bits are the identifier extension (IDE) bit, which indicates whether the frame is a standard or extended frame, and a reserved bit. The data length code consists of four bits that tell the receiving node how many bytes to expect in the payload (data field), which can range from 0 to 64 bits. The next 16 bits carry the cyclic redundancy check (CRC) used to verify transmission accuracy, followed by a two-bit acknowledgment (ACK). The last seven bits mark the end of the frame (EOF).
Fig. 4. CAN frame format
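The field widths listed for the standard CAN frame can be summed to see how much of each frame is overhead, and the priority rule can be expressed as a one-line comparison. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the bit count ignores bit stuffing and the inter-frame gap, which add further bits on a real bus.

```python
# Field widths in bits, as described for the standard (11-bit ID) frame.
FIELD_BITS = {
    "SOF": 1,
    "CAN ID": 11,
    "RTR": 1,
    "IDE + reserved": 2,
    "data length code": 4,
    "CRC": 16,
    "ACK": 2,
    "EOF": 7,
}


def frame_bits(payload_bytes: int) -> int:
    """Nominal frame length in bits for a payload of 0 to 8 bytes."""
    if not 0 <= payload_bytes <= 8:
        raise ValueError("a classic CAN payload is 0 to 8 bytes")
    return sum(FIELD_BITS.values()) + 8 * payload_bytes


def arbitration_winner(can_ids):
    """The lowest CAN ID has the highest priority and wins arbitration."""
    return min(can_ids)
```

The fixed fields add up to 44 bits, so a frame carrying the full 8-byte payload is nominally 108 bits long, of which 64 are data.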
Node Insertion: The test vehicle used for this system was a 2012 Ford Fusion. Based on Miller's findings, Ford engineers employ a secondary medium-speed CAN bus operating at 125 kbps for lower-priority ECUs [13]. This secondary CAN bus is accessible through the OBD-II port on pins 3 and 11 for the CAN high and CAN low signals, respectively. To establish a node via the OBD-II port, a PiCAN3 module with a switch mode power supply (SMPS) was used. This attachment module provides direct CAN bus capabilities for the Raspberry Pi 4 microcontroller through an MCP2515 CAN controller and an MCP2562 CAN transceiver. The only connections made from the PiCAN to the OBD-II port are the following:
– Pin 16: Power Supply (12 V+)
– Pin 4: Chassis Ground
– Pin 3: CAN High
– Pin 11: CAN Low
Once the connections are established, the Raspberry Pi has no direct connection to any video display or input device. A virtual network computing (VNC) server was therefore set up on the Raspberry Pi, which enabled the Pi to be accessed from a remote device over the WiFi network present at the location [24]. For interacting with a CAN bus, Linux provides a set of utilities through the open-source SocketCAN drivers. The drivers enable access to a CAN network by initializing a socket, in a similar fashion to a Transmission Control Protocol/Internet Protocol (TCP/IP) connection, and binding that socket to an interface for reading or writing data. CAN-Utils provides several tools for interacting with a CAN bus network, which are used in the process of reverse engineering the CAN IDs.
Reverse Engineering: Before attempting to reverse engineer the CAN bus signals on a physical vehicle, the process was applied to a CAN bus simulator. The simulator selected was the open-source ICSim (Instrument Cluster Simulator), written by Craig Smith [33]. The reverse engineering process is adapted from those described by Currie and Smith in their respective publications [6,34]. Once communication with the CAN bus network is established, the traffic on the CAN bus is observed using cansniffer, a tool from CAN-Utils. Cansniffer can automatically filter out CAN IDs whose signals are static; in other words, only the CAN IDs whose data has recently changed are displayed. Cansniffer's output is read from left to right. The first column is the delta field, a time delta that indicates when each CAN frame was received relative to the other CAN frames. The next column is the ID field, which is the CAN ID (arbitration ID) in hexadecimal format. The next field consists of eight columns, each displaying one byte of the payload as two hexadecimal digits (each digit representing one nibble, or 4 bits). The last column displays the ASCII character representation of the data. Moving to the vehicle, the traffic on the network is first observed for a few seconds while taking no actions, to establish a baseline of traffic on the network.
Events are then triggered by pressing the buttons on the control panel to turn on the fan blowers, turn on the AC, turn on the MAX AC, and adjust the air output through the dash and floor vents. The traffic on the CAN network is observed during these changes, and two CAN IDs (387 and 388) appear in the traffic whenever changes are made to the AC. Filtering on those two CAN IDs while the AC is turned off shows payloads of all zeroes, as shown in Fig. 5.
Fig. 5. Climate control CAN IDs
Once the AC is turned on, the traffic on the bus cycles periodically between the original signal shown in Fig. 6 and the signals shown in Fig. 7. Now that the climate control CAN IDs are known, the desired CAN frame can be generated and transmitted to the IVN using cansend from CAN-Utils.
Fig. 6. CAN traffic for turning on AC
Fig. 7. CAN traffic while AC is running
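The reverse engineering steps above amount to comparing a baseline capture with a capture taken while a control is operated, then replaying a frame for the ID of interest. The sketch below illustrates that idea offline, without touching a bus. The helper names and the sample payloads are hypothetical, and the IDs 387 and 388 are treated as hexadecimal, which is an assumption based on how cansniffer displays IDs. The struct layout `=IB3x8s` is the 16-byte classic `can_frame` layout used by Linux SocketCAN, and the `ID#DATA` string is the argument format accepted by cansend.

```python
import struct

# Linux SocketCAN classic frame: 32-bit ID, 8-bit length, 3 pad bytes, 8 data bytes.
CAN_FRAME_FMT = "=IB3x8s"


def changed_ids(baseline: dict, event: dict) -> list:
    """CAN IDs that are new, or whose payload differs from the baseline capture."""
    return sorted(cid for cid, data in event.items() if baseline.get(cid) != data)


def cansend_arg(can_id: int, data: bytes) -> str:
    """Format a standard frame the way `cansend <iface> ID#DATA` expects it."""
    if len(data) > 8:
        raise ValueError("a classic CAN payload is at most 8 bytes")
    return f"{can_id:03X}#{data.hex().upper()}"


def pack_can_frame(can_id: int, data: bytes) -> bytes:
    """Pack a frame for writing to a raw SocketCAN socket."""
    if len(data) > 8:
        raise ValueError("a classic CAN payload is at most 8 bytes")
    return struct.pack(CAN_FRAME_FMT, can_id, len(data), data.ljust(8, b"\x00"))
```

For instance, `cansend_arg(0x387, bytes(8))` yields `387#0000000000000000`, the all-zero payload observed while the AC is off. The payload that actually switches the AC on is vehicle specific and is not given in the text, so it is deliberately left out of this sketch.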
3.3 Cloud Communication
To connect or communicate using AWS IoT Core, both microcontrollers must be configured and registered with the service as a “Thing”. For the system’s configuration, the AWS IoT Device Software Development Kit (SDK) v2 for Python is used. The SDK contains all the necessary software to establish a secure Transport Layer Security (TLS) connection between the device and AWS. Additionally, it
provides an application programming interface (API) for handling message subscriptions, connections, disconnections, and the publishing of new messages [2]. For registration, each device is given its own private key and root CA certificate, which are stored locally on the device to enable connection to AWS. An IoT Core policy is created, which defines the rights that the devices have on the service. The policy can restrict a device to only publishing or only subscribing. It is best security practice to limit each device policy to grant only the access absolutely needed to perform its function. That way, if the security of an individual device is compromised and a threat actor gains control over the resource, the attacker is still limited. Using MQTT, each microcontroller publishes messages to topics to which the other microcontroller is subscribed. In this manner, the microcontrollers exchange messages related to the presence of a pet, the measured temperature, and the state of the climate controls in the vehicle. Custom functions are used for publishing to a topic and subscribing to a topic. AWS IoT Core provides a test interface in the web console where one can subscribe to topics over a Hypertext Transfer Protocol Secure (HTTPS) connection and view the messages published to them. Figure 8 shows the interface for the topic “recheck”.
Fig. 8. Monitoring messages on the recheck
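The exchange over the “temp” and “recheck” topics can be summarized by the handler that M2 runs when a temperature message arrives. The sketch below models only that decision logic and is not the authors' code or the AWS SDK API: the class name, the JSON payload with a `temp_c` field, and the 26 °C threshold are assumptions, and the publish, notify, and CAN-send operations are passed in as plain callables so the logic can be exercised without a cloud connection.

```python
import json
import time

NOTIFY_INTERVAL_S = 5 * 60   # the text states at most one notification per five minutes
MAX_SAFE_TEMP_C = 26.0       # hypothetical threshold; the paper does not give a value


class M2Controller:
    """Decision logic for M2 when a message arrives on the "temp" topic."""

    def __init__(self, publish, notify, send_ac_frame, clock=time.monotonic):
        self._publish = publish              # callable(topic, payload)
        self._notify = notify                # callable(temp_c): email + SMS
        self._send_ac_frame = send_ac_frame  # callable(): transmit the AC CAN frame
        self._clock = clock
        self._last_notified = None

    def on_temp(self, payload: str) -> None:
        temp_c = json.loads(payload)["temp_c"]
        now = self._clock()
        # Rate-limit owner notifications to one per interval.
        if self._last_notified is None or now - self._last_notified >= NOTIFY_INTERVAL_S:
            self._notify(temp_c)
            self._last_notified = now
        # Turn on the AC only when the reading is above the threshold.
        if temp_c > MAX_SAFE_TEMP_C:
            self._send_ac_frame()
        # In both cases, ask M1 to run detection and measurement again.
        self._publish("recheck", json.dumps({"recheck": True}))
```

Injecting the callables keeps the handler independent of the transport, so the same logic could be wired to the MQTT client on the device or to simple lists in a unit test.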
4 Results
The custom-trained YOLOv3-Tiny model achieved better performance, with an average of 39.84% mAP@0.5, than the model pre-trained on the COCO dataset, which reaches 23% mAP. The object detection also achieved 16 to 23 FPS on a live image feed. There is still room for improvement in the object detection, such as training the model on a larger and better-fitted dataset. A further improvement would be to replace the current model with YOLOv4-Tiny, which achieves 40% mAP@0.5 on the COCO dataset. With the new tailored dataset, that model would be expected to achieve 50% mAP or greater. The notification feature was successfully developed, with the results shown in Fig. 9 and an execution time of less than 30 s. An optional improvement is to extend the service to other cellular providers. In addition, the notification function could be extended to alert local law enforcement agencies as well.
Fig. 9. SMS (left) and Email (right) notification
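The email and SMS notifications shown in Fig. 9 both travel as ordinary SMTP messages, with the SMS recipient formed from the phone number and the carrier's gateway domain. The sketch below shows one way to build and send such a message with Python's standard library. It is an illustration under stated assumptions, not the authors' code: the function names, subject line, and message wording are hypothetical, and the host, addresses, and gateway domain must be supplied by the caller (placeholder `example.com` values are used in any usage).

```python
import smtplib
from email.message import EmailMessage

SMTP_SUBMISSION_PORT = 587  # submission port named in the text


def sms_gateway_address(phone_number: str, gateway_domain: str) -> str:
    """Build the email-to-SMS address: 10-digit number + '@' + carrier gateway."""
    digits = "".join(ch for ch in phone_number if ch.isdigit())
    if len(digits) != 10:
        raise ValueError("expected a 10-digit phone number")
    return f"{digits}@{gateway_domain}"


def build_alert(sender: str, recipients: list, temp_c: float) -> EmailMessage:
    """Compose the notification; the wording here is illustrative only."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = "Unattended pet detected"
    msg.set_content(
        f"An unattended pet was detected in your vehicle. "
        f"Cabin temperature: {temp_c:.1f} degrees C."
    )
    return msg


def send_alert(msg: EmailMessage, host: str, username: str, password: str) -> None:
    """Send over SMTP with STARTTLS; requires network access and valid credentials."""
    with smtplib.SMTP(host, SMTP_SUBMISSION_PORT) as server:
        server.starttls()
        server.login(username, password)
        server.send_message(msg)
```

Because the SMS path reuses the email path, supporting another carrier only requires a different gateway domain for `sms_gateway_address`.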
The completed system was tested by sending the CAN frames to a virtual CAN network rather than a physical vehicle network. The lack of a dedicated test vehicle was the limiting factor: the only vehicles available were personally owned, and there was no tolerance for any downtime in their operational state. In future work, it would be useful to obtain a vehicle for testing the CAN messages to verify that the reverse-engineered data frames produce the expected responses from the electronic control modules. However, different CAN IDs may be used for the same function depending on the vehicle's manufacturer, model, and year, which is a major obstacle to extending this project to multiple vehicles. To offer this system as a general solution, additional work is needed to find a less time-consuming way to identify or reverse engineer the required CAN IDs for each target vehicle. The network traffic between AWS and the Raspberry Pi was analyzed with Wireshark, a protocol analyzer, as shown in Fig. 10. Line number 652 shows the outgoing packet from the microcontroller to AWS, and the response is received 110 ms later. The outgoing payload is 67 bytes and the return payload to the microcontroller is 33 bytes. This is just one example, but after monitoring multiple data exchanges using Wireshark, these latency
Intelligent Intra-vehicular Temperature Control
609
Fig. 10. Packet sniffing with wireshark
values and data packet lengths did not vary significantly from one communication to the next. The AWS IoT core and MQTT protocol supports additional features that were not implemented in this project. A possible future expansion of this system could include support for shadow states for each microcontroller. Shadow states enable a temporary holding area for messages to queue when they are transmitted to the IoT Core service. The next time the microcontroller connects and subscribes to the topic it would be able to retrieve all of the messages that queued when it was offline. This type of message handling was not considered during the initial design of the system. As a result the program was written to avoid the need for queued messages.
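The holding-area behaviour described above can be illustrated with a small, self-contained sketch. It is not the AWS IoT API and not the project's code: the class name `HoldingArea`, the per-topic cap `max_queued` and the topic strings are all hypothetical, chosen only to show messages being buffered while a device is offline and handed over when it reconnects.

```python
from collections import defaultdict, deque


class HoldingArea:
    """Toy broker-side buffer (hypothetical, not an AWS interface)."""

    def __init__(self, max_queued=100):
        self.max_queued = max_queued        # assumed cap on queued messages per topic
        self.online = set()                 # topics whose subscriber is connected
        self.queues = defaultdict(deque)    # topic -> messages waiting for delivery
        self.delivered = defaultdict(list)  # topic -> messages already handed over

    def publish(self, topic, payload):
        """Deliver immediately if the subscriber is online, otherwise queue."""
        if topic in self.online:
            self.delivered[topic].append(payload)
            return
        queue = self.queues[topic]
        if len(queue) >= self.max_queued:
            queue.popleft()                 # drop the oldest message when full
        queue.append(payload)

    def connect(self, topic):
        """Subscriber comes online and receives everything that was queued."""
        self.online.add(topic)
        backlog = list(self.queues.pop(topic, []))
        self.delivered[topic].extend(backlog)
        return backlog

    def disconnect(self, topic):
        """Subscriber goes offline; later messages are queued again."""
        self.online.discard(topic)
```

For example, publishing three messages to an offline topic with `max_queued=2` and then calling `connect` returns only the two most recent messages. Whether old messages should be dropped or kept is a design choice the paper does not discuss.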
5 Conclusion
The accuracy of the YOLOv3-Tiny model trained with the custom-tailored dataset, as indicated previously in this paper, shows that it is possible to implement an effective object detection model on a microcontroller with limited memory and compute resources. Additionally, successful transmission of the CAN frames responsible for the AC control to the virtual CAN bus shows that it is possible to directly manipulate an electric control module by posing as a node on the intra-vehicle network and injecting data. With a total system latency of under 60 s in ideal conditions, the system provides a response time well within 10 min, which is suitable to prevent any subject from suffering a vehicular heatstroke-related event.
D. Jacuinde-Alvarez et al.
Acknowledgment. This research was partially supported by an SB-1 grant with Fresno State Transportation Institute.
References 1. Open image v6-description (2020). https://storage.googleapis.com/openimages/ web/factsfigures.html. Accessed 20 Oct 2020 2. Amazon: AWS IoT Device SDK v2 for Python. https://github.com/aws/aws-iotdevice-sdk-python-v2#aws-iot-device-sdk-v2-for-python. Accessed 18 Oct 2020 3. Bisong, E.: Google colaboratory. Building Machine Learning and Deep Learning Models on Google Cloud Platform, pp. 59–64 (2019) 4. Bozdal, M., Samie, M., Aslam, S., Jennions, I.K.: Evaluation of can bus security challenges. Sensors 20, 16–17 (2020) 5. Bradski, G., Kaebler, A.: Learning OpenCV: Computer vision with the OpenCV library, September 2008 6. Currie, R.: Hacking the can bus: basic manipulation of a modern automobile through can bus reverse engineering. SANS Institute (2017) 7. Du, J.: Understanding of object detection based on CNN family and yolo. J. Phys. Conf. Ser. 1004, 012029 (2018) 8. Eileen, R.: How to email text to an ATT cell phone. https://www.techwalla.com/ articles/how-to-email-text-to-an-att-cell-phone. Accessed 20 Oct 2020 9. Hui, J.: Map (mean average precision) for object detection. https://jonathan-hui. medium.com/map-mean-average-precision-for-object-detection-45c121a31173. Accessed 20 Oct 2020 10. Koo, Y., You, C., Kim, S.: OpenCL-Darknet: an OpenCL implementation for object detection. In: 2018 IEEE International Conference on Big Data and Smart Computing (BigComp), pp. 631–634 (2018) 11. Kuznetsova, A., et al.: The open images dataset V4: unified image classification, object detection, and visual relationship detection at scale. CoRR, abs/1811.00982 (2018) 12. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. CoRR, abs/1405.0312 (2014) 13. Miller, C., Valasek, C.: Adventures in automotive networks and control units. Def. Con. 21, 260–264 (2013) 14. Modasshir, M., Li, A.Q., Rekleitis, I.: Deep neural networks: a comparison on different computing platforms. In: 2018 15th Conference on Computer and Robot Vision (CRV), pp. 383–389 (2018) 15. 
Nelson, K.: Here’s exactly what happens to the body of a dog left in a hot car. Bark Post (2020). https://barkpost.com/discover/canine-heat-stress-dog-in-hotcar/. Accessed 20 Oct 2020 16. Netgear: How to choose the right smtp port (port 25, 587, 465, or 2525) (2019). https://kb.netgear.com/29396/What-is-the-difference-between-2-4-GHzand-5-GHz-wireless-frequencies. Accessed 20 Oct 2020 17. Netgear: What is the difference between 2.4 ghz and 5 ghz wireless frequencies? (2019). https://kb.netgear.com/29396/What-is-the-difference-between-2-4GHz-and-5-GHz-wireless-frequencies. Accessed 20 Oct 2020 18. NHTSA: Prevent hot car deaths: Heatstroke kills (2020). https://www.nhtsa.gov/ campaign/heatstroke. Accessed 20 Oct 2020
Intelligent Intra-vehicular Temperature Control
611
19. Null, J.: Pediatric vehicular heatstroke deaths. In: 47th Conference Broadcast Meteorology/5th Conference on Weather Warnings and Communications. AMS (2019) 20. Null, J.: Heatstroke deaths of children in vehicles (2020). www.noheatstroke.org/. Accessed 20 Oct 2020 21. Pes´e, M.D., Stacer, T., Andr´es Campos, C., Newberry, E., Chen, D., Shin, K.G.: LibreCAN: automated CAN message translator. In: Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, pp. 2283–2300 (2019) 22. Raspberry Pi: Raspberry pi 4 tech specs (2020). https://www.raspberrypi.org/ products/raspberry-pi-4-model-b/specifications/. Accessed 20 Oct 2020 23. Raspberry Pi: Raspberry pi camera module (2020). https://www.raspberrypi.org/ documentation/raspbian/applications/camera.md. Accessed 20 Oct 2020 24. Raspberry Pi: VNC (virtual network computing). https://www.raspberrypi.org/ documentation/remote-access/vnc/. Accessed 18 Oct 2020 25. Qian, D.Z.: Method for preventing children being left in car by using obd2 socket and smartphone, 1 October 2015. US Patent App. 14/230,019 26. Redmon, J.: YOLO: real-time object detection (2020). https://pjreddie.com/ darknet/yolo/. Accessed 20 Oct 2020 27. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016) 28. Redmon, J., Farhadi. A.: YOLOv3: an incremental improvement. CoRR, abs/1804.02767 (2018) 29. Riabov, V.: SMTP (Simple Mail Transfer Protocol), pp. 388–406, December 2007 30. Rosebrock, A.: OpenVINO, OpenCV, and Movidius NCS on the Raspberry Pi (2019). https://www.pyimagesearch.com/2019/04/08/openvino-opencv-andmovidius-ncs-on-the-raspberry-pi/. Accessed 20 Nov 2020 31. Rosebrock, A.: YOLO and tiny-YOLO obejct detection on the Raspberry Pi and Movidius NCS. 2020. 
https://www.pyimagesearch.com/2020/01/27/yoloand-tiny-yolo-object-detection-on-the-raspberry-pi-and-movidius-ncs/. Accessed 20 Oct 2020 32. Sayantini: Keras vs TensorFlow vs PyTorch: Comparison of the deep learning frameworks (2020). https://www.edureka.co/blog/keras-vs-tensorflow-vspytorch/. Accessed 20 Oct 2020 33. Smith, C.: Instrument cluster simulator for SocketCAN. https://github.com/ zombieCraig/ICSim. Accessed 18 Oct 2020 34. Smith, C.: The Car Hacker’s Handbook: A Guide for the Penetration Tester. No Starch Press, San Francisco (2016) 35. Soviany, P., Ionescu, R.T.: Optimizing the trade-off between single-stage and twostage deep object detectors using image difficulty prediction. In: 2018 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pp. 209–214 (2018) 36. Szeliski, R.: Computer Vision: Algorithms and Applications. Springer, London (2010). https://doi.org/10.1007/978-1-84882-935-0 37. Tayeb, S., Latifi, S., Kim, Y.: A survey on IoT communication and computation frameworks: an industrial perspective. In: 2017 IEEE 7th annual Computing and Communication Workshop and Conference (CCWC), pp. 1–6. IEEE (2017) 38. Tayeb, S., et al.: Securing the positioning signals of autonomous vehicles. In: 2017 IEEE International Conference on Big Data (Big Data), pp. 4522–4528. IEEE (2017)
612
D. Jacuinde-Alvarez et al.
39. Tayeb, S., Pirouz, M., Latifi, S.: A Raspberry-Pi prototype of smart transportation. In: 2017 25th International Conference on Systems Engineering (ICSEng), pp. 176– 182. IEEE (2017) 40. Ward, E.: Heat stroke in dogs (2020). https://vcahospitals.com/know-your-pet/ heat-stroke-in-dogs. Accessed 20 Oct 2020 41. Yang, Z., Xu, W., Wang, Z., He, X., Yang, F., Yin, Z.: Combining YOLOV3-tiny model with dropblock for tiny-face detection. In: 2019 IEEE 19th International Conference on Communication Technology (ICCT), pp. 1673–1677. IEEE (2019) 42. Yassein, M.B., Shatnawi, M.Q., Aljwarneh, S., Al-Hatmi, R.: Internet of things: survey and open issues of MQTT protocol. In: 2017 International Conference on Engineering & MIS (ICEMIS), pp. 1–6. IEEE (2017) 43. Yokotani, T., Sasaki, Y.: Comparison with HTTP and MQTT on required network resources for IoT. In: 2016 International Conference on Control, Electronics, Renewable Energy and Communications (ICCEREC), pp. 1–6 (2016)
Intelligent Control of a Semi-autonomous Assistive Vehicle
David Sanders1(B), Giles Tewkesbury1, Malik Haddad1, Ya Huang1, and Boriana Vatchova2
1 University of Portsmouth, Portsmouth PO1 3DJ, UK
[email protected]
2 Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria
Abstract. A control system for a powered wheelchair is described. The wheelchair is equipped with sensors to help a disabled user to steer their wheelchair. An innovative intelligent control scheme is presented. A model reference controller for veer regulation that can deal with variable operating conditions is presented. It is based on compensating the non-linear terms using an automatic adaptive scheme. The method specifically focuses on the design of a reliable veer controller capable of mitigating uncertainties such as slopes, bumps, hills, differences in wheels and tires, and changes to surfaces (for example one side more uneven than the other). The controller has been designed with a quasi-linear closed-loop behavior so that outer control loops, such as path-following, can be added later. A single powered wheelchair assistive agent was created to allow for future cooperation between wheelchair systems by sharing information. The work foresees the potential employment of semi-autonomous assistive agents within cooperative wheelchair systems.
Keywords: Autonomous · Assistive · Control · Vehicle · Disabled · Wheelchair
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 613–621, 2022. https://doi.org/10.1007/978-3-030-82193-7_40

1 Introduction

There has been an increasing trend in developing semi-autonomous assistive systems capable of easing the burden of control and driving [1–5]. Some intelligent and smart powered wheelchairs have evolved from research laboratory prototypes into commercial devices providing a number of functionalities. A small number of semi-autonomous powered wheelchairs with more or less enhanced capabilities have been developed in recent years [6–11]. These have generally come from academic and research institutions [8, 12, 13]. Some powered wheelchairs have used developing technology and control systems, and intelligent powered wheelchair systems have become more reliable and affordable [14–19]. However, a number of specific applications remain dependent on human wheelchair operators, since current intelligent control technology still lacks adequate design, reliability, robustness or cost effectiveness. This leads to a need for innovation starting from the design phase, as well as redefining the concepts and methodologies for effective assistance, guidance, control and interaction with the user [20].

With this aim in mind, this work presents an innovative powered wheelchair controller. Modularity was considered in the design because flexibility was important. Powered wheelchairs have to be adaptable to different users and their needs, including reshaping of structures to modify payload capacity and maneuvering characteristics, as well as adding or substituting sensors and devices “on the fly” for different users. An innovative intelligent control scheme is presented to cope with variable operating conditions, ranging from lack of maneuverability to lack of spatial awareness or inability to position precisely. Effective and redundant velocity configurations were implemented to guarantee motion capabilities in different environments. Such configurations also required an advanced control scheme to execute the commands from the disabled user and to assist them, while compensating for external disturbances and possible faults. The control methods apply machine-learning approaches, which have recently attracted interest in the community and are providing encouraging results. A single powered wheelchair assistive agent has been created to allow cooperation between wheelchair systems by sharing information. The work foresees the employment of semi-autonomous assistive agents as cooperative wheelchair systems.
2 The Wheelchair

A Bobcat II wheelchair was used, as shown in Fig. 1. It was fitted with simple and tough ultrasonic sensors [2, 4, 6]. The wheelchair had different operating modes: (a) controllers driven directly by joystick data, (b) sensors turned on and a computer adapting the course of the powered wheelchair using approaches recently published in the literature, and (c) sensors turned on and the computer modifying the course of the powered wheelchair using the expert system described in this paper. A set of rules was employed: (a) the user of the wheelchair stayed in overall control, (b) the expert system only altered a course when necessary, and (c) the controller needed to produce smooth and controlled turns.

Signals from the ultrasonic sensors could contain a lot of noise, which caused misreadings. These were filtered out to improve reliability. A volume in front of the powered wheelchair was divided into grids named “adjacent”, “intermediate” and “furthest” (Fig. 2). Detected objects were classified as “adjacent”, “intermediate” or “furthest” accordingly. Sensors were mounted on the wheelchair so that their ultrasonic beams overlapped. Figure 2 shows the grid created by a single sensor.

The flexible design allowed the hosting of various types of control systems and sensors. An intelligent microcontroller controlled the wheel motors and could take over control of the entire wheelchair if required. The hardware and software architecture was based on commercial off-the-shelf components.
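The filtering and range classification described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the paper does not say which filter was used, so a median over recent readings stands in for it, and the zone boundaries `ADJACENT_LIMIT_M` and `INTERMEDIATE_LIMIT_M` are hypothetical values.

```python
from statistics import median

ADJACENT_LIMIT_M = 1.0      # assumed zone boundary, not taken from the paper
INTERMEDIATE_LIMIT_M = 2.0  # assumed zone boundary, not taken from the paper


def filter_ranges(readings, window=5):
    """Median of the most recent readings; isolated misreadings are suppressed."""
    recent = readings[-window:]
    return median(recent)


def classify_range(distance_m):
    """Map a filtered range (metres) to the grid names used in the text."""
    if distance_m < ADJACENT_LIMIT_M:
        return "adjacent"
    if distance_m < INTERMEDIATE_LIMIT_M:
        return "intermediate"
    return "furthest"
```

With readings such as `[1.5, 1.6, 0.1, 1.5, 1.4]`, the single spurious value of 0.1 m does not change the median, so the object is still classified from the consistent readings around 1.5 m.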
Fig. 1. The Bobcat II Wheelchair used in the research.
[Figure 2 labels: sample beam pattern; zones Adjacent, Intermediate and Furthest; range markings in metres.]
Fig. 2. Ultrasonic transducer envelope showing the grid to classify ranges to objects.
3 Control

This section presents the model reference controller for veer regulation, based on compensating the non-linear terms with an automatic adaptive scheme. The development process was inspired by the methodology described in Ioannou and Sun (1995) [21] and, more recently, Bibuli et al. [20]. This work extended that approach to non-linear veer dynamics. The method specifically focused on designing a reliable veer controller capable of mitigating model uncertainties and providing a quasi-linear closed-loop behavior, so that outer control loops such as path-following or other guidance schemes could be designed subsequently.

3.1 Modelling

The problem of developing an adaptive controller arises from the dynamic model of the wheelchair, which is corrupted by various uncertainties (size and shape of the human driver, different mechanical assists, mobility devices, life support, assistive, sensor and control systems used by each wheelchair driver). The current model was developed at the University of Portsmouth [2, 4]. A combined modeling/identification procedure led to the following form of the dynamics:

$$m\,\dot{u} = k_u\,u\,|u| + c_{vr}\,v\,|r| + b_u\,f_u \qquad (1)$$

$$m\,\dot{v} = k_v\,v\,|v| + c_{ur}\,u\,|r| + b_v\,f_v \qquad (2)$$

$$I_r\,\dot{r} = k_r\,r\,|r| + c_{uv}\,u\,v + b_r\,\tau \qquad (3)$$

where $k_x$ is the friction or damping term, $c_{xx}$ is the Coriolis acceleration term, $b_x$ is the input coefficient, and $f$ and $\tau$ are the input force and the input torque along the veer axis. The variables are: $u$ the surge speed, $v$ the sway speed, $r$ the veer-rate, $f_u$ the input surge force, $f_v$ the input sway force, $m$ the mass of the powered wheelchair and $I_r$ the moment of inertia about the veer axis. Because of the different sizes and weights of the various drivers and their equipment, there was significant uncertainty about the dynamics. For that reason, a number of consolidated control methods were not suitable.

3.2 Controller Design

Considering Eq. (3), which describes the veer motion behaviour, the founding idea was to design a suitable law for the input torque $\tau$ such that the closed-loop system behaved as a linear system. A reference model could then be defined to provide a virtual behaviour for the closed-loop system. To achieve model tracking, and thus provide reliable closed-loop veer-rate tracking, the generated torque law consisted of two main components: one to compensate the non-linear dynamics and the other to track the desired reference veer-rate input. Both had a time-varying behaviour.
The objective of the control system was to define proper time-varying components for stability and tracking of the veer-rate controller. The goal was to design a suitable input control law to obtain a desired bounded and stable linear closed-loop behavior:

$$\dot{r}_m = -a_m\,r_m + b_m\,r^* \qquad (4)$$

where $r^*$ is the desired veer-rate reference, $r_m$ is the desired veer-rate response, $a_m > 0$ is the stable linear coefficient and $b_m$ is the input coefficient. The system described by Eq. (4) is the ‘reference model’. The veer-rate signal is then able to track the state $r_m$ of the reference model by defining the control law for $\tau$ as follows:

$$\tau = -\gamma(t)\,r + \lambda(t)\,r^* \qquad (5)$$

with $\gamma(t)$ and $\lambda(t)$ being the online-adapted coefficients that compensate the dynamics. The form of the adaptive coefficients was designed following the procedure of Ioannou and Sun [21], giving the following adaptation formulas:

$$\dot{\gamma} = -\eta_\gamma\,e\,r\,\operatorname{sgn}(b_r) \qquad (6)$$

$$\dot{\lambda} = -\eta_\lambda\,e\,r^*\,\operatorname{sgn}(b_r) \qquad (7)$$

where $\eta_\gamma$ and $\eta_\lambda$ are gain factors to tune the adaptation rate, $\operatorname{sgn}(\cdot)$ is the sign function and $e$ is the tracking error variable $e = r - r_m$. Figure 3 shows the model reference state and the actual veer-rate of the powered wheelchair with respect to a desired value. Although the $r(t)$ signal is jerky (caused by environmental disturbance, absence of filtering, etc.), it can be seen that it tracks the reference $r_m(t)$.
Fig. 3. Step response of the adaptive veer-rate controller during experiments.
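The model reference scheme of Sect. 3.2 can be explored with a short simulation. The sketch below is only an illustration under stated assumptions: all plant parameters, speeds and gains are hypothetical, forward-Euler integration is used, and surge and sway speeds are held constant. One point to note: for the error definition e = r − r_m and the torque law τ = −γr + λr*, the standard Lyapunov argument gives a positive sign on the γ update, and that sign is what the sketch uses; treat it as an assumption of the sketch rather than a statement about the authors' implementation.

```python
def sign(x):
    """Sign function returning +1.0 or -1.0 (zero is treated as positive)."""
    return 1.0 if x >= 0 else -1.0


def simulate_veer_mrac(r_star=0.5, t_end=30.0, dt=0.001):
    """Simulate veer-rate tracking of a first-order reference model."""
    # Hypothetical plant: I_r * dr/dt = k_r*r*|r| + c_uv*u*v + b_r*tau
    I_r, k_r, c_uv, b_r = 1.0, -2.0, 0.3, 1.0
    u, v = 0.5, 0.05                 # assumed constant surge and sway speeds
    # Reference model: dr_m/dt = -a_m*r_m + b_m*r_star (unit steady-state gain)
    a_m, b_m = 2.0, 2.0
    eta_gamma, eta_lambda = 2.0, 2.0  # adaptation gains (tuning assumptions)

    r = r_m = 0.0
    gamma = lam = 0.0
    for _ in range(int(t_end / dt)):
        e = r - r_m                                   # tracking error
        tau = -gamma * r + lam * r_star               # adaptive torque law
        r_dot = (k_r * r * abs(r) + c_uv * u * v + b_r * tau) / I_r
        r_m_dot = -a_m * r_m + b_m * r_star
        gamma_dot = eta_gamma * e * r * sign(b_r)     # sign assumed, see lead-in
        lam_dot = -eta_lambda * e * r_star * sign(b_r)
        r += r_dot * dt
        r_m += r_m_dot * dt
        gamma += gamma_dot * dt
        lam += lam_dot * dt
    return r, r_m, gamma, lam
```

Calling `simulate_veer_mrac()` returns the final veer-rate, the reference model state and the two adapted coefficients, so the tracking error at the end of a step response can be inspected for different gains.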
Once the veer-rate controller had been tuned, an external loop for heading regulation could be implemented; in this case, a simple Proportional-Derivative (PD) scheme was implemented. The Integral term in the heading control scheme was omitted to eliminate long oscillatory phases that would have extended the tuning phase. The behaviour of the dual-loop heading control is shown in Fig. 4.
Fig. 4. Heading control response experiment - desired and actual heading.
A piece-wise constant orientation was commanded to the powered wheelchair. A small drift from the desired value can be observed; it was caused by both the environmental disturbance and the lack of an integral term in the heading control.

3.3 Path-Following

A simple higher-level path-following module was written, based on the path-following guidance system described in Bibuli et al. [22], where a Lyapunov-based technique was employed to guarantee convergence and robustness of the system. The basic principle was to simulate setting the joystick to drive forward and then to use a veer-rate signal to compensate for deviations and keep the powered wheelchair on the desired straight-line path. Since the guidance system was decoupled from the low-level controller, the integration of the guidance module was relatively straightforward. A set of results where the powered wheelchair was required to autonomously track a straight line is shown in Fig. 5. Performance was satisfactory, especially considering that the operating area included sloping ground that affected the direction of the powered wheelchair.
Fig. 5. Path-following response experiment along a simulated straight line.
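The outer loops described in this section (a PD heading loop with no integral term, wrapped by straight-line following) can be sketched as a small kinematic simulation. This is an illustration only: the lookahead steering rule is a simple stand-in and not the Lyapunov-based guidance of [22], the inner veer-rate loop is approximated by a first-order model with assumed coefficients `A_M` and `B_M`, and all gains, speeds and the lookahead distance are hypothetical.

```python
import math

A_M, B_M = 2.0, 2.0  # assumed inner closed loop: dr/dt = -A_M*r + B_M*r_ref


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def heading_pd(psi_des, psi, r, kp=1.5, kd=0.4):
    """PD heading law returning a veer-rate reference (no integral term)."""
    return kp * wrap_angle(psi_des - psi) - kd * r


def desired_heading(cross_track_error, lookahead=1.0):
    """Steer back towards a line along the x-axis (lookahead is a tuning guess)."""
    return -math.atan2(cross_track_error, lookahead)


def simulate_line_following(y0=0.5, speed=0.5, t_end=30.0, dt=0.01):
    """Drive forward at constant speed and correct the course with veer-rate."""
    x, y, psi, r = 0.0, y0, 0.0, 0.0
    for _ in range(int(t_end / dt)):
        r_ref = heading_pd(desired_heading(y), psi, r)
        r += (-A_M * r + B_M * r_ref) * dt   # inner veer-rate loop
        psi = wrap_angle(psi + r * dt)       # heading integrates veer-rate
        x += speed * math.cos(psi) * dt
        y += speed * math.sin(psi) * dt
    return x, y, psi
```

Starting 0.5 m off the line, `simulate_line_following()` returns the final position and heading, so the residual cross-track error can be checked for different gains. Because the guidance only produces a heading reference, it can be swapped for another rule without touching the inner loop, which mirrors the decoupling noted in the text.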
4 Conclusions and Future Work

A control system for a powered wheelchair was presented. The wheelchair was equipped with ultrasonic sensors to help a wheelchair user to steer their wheelchair. An innovative intelligent control scheme was presented that could deal with variable operating conditions. The design and experimental evaluation of a model reference controller for veer regulation was presented, based on compensating the non-linear terms using an automatic adaptive scheme. The method specifically focused on the design of a reliable veer controller capable of mitigating uncertainties such as slopes, bumps and hills. The controller was designed with a quasi-linear closed-loop behavior so that outer control loops, such as heading control or path-following, could be added later. A single powered wheelchair assistive agent was created to allow for future cooperation between wheelchair systems by sharing information. The work foresees the potential employment of semi-autonomous assistive agents as cooperative wheelchair systems.

Testing will now move from simulation to real-world trials, and future work will investigate model-based prediction for navigation [23], route optimization [24], control [25], voter-based control [26], collision avoidance [27] and the perception of semi-autonomous intelligent vehicles such as Smart Powered Wheelchairs [28].

Acknowledgment. Research was supported by the EPSRC.
References
1. Sanders, D.A., Langner, M., Tewkesbury, G.E.: Improving wheelchair-driving using a sensor system to control wheelchair-veer and variable-switches as an alternative to digital-switches or joysticks. Ind. Robot Int. J. 32(2), 157–167 (2010)
2. Sanders, D.A.: Using self-reliance factors to decide how to share control between human powered wheelchair drivers and ultrasonic sensors. IEEE Trans. Neural Syst. Rehabil. Eng. 25(8), 1221–1229 (2016)
3. Sanders, D.A., Bausch, N.: Improving steering of a powered wheelchair using an expert system to interpret hand tremor. In: Liu, H., Kubota, N., Zhu, X., Dillmann, R., Zhou, D. (eds.) ICIRA 2015. LNCS (LNAI), vol. 9245, pp. 460–471. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-22876-1_39
4. Sanders, D.A.: Non-model-based control of a wheeled vehicle pulling two trailers to provide early powered mobility and driving experiences. IEEE Trans. Neural Syst. Rehabil. Eng. 26(1), 96–104 (2018)
5. Sanders, D.A., Gegov, A., Haddad, M., Ikwan, F., Wiltshire, D., Tan, Y.C.: A rule-based expert system to decide on direction and speed of a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2018. AISC, vol. 868, pp. 822–838. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01054-6_57
6. Sanders, D.A., Haddad, M., Tewkesbury, G.E., Thabet, M., Omoarebun, P., Barker, T.: Simple expert system for intelligent control and HCI for a wheelchair fitted with ultrasonic sensors. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 211–216. IEEE, August 2020
7. Sanders, D.A., Gegov, A., Haddad, M., Ikwan, F., Wiltshire, D., Tan, Y.C.: A rule-based expert system to decide on direction and speed of a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2018. AISC, vol. 868, pp. 822–838. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01054-6_57
8. Haddad, M., et al.: Intelligent control of the steering for a powered wheelchair using a microcomputer. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1252, pp. 594–603. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-55190-2_44
9. Haddad, M., et al.: Use of the analytical hierarchy process to determine the steering direction for a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1252, pp. 617–630. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_46
10. Haddad, M.J., Sanders, D.A.: Selecting a best compromise direction for a powered wheelchair using PROMETHEE. IEEE Trans. Neural Syst. Rehabil. Eng. 27(2), 228–235 (2019)
11. Haddad, M., Sanders, D., Ikwan, F., Thabet, M., Langner, M., Gegov, A.: Intelligent HMI and control for steering a powered wheelchair using a Raspberry Pi microcomputer. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), Bulgaria, pp. 223–228. IEEE (2020)
12. Sanders, D., Haddad, M., Tewkesbury, G., Bausch, N., Rogers, I., Huang, Y.: Analysis of reaction times and time-delays introduced into an intelligent HCI for a smart wheelchair. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), Bulgaria, pp. 217–222. IEEE (2020)
13. Sanders, D., et al.: Introducing time-delays to analyze driver reaction times when using a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1252, pp. 559–570. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_41
14. Haddad, M., et al.: Intelligent system to analyze data about powered wheelchair drivers. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1252, pp. 584–593. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_43
15. Haddad, M., Sanders, D., Langner, M., Omoarebun, P., Thabet, M., Gegov, A.: Initial results from using an intelligent system to analyse powered wheelchair users’ data. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), Bulgaria, pp. 241–245. IEEE (2020)
16. Haddad, M., et al.: Steering a powered wheelchair using a camera module and image processing algorithms. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications. IntelliSys 2021. Advances in Intelligent Systems and Computing (2021). Accepted and in Press
17. Haddad, M., et al.: Steering a powered wheelchair using a camera module and image processing algorithms. In: 2021 32nd IEEE Intelligent Vehicles Symposium, Japan. IEEE (2021). Accepted and in Press
18. Haddad, M., et al.: Novel approach to steer a powered wheelchair using image processing algorithm and Raspberry Pi. In: 2021 32nd IEEE Intelligent Vehicles Symposium, Japan. IEEE (2021). Accepted and in Press
19. Haddad, M., et al.: A new system to drive a powered wheelchair using an image processing algorithm. In: 24th IEEE International Conference on Intelligent Transportation – ITSC 2021, USA. IEEE (2021). Accepted and in Press
20. Bibuli, M., et al.: Evolution of autonomous surface vehicles. In: 19th International Conference on Computer and IT Applications in the Maritime Industries, Pontignano, Hamburg, 17–19 August 2020, pp. 26–37 (2020). ISBN: 978-3-89220-717-7
21. Ioannou, P.A.: Robust Adaptive Control. Prentice-Hall (1995)
22. Bibuli, M., Bruzzone, G., Caccia, M., Lapierre, L.: Path following algorithms and experiments for an unmanned surface vehicle. J. Field Robot. 26(8), 669–688 (2009)
23. Sanders, D., et al.: Model-based prediction for navigation assistance using a combination of sensors. In: 24th IEEE International Conference on Intelligent Transportation – ITSC 2021, USA. IEEE (2021). Accepted and in Press
24. Sanders, D., et al.: Route optimization using forecasting, wheelchair modelling and path planning. In: 24th IEEE International Conference on Intelligent Transportation – ITSC 2021, USA. IEEE (2021). Accepted and in Press
25. Sanders, D., et al.: Control of a semi-autonomous powered wheelchair. In: 2021 32nd IEEE Intelligent Vehicles Symposium, Japan. IEEE (2021). Accepted and in Press
26. Sanders, D., et al.: Voter based control for situation awareness and obstacle avoidance. In: 24th IEEE International Conference on Intelligent Transportation – ITSC 2021, USA. IEEE (2021). Accepted and in Press
27. Sanders, D., et al.: An assistance system for collision avoidance. In: 2021 32nd IEEE Intelligent Vehicles Symposium, Japan. IEEE (2021). Accepted and in Press
28. Sanders, D., et al.: The perception of semi-autonomous intelligent vehicles such as Smart Powered Wheelchairs. In: 2021 32nd IEEE Intelligent Vehicles Symposium, Japan. IEEE (2021). Accepted and in Press
One Shot Learning Approach to Identify Drivers
Malik Haddad1(B), David Sanders1, Martin Langner1,2, and Giles Tewkesbury1
1 University of Portsmouth, Portsmouth PO1 3DJ, UK
[email protected]
2 Chailey Heritage Foundation, North Chailey BN8 4EF, UK
Abstract. This paper presents a new approach to identify users of shared powered mobility platforms using a One Shot learning algorithm. An electronic circuit is created using a Raspberry Pi and a camera module. The Python programming language is used to create a program that controls the camera module and conducts One Shot learning to identify users. A user interface containing Start and Stop buttons is created to operate the camera module. If Start is pressed, the Python program triggers the camera to take a snapshot, identifies the face in that snapshot and performs facial recognition to compare that face with a database of user facial images. If a match is found, a message box pops up showing the user name. Once a user is identified, their settings and preferences can be downloaded to the shared powered mobility platform. If the face in the snapshot does not match any of the user facial images in the database, a message pops up showing that the user was not found in the database. Practical testing showed that the system behaved satisfactorily and successfully detected users. One Shot learning allowed new users to be added to the database without the need to retrain the whole system.
Keywords: Camera · Disabled · One Shot · Learning · Python · Wheelchair
1 Introduction

The work presented in this paper describes a new approach to identify users of shared powered wheelchairs using a One Shot learning algorithm. A system was created using a camera module and a Raspberry Pi. The work is part of broader research at Chailey Heritage Foundation and the University of Portsmouth funded by the Engineering and Physical Sciences Research Council (EPSRC) [1]. The main aims of this research are to use Artificial Intelligence (AI) techniques to increase mobility and improve the quality of life of disabled powered wheelchair users by improving their self-reliance and confidence.

Approximately 15% of the world population have some form of disability, and some of them have been diagnosed with significant mobility problems [2]. Due to modern medical achievements, ageing and the spread of long-term health problems, these numbers have been increasing [2, 3]. People with disabilities often have a lower quality of life than others [4].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 622–629, 2022. https://doi.org/10.1007/978-3-030-82193-7_41
Powered mobility is becoming more important to people with disabilities [5]. Powered mobility includes assistive devices such as powered wheelchairs or scooters. During the past three decades, researchers have developed systems to enhance mobility and improve the quality of life of disabled users. Sanders et al. [6] used sensors to control powered wheelchair veer and improve driving. Langner [7] used a rotating ultrasonic transducer to produce a Scanning Collision Avoidance Device (SCAD). Sanders and Bausch [9] developed an expert system to analyze users’ hand tremor and improve steering. Sanders [8] studied self-reliance factors to create a system that shared control between ultrasonic sensors and wheelchair users. Sanders et al. [10, 11] considered rule-based systems to provide steering routes for wheelchairs. Haddad et al. [10, 12, 13] used ultrasonic sensor array readings as inputs to Multi-Criteria Decision Making (MCDM) deciders and mixed the suggested output from the deciders with the user’s desired directions to provide collision-free routes for wheelchairs. Haddad and Sanders [14] applied an MCDM method, PROMETHEE II, to suggest a safe route. Haddad et al. [12, 15] utilized microcomputers to create intelligent Human Machine Interfaces (HMI) used to safely steer powered wheelchairs. Haddad and Sanders [2, 16] created a deep learning neural network to provide a safe steering direction for a powered wheelchair. Many researchers have created systems to study and improve powered wheelchair driving [17–20]. Haddad et al. [21–23] used cameras and microcomputers to translate drivers’ hand movements into digital commands used to operate powered wheelchairs.

Interviews conducted by the authors with occupational therapists, helpers and carers at Chailey Heritage Foundation/School revealed that many students used the same powered mobility platform to practice driving. Each driver had their own settings and preferences. Changing user settings required time and effort. Helpers/carers often struggled with changing wheelchair settings when changing users. Different users often required different interfaces, sensors and input devices.

This paper presents a new approach to identify users of shared powered mobility platforms. Once a user has been identified, their input device is triggered and the user settings are installed. The new approach is described in the next section. Section 3 presents some results and discussion. Conclusions and future work are presented in Sect. 4.
2 The New Approach
A new approach to identify users of shared powered mobility platforms is presented. The new approach is based on a One Shot learning algorithm. One Shot learning is often used in computer vision for object recognition. It aims to learn information about object features from one, or a small number of, training images. By learning object features, a One Shot learning algorithm can calculate the probability of the presence of an object in an image. If the probability is higher than a predefined threshold, the algorithm reports the presence of that object in the image. A circuit connected a Raspberry Pi camera to a Raspberry Pi. The camera was directed towards the powered wheelchair users.
M. Haddad et al.
A Python program was created and installed onto the Raspberry Pi. The Python program controlled the function of the camera, triggered the camera to take snapshots, conducted facial identification and compared the snapshot to a database of user facial images. The new approach used the Python face recognition library “face_recognition” to compare the identified face in the snapshot to user facial images. The block diagram of the new approach is shown in Fig. 1. The Python program is shown in Fig. 2.
Fig. 1. Block-diagram of the new approach.
One Shot learning used a similarity function that took two images as input and output the degree of difference between them. If the two images were of the same individual, the function generated a small number. If the two images were of different individuals, the function generated a large number. A threshold could be used to set the required degree of similarity. To use the similarity function in image recognition, a new picture was compared to the images in the database. If the new image was of a person in the database, the function generated a small number when compared to that person’s image and large numbers when compared to the other images in the database. If the person was not in the database, the function generated large numbers for all images in the database, which implied that the image was not of any person in the database. That solved the problem of adding new users to the database without the need to retrain the whole system [24].
A simple User Interface (UI) was created containing two buttons, Start and Stop, as shown in Fig. 3.
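The similarity-function comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the program of Fig. 2: the four-dimensional feature vectors, the user names, the Euclidean distance and the threshold value of 0.6 are all hypothetical stand-ins for whatever features and threshold a real face recognition library would supply.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors
    (plays the role of the 'degree of difference')."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(probe, gallery, threshold):
    """Return the closest enrolled name, or None if even the closest
    reference is further away than the threshold."""
    best_name, best_dist = None, float("inf")
    for name, reference in gallery.items():
        d = distance(probe, reference)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= threshold else None


# Hypothetical feature vectors: one reference image per enrolled user.
gallery = {
    "user_a": [0.10, 0.80, 0.30, 0.50],
    "user_b": [0.90, 0.20, 0.70, 0.10],
}
THRESHOLD = 0.6  # assumed value; it would need tuning on real data

print(identify([0.12, 0.79, 0.33, 0.48], gallery, THRESHOLD))  # user_a
print(identify([0.50, 0.50, 0.95, 0.95], gallery, THRESHOLD))  # None

# Enrolling a new user is a single insertion; nothing is retrained.
gallery["user_c"] = [0.50, 0.50, 0.95, 0.95]
print(identify([0.52, 0.49, 0.93, 0.96], gallery, THRESHOLD))  # user_c
```

The second probe is rejected because it is far from every stored reference; once a reference for that person is enrolled, a similar probe is accepted without any change to the existing entries.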
Fig. 2. Python program used in the new approach.
Fig. 3. Simple user interface used to control the python program.
The simple UI provided straightforward operation for powered mobility platforms and provided a suitable fit between desired commands and driver capabilities [25]. The UI provided a friendly and uncluttered design. When the Start button was clicked, the Python program triggered the Raspberry Pi camera to take a snapshot. The program then detected the face in that snapshot and conducted facial recognition to compare that face with the database of potential users’ facial images. If a match was found, a message box popped up showing the user name, as shown in Fig. 4.
Fig. 4. UI showing a message box indicating the name of the identified user.
Once a user was identified, their settings and preferences could be downloaded to the shared powered mobility platform. If the face in the snapshot did not match any of the facial images in the database, a message popped up showing that the user was not found in the database, as shown in Fig. 5. To exit the program, the Stop button was clicked, which destroyed the UI.
Fig. 5. UI showing a message box indicating user not found in database.
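The Start-button behaviour described above (snapshot, recognition, then either a name message with settings or a "not found" message) can be sketched as a single function. This is only an illustrative outline: the settings fields, the user names and the message wording are hypothetical, and the camera and recogniser are passed in as plain callables so that the sketch runs without any hardware.

```python
# Hypothetical per-user settings; the field names are illustrative only.
USER_SETTINGS = {
    "user_a": {"input_device": "joystick", "max_speed": 3},
    "user_b": {"input_device": "head switch", "max_speed": 2},
}


def on_start(take_snapshot, recognise):
    """Handle one press of Start: snapshot -> recognise -> message and settings."""
    image = take_snapshot()
    name = recognise(image)  # expected to return a user name or None
    if name is None:
        return "User not found in database", None
    return f"User identified: {name}", USER_SETTINGS.get(name)


# Stand-ins for the camera and the recogniser so the sketch runs anywhere.
message, settings = on_start(lambda: b"snapshot", lambda img: "user_b")
print(message, settings)
message, settings = on_start(lambda: b"snapshot", lambda img: None)
print(message, settings)
```

Keeping the camera and recogniser behind callables separates the decision logic from the hardware, which makes the flow easy to exercise before a camera is attached.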
3 Discussion and Results
The new approach described in this paper used a One Shot learning algorithm to provide a more efficient outcome for facial recognition than a typical deep learning algorithm using a convolutional neural network, especially when a small learning set (a small number of images for each individual) in a database was considered. An additional advantage was that One Shot learning did not require retraining of the whole system when adding a new user to the database. The new approach was tested and successfully identified users in a database and did not match incorrect users in the database. Further testing will be conducted to identify a suitable location for the camera and the effect of ambient lighting on the accuracy of the new approach. New users could be digitally added to the system by uploading their facial image to the database. When a user was identified, their input device was triggered and the user settings were installed. Clinical trials will be conducted to assess the effectiveness of the new approach. Reducing cost was one of the reasons behind this research. The new approach provided reliable results, reduced time taken in setting up, improved user autonomy and reduced the need and cost for carers.
4 Conclusions and Future Work
This paper presented a new approach to identify users of shared powered mobility platforms using a One Shot learning algorithm. An electronic circuit including a camera module and a Raspberry Pi was created. The Python programming language was used to create a program to control the function of the camera and conduct facial recognition. The new approach used a simple, friendly User Interface and reduced the amount of effort required by the helpers/carers to adjust the shared powered mobility platform settings. The authors are currently investigating other mathematically inexpensive AI techniques applied to powered mobility problems as part of broader research to use artificial intelligence to share control of a powered wheelchair between a wheelchair user and an intelligent sensor system [12]. Future work will consider creating new programs to install the identified user settings on the shared powered mobility platform. Future work will also investigate using AI algorithms and Artificial Neural Networks to improve driving capabilities for powered wheelchair users, and will consider using a camera module and image processing algorithms to capture user movement used to control a powered wheelchair.
Acknowledgment. This research was supported by an EPSRC EP/S005927/1 project titled “Using artificial intelligence to share control of a powered-wheelchair between a wheelchair user and an intelligent sensor system”. Investigators: Sanders, DA and Gegov, AE. Senior Researchers: Haddad, MJ and Langner, MC.
References
1. Sanders, D., Gegov, A.: Using artificial intelligence to share control of a powered-wheelchair between a wheelchair user and an intelligent sensor system. EPSRC Project, 2019–2022 (2018)
2. Haddad, M.J., Sanders, D.A.: Deep Learning architecture to assist with steering a powered wheelchair. IEEE Trans. Neural Syst. Rehabil. Eng. 28(12), 2987–2994 (2020)
3. Krops, L.A., Hols, D.H., Folkertsma, N., Dijkstra, P.U., Geertzen, J.H., Dekker, R.: Requirements on a community-based intervention for stimulating physical activity in physically disabled people: a focus group study amongst experts. Disabil. Rehabil. 40(20), 2400–2407 (2018)
4. Bos, I., Wynia, K., Almansa, J., Drost, G., Kremer, B., Kuks, J.: The prevalence and severity of disease-related disabilities and their impact on quality of life in neuromuscular diseases. Disabil. Rehabil. 41(14), 1676–1681 (2019)
5. Frank, A.O., De Souza, L.H.: Clinical features of children and adults with a muscular dystrophy using powered indoor/outdoor wheelchairs: disease features, comorbidities and complications of disability. Disabil. Rehabil. 40(9), 1007–1013 (2018)
6. Sanders, D.A., Langner, M., Tewkesbury, G.E.: Improving wheelchair-driving using a sensor system to control wheelchair-veer and variable-switches as an alternative to digital-switches or joysticks. Ind. Robot. Int. J. 32(2), 157–167 (2010)
7. Langner, M.: Effort reduction and collision avoidance for powered wheelchairs: SCAD assistive mobility system. Doctoral dissertation, University of Portsmouth (2012)
8. Sanders, D.A.: Using self-reliance factors to decide how to share control between human powered wheelchair drivers and ultrasonic sensors. IEEE Trans. Neural Syst. Rehabil. Eng. 25(8), 1221–1229 (2016)
9. Sanders, D.A., Bausch, N.: Improving steering of a powered wheelchair using an expert system to interpret hand tremor. In: Liu, H., Kubota, N., Zhu, X., Dillmann, R., Zhou, D. (eds.) ICIRA 2015. LNCS (LNAI), vol. 9245, pp. 460–471. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-22876-1_39
10. Sanders, D.A., Gegov, A., Haddad, M., Ikwan, F., Wiltshire, D., Tan, Y.C.: A rule-based expert system to decide on direction and speed of a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications. IntelliSys 2018. AISC, vol. 868, pp. 822–832. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01054-6_57
11. Sanders, D.A., Haddad, M., Tewkesbury, G.E., Thabet, M., Omoarebun, P., Barker, T.: Simple expert system for intelligent control and HCI for a wheelchair fitted with ultrasonic sensors. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 211–216. IEEE, August 2020
12. Haddad, M., et al.: Intelligent control of the steering for a powered wheelchair using a microcomputer. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications. IntelliSys 2020. AISC, vol. 1252, pp. 594–603. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_44
13. Haddad, M., et al.: Use of the analytical hierarchy process to determine the steering direction for a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications. IntelliSys 2020. AISC, vol. 1252, pp. 617–630. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_46
14. Haddad, M.J., Sanders, D.A.: Selecting a best compromise direction for a powered wheelchair using PROMETHEE. IEEE Trans. Neural Syst. Rehabil. Eng. 27(2), 228–235 (2019)
15. Haddad, M., Sanders, D., Ikwan, F., Thabet, M., Langner, M., Gegov, A.: Intelligent HMI and control for steering a powered wheelchair using a Raspberry Pi microcomputer. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 223–228. IEEE, Bulgaria (2020)
16. Haddad, M., Sanders, D., Tewkesbury, G., Langner, M.: A novel collision avoidance system for steering a powered wheelchair using deep learning architecture. In: 24th IEEE International Conference on Intelligent Transportation - ITSC2021. IEEE, USA (2021, submitted)
17. Haddad, M., et al.: Intelligent system to analyze data about powered wheelchair drivers. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications. IntelliSys 2020. AISC, vol. 1252, pp. 584–593. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-55190-2_43
18. Haddad, M., Sanders, D., Langner, M., Omoarebun, P., Thabet, M., Gegov, A.: Initial results from using an intelligent system to analyse powered wheelchair users’ data. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 241–245. IEEE, Bulgaria (2020)
19. Sanders, D., Haddad, M., Tewkesbury, G., Bausch, N., Rogers, I., Huang, Y.: Analysis of reaction times and time-delays introduced into an intelligent HCI for a smart wheelchair. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 217–222. IEEE, Bulgaria (2020)
20. Sanders, D., et al.: Introducing time-delays to analyze driver reaction times when using a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) Intelligent Systems and Applications. IntelliSys 2020. AISC, vol. 1252, pp. 559–570. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-55190-2_41
21. Haddad, M., Sanders, D., Langner, M., Tewkesbury, G.: Novel approach to steer a powered wheelchair using image processing algorithm and Raspberry Pi. In: 32nd IEEE Intelligent Vehicles Symposium, 2021. IEEE, Japan (2021, submitted)
22. Haddad, M., Sanders, D., Tewkesbury, G., Langner, M.: Using open source computer vision algorithms to drive a powered wheelchair. In: 24th IEEE International Conference on Intelligent Transportation - ITSC2021. IEEE, USA (2021, submitted)
23. Haddad, M., Sanders, D., Langner, M., Tewkesbury, G.: Steering a powered wheelchair using a camera module and python imaging library. In: 24th IEEE International Conference on Intelligent Transportation - ITSC2021. IEEE, USA (2021, submitted)
24. Fei-Fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 594–611 (2006)
25. Lewis, C.: Simplicity in cognitive assistive technology: a framework and agenda for research. Univ. Access Inf. Soc. 5(4), 351–361 (2007)
Facial Recognition Software for Identification of Powered Wheelchair Users
Giles Tewkesbury, Samuel Lifton, Malik Haddad, David Sanders, and Alex Gegov
University of Portsmouth, Portsmouth PO1 3DJ, UK
{Giles.tewkesbury,david.sanders}@port.ac.uk
Abstract. The research presented in this paper investigates the use of facial recognition software as a potential system to identify powered wheelchair users. Facial recognition offers advantages over other biometric systems where wheelchair users have disabilities. Facial recognition systems scan an image or video feed for a face and compare the detected face to previously detected data. This paper reviews the software development kits and libraries available for creating such systems and discusses the technologies chosen to create a prototype facial recognition system. The new prototype system was trained with 262 identification pictures, and confidence ratings were produced by the system for video feeds from twelve users. The results from the trials and the variance in confidence ratings are discussed with respect to gender, the presence of glasses and makeup. The results demonstrated the system to be 95% efficient in its ability to identify users.
Keywords: Face recognition · User identification · Camera · Wheelchair · SDK
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 630–639, 2022. https://doi.org/10.1007/978-3-030-82193-7_42
1 Introduction
The work presented in this paper describes the results from the creation of a facial recognition system to identify users of powered wheelchairs. The identification system was required to identify a user from a pool of 262 identification (ID) pictures. The input to the system was via a video camera. The system returned a confidence value for each match. Variations in confidence values are discussed for test case scenarios based on gender, the presence or absence of glasses and makeup. The work described here is part of broader research carried out by the authors at Chailey Heritage Foundation and the University of Portsmouth, funded by the Engineering and Physical Sciences Research Council (EPSRC) [1]. The aims of this research are to use AI techniques to improve the quality of life and to increase the mobility of disabled powered wheelchair users, providing improved self-reliance and confidence.
Studies have revealed that approximately 15% of the world population had some sort of disability, and that some of them were diagnosed with significant mobility problems [2, 3]. People with disabilities often suffer from lower quality of life than others [4]. Powered mobility often includes assistive devices such as powered wheelchairs or scooters and is becoming more important to people with disabilities [5]. Researchers have developed systems to enhance mobility and improve the quality of life of disabled users through the use of sensors to control veer [6], scanning ultrasonic sensors for collision avoidance [7] and an expert system to analyse users’ hand tremor and improve steering [8]. Self-reliance factors have been studied to create a system that shared control between ultrasonic sensors and wheelchair users. Sanders et al. [9–11] considered rule-based systems to provide steering routes for wheelchairs. Ultrasonic sensor arrays have been used as inputs to Multi-Criteria Decision Making deciders combined with user desired directions to provide collision-free routes for wheelchairs [12–15]. Intelligent Human Machine Interfaces [16, 17] and a deep learning neural network [2] have been created to provide a safe steering direction for a powered wheelchair. Tewkesbury et al. applied high level task programming methodologies to the programming of powered wheelchairs [18]. Many researchers have created systems to study and improve powered wheelchair driving [19–21]. Haddad et al. [22–24] used cameras and microcomputers to translate drivers’ hand movements to digital commands used to operate powered wheelchairs.
Interviews conducted by the authors with occupational therapists, helpers and carers at Chailey Heritage Foundation/School showed that many students used the same powered platform to practice driving a powered wheelchair. Each powered wheelchair user had their own settings and preferences. Changing user settings often required time and effort. Helpers/carers often struggled with changing wheelchair settings when changing users. Different users often required different interfaces, sensors and input devices. Identifying users of a powered wheelchair from a video stream could therefore be used to automatically configure wheelchair settings.
This paper presents an overview of the leading technologies available for facial recognition at the time of the study.
The selection process of the technology for use in this research is covered in Sect. 2, and the results from user trials of the new system are presented in Sect. 3 with an analysis of the effects of gender, the presence of glasses and makeup. Discussion, conclusions and future work are presented in Sect. 4.
1.1 Facial Recognition Systems
A simple facial recognition system operates in three steps. Firstly, an image is scanned for anything that resembles a human face. If a face is detected, then feature data is extracted and stored digitally. This data can then be used to verify an image against a database of images. Facial recognition software relies on the same methods and theory as all other forms of object recognition; however, before a face can be recognised it must first be detected. One of the most important developments in the detection of faces in digital images was the Viola-Jones object detection framework. This framework allows a system to recognise patterns in an image that might constitute a face without being computationally expensive, allowing for real-time face detection [25]. The patterns that the framework attempts to detect are the same patterns that human brains are able to recognise. For example, a few of the properties common to human faces are that the region surrounding the eyes is darker than that of the upper cheeks, and that the nose bridge region is brighter than that of the eyes [26]. The regions of light and dark formed on an image by averaging the intensity of the pixels are known as Haar features. A program trained to look for the Haar features associated with human faces can, in turn, detect faces [27]. By mapping the geometries of these regions, such as the distance between the eyes or the length of the nose, the identity of a face can be profiled digitally using a vector that represents the makeup of a face, known as an Eigenface [28]. The similarity between two faces can then be found by using mathematical algorithms to compare their relative similarity to an Eigenface [29].
Improved methods for the detection and recognition of faces have been developed since the emergence of the Viola-Jones method, including detection through the use of 3D images as opposed to 2D, and the use of high detail digital skin prints capable of detecting subtle differences present in the faces of identical twins [30]. However, the traditional Viola-Jones method and other improved methods based on it remain the most commonly used methods for detection in facial recognition systems, primarily because their simplicity allows them to be used with relatively low computational requirements in comparison to other methods that may require the use of complex systems such as neural networks [31].
The potential solutions for facial recognition of powered wheelchair users range from fully fledged systems that include both the necessary hardware and software for facial recognition, to APIs and software libraries that require integration into a working program and system.
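The light-versus-dark comparison behind Haar features in Sect. 1.1 can be illustrated with a small sketch. It uses the integral image (summed-area table) that the Viola-Jones framework relies on so that any rectangle sum costs four look-ups. The 4x4 patch, its pixel values and the function names are made up for illustration; a real detector evaluates many such features at many positions and scales.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column:
    ii[r][c] holds the sum of all pixels above and to the left of (r, c)."""
    rows, cols = len(img), len(img[0])
    ii = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def region_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table look-ups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_vertical(ii, top, left, height, width):
    """Two-rectangle feature: lower-half sum minus upper-half sum
    of a window that is 2*height rows tall and width columns wide."""
    upper = region_sum(ii, top, left, height, width)
    lower = region_sum(ii, top + height, left, height, width)
    return lower - upper


# Toy patch: a dark 'eye' band (low values) above a bright 'cheek' band.
patch = [
    [10, 10, 10, 10],
    [10, 10, 10, 10],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
]
ii = integral_image(patch)
print(two_rect_vertical(ii, 0, 0, 2, 4))  # 1520: strong dark-over-bright response
```

A uniform patch would give a response of zero, so a large positive value signals the eye-above-cheek pattern described in the text.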
1.2 API
Microsoft’s Azure Face API offered both free and paid pricing tiers depending on the number of calls to the API that were required. A free account could call the API up to a maximum of 20 times per minute and up to 30,000 times per month; however, the ability to store images on Microsoft servers instead of locally required a paid subscription [32].
Amazon’s Rekognition API (a component of Amazon Web Services) offered real-time facial recognition for both images and live video. Similar to Microsoft, Amazon offered both free and paid subscriptions. Per month, the free tier for Rekognition could analyse 5000 images, store up to 1000 images, and analyse 1000 minutes of live video. However, the free tier of this service was limited to one year [33].
Google also produced an API, “Cloud Vision”. However, at the time of this study, they had yet to publicly release a build capable of facial recognition, claiming that “facial recognition merits careful consideration to ensure its use is aligned with our principles and values, and avoids abuse and harmful outcomes” [34]. Cloud Vision was only capable of detecting faces and could not be used to attach identities.
A relatively new entrant to the facial recognition market was Kairos, which, like Amazon’s Rekognition, was capable of applying facial recognition to video. A useful additional feature Kairos offered was the ability to self-host the API on local servers using their Software Development Kit (SDK). Kairos was only available through paid pricing tiers, which cost a monthly fee on top of ‘per transaction’ charges. It was not possible to test the system without a paid license [35].
China’s facial recognition API ‘Face++’ was a publicly available version of the software that was used in China’s 170-million-camera ‘SkyNet’ mass surveillance system [36]. The free pricing tier of Face++ allowed for a maximum of 3 API calls per second with no limits on usage outside of this. Face++ charged extra for additional API request bandwidth instead of charging more for total requests [37].
1.3 Software Libraries
An alternative to calling APIs was to use a local software library. These collections of pre-written code allow a facial recognition system to be run on a local machine without internet connectivity. This, however, comes with the requirement for more powerful hardware. A low-cost embedded system may not have sufficient power for more demanding applications. Whilst technically identical to SDKs, software libraries are free and open source, thus offering more flexibility in their implementation. In comparison to an externally hosted API, they are more intricate to program into a system and require applications utilising them to be sufficiently streamlined so as not to hinder performance. The most popular software libraries in the field of facial recognition were OpenCV and Accord.NET. Both were capable of real-time facial recognition on live video and their respective implementations were well documented online, with various books and tutorials available for both [38, 39].
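The free-tier call limits quoted in Sect. 1.2 matter as soon as a video feed is involved, because a camera produces far more frames than a hosted API will accept. The sketch below is a hypothetical client-side throttle, not part of any vendor SDK: the class name, the 30 frames-per-second camera rate and the use of a sliding window are all assumptions made for illustration, while the 20-calls-per-minute and 30,000-calls-per-month figures are the ones stated above.

```python
from collections import deque


class CallBudget:
    """Sliding-window limiter: at most max_calls in any window_s seconds."""

    def __init__(self, max_calls, window_s):
        self.max_calls = max_calls
        self.window_s = window_s
        self._stamps = deque()

    def try_acquire(self, now):
        # Forget calls that have left the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) < self.max_calls:
            self._stamps.append(now)
            return True
        return False


def minutes_to_exhaust(monthly_quota, calls_per_minute):
    """Minutes of continuous use at the per-minute cap before the monthly quota runs out."""
    return monthly_quota / calls_per_minute


# One minute of an assumed 30 frames-per-second feed against a 20-per-minute cap.
budget = CallBudget(max_calls=20, window_s=60.0)
sent = sum(budget.try_acquire(frame / 30.0) for frame in range(30 * 60))
print(sent)                                  # 20 of 1800 frames are submitted
print(minutes_to_exhaust(30000, 20) / 60)    # 25.0 hours at the per-minute cap
```

Under these assumptions a client would need to sample roughly one frame in ninety, and running continuously at the per-minute cap would use up the monthly allowance in about a day.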
2 Facial Recognition System
A facial recognition system capable of efficient identification of powered wheelchair users required development from the ground up using commercially available APIs or libraries. These handle the task of detecting and recognising faces, but what a system does with the returned data is entirely dependent on the application they are built into. The development of the facial recognition system began with a comparison of the various available solutions to determine the one most suitable for use. The early stages of development that followed concerned the implementation of the chosen solution in a simple system, for example in an application that was only capable of applying facial detection to an image. Continued development aimed towards a more advanced system capable of facial recognition on a video feed.
An important factor for determining the most feasible API was ease of implementation. Azure Face was by far the most flexible, with several in-depth examples of how to effectively utilise it in a system available on Microsoft’s website or through third-party programming tutorial/hobbyist sites. Rekognition’s documentation was deemed to be less accessible, being marketed more towards seasoned developers of Amazon Web Services, but as with Azure Face it had the benefit of being easily importable into a Visual Studio project through the NuGet package manager. Kairos and Face++ did feature notable documentation, but it was not as comprehensive as that of Azure or Rekognition. With these factors in mind, the decision made was to use the Azure Face API for the developmental system. Rekognition’s greater transactions-per-second (TPS) bandwidth made it the superior candidate for use in a real-world system; however, it was not selected for use due to the likelihood of its complexity of implementation hindering development.
Alongside the development of a prototype system utilising an API, an experimental system utilising the OpenCV library was also developed. These two systems eventually merged to form a final system that utilised both Azure Face API and OpenCV.
3 Results
The system needed to be able to recognise wheelchair users from a webcam video feed after being trained with the facial data extracted from an identification (ID) photo or the images on their electronic record card. The images were of mixed quality and resolution. The aging of wheelchair users was also considered and the confidence value returned by the system was a useful metric for this. The system was trained with 12 known user faces and with an additional 250 unknown faces, all from the same source of ID photos. Of the 250 untested faces trained in the system, the ID photos of 3 faces could not be used for training because the system was unable to detect a face (i.e. locate a face in the picture); a failure rate of 1.2%. These images were discarded and replaced with images that the system could use.
Testing consisted of having users present themselves to the system via a webcam feed. The system then attempted to recognise the user in the same way it did the images used in preliminary testing. Where possible, elements of the testing were kept constant. All users were tested using the same webcam under lighting conditions that were as consistent as possible. The test was repeated on each user 3 times. In one test users were asked to present a neutral expression, in another test users were asked to smile, and in the final test users were asked to attempt to mimic the expression of the training photo. If a user wore glasses, they were tested for an additional three instances without wearing glasses so that the difference in confidence values returned could be compared. Three confidence values were obtained for each user tested, with an additional three also obtained for users who wore glasses. These values were compiled into two separate average values, as shown in Table 1. The size of the collection of trained faces had no effect on returned confidence values. The system was able to correctly identify all users at least once and no users were misidentified (i.e.
as a different user). A single user was not recognised by the system whilst wearing glasses, but was correctly identified every time when not wearing glasses (the user’s ID photo featured them without glasses). Other users also returned a lower value whilst wearing glasses compared to not wearing glasses. Excluding the unidentifiable student, the differences in confidence values returned for glasses-wearing students are shown in Table 2. The average difference between these results was 8%, enough to bring a user below the 50% recognition threshold if their confidence value was already low without the addition or removal of glasses.

Table 1. Results of user testing in descending order of average confidence value.
| Average (no glasses) | Average (glasses) | Gender | Neutral | Smile | Mimic | Glasses: Neutral | Glasses: Smile | Glasses: Mimic |
|---|---|---|---|---|---|---|---|---|
| 75.6% | N/A | M | 77.7% | 73.8% | 79.1% | N/A | N/A | N/A |
| 72.8% | N/A | M | 70.0% | 73.9% | 74.6% | N/A | N/A | N/A |
| 70.5% | N/A | M | 71.1% | 69.9% | 73.7% | N/A | N/A | N/A |
| 65.3% | N/A | M | 62.1% | 65.3% | 68.5% | N/A | N/A | N/A |
| 53.3% | 62.8% | M | 54.3% | 50.8% | 54.8% | 63.5% | 59.1% | 65.9% |
| 62.5% | 53.1% | F | 59.1% | 63.4% | 65.1% | 52.7% | 53.5% | 55.5% |
| 62.0% | 54.0% | F | 58.3% | 62.4% | 65.3% | 53.4% | 54.1% | 54.5% |
| 61.7% | 52.6% | M | 59.9% | 61.0% | 64.3% | 51.4% | 53.8% | 54.2% |
| 60.7% | 0% | M | 61.4% | 58.2% | 62.5% | 0% | 0% | 0% |
| 51.2% | 56.1% | F | 51.3% | 50.8% | 51.6% | 56.3% | 55.8% | 62.1% |
| 54.6% | N/A | F | 52.8% | 53.4% | 57.7% | N/A | N/A | N/A |

Table 2. A comparison of the difference in confidence values for glasses-wearing users.
| No Glasses | Glasses | Difference |
|---|---|---|
| 53.3% | 62.8% | 9.5% |
| 62.5% | 53.1% | 9.4% |
| 62.0% | 54.0% | 8% |
| 61.7% | 52.6% | 9.1% |
| 51.2% | 56.1% | 4.9% |

Varying facial expressions had a similar but lesser effect on returned confidence values. If a student’s expression in their ID photo was neutral or smiling, a higher confidence value was returned when asked to present with the same expression. Users’ attempts at mimicking the exact expression of their ID photo resulted in higher confidence values. The differences in confidence values as a result of facial expression can be seen below in Table 3. The average difference between the higher and lower confidence values of the neutral and smile expressions was 2%; the difference between the higher of these two values and the value of the mimic expression was an additional 2%.
A factor that had a more significant effect on the confidence value was the gender of the participants. The difference in confidence values can be noted from Table 1, with no female user scoring above a 63% average confidence value, whereas 4 male users scored well over that. The difference between the average male value and the average female value was 7%.
The research question posed at the start of this paper was to evaluate whether a system could identify a user from a pool of 262 identification pictures, using a video camera.

Table 3. A comparison of the differences in confidence values for different expressions.
| Neutral | Smile | Mimic | Difference: Neutral/Smile | Difference: Mimic |
|---|---|---|---|---|
| 77.70% | 73.80% | 79.10% | 3.90% | 1.40% |
| 70.00% | 73.90% | 74.60% | 3.90% | 0.70% |
| 71.10% | 69.90% | 73.70% | 1.20% | 2.60% |
| 62.10% | 65.30% | 68.50% | 3.20% | 3.20% |
| 54.30% | 50.80% | 54.80% | 3.50% | 0.50% |
| 59.10% | 63.40% | 65.10% | 4.30% | 1.70% |
| 58.30% | 62.40% | 65.30% | 4.10% | 2.90% |
| 59.90% | 61.00% | 64.30% | 1.10% | 3.30% |
| 61.40% | 58.20% | 62.50% | 3.20% | 1.10% |
| 51.30% | 50.80% | 51.60% | 0.50% | 0.30% |
| 52.80% | 53.40% | 57.70% | 0.60% | 4.30% |
| 63.50% | 59.20% | 65.90% | 4.30% | 2.40% |
| 52.70% | 53.50% | 55.50% | 0.80% | 2.00% |
| 53.40% | 54.10% | 55.50% | 0.70% | 0.40% |
| 51.40% | 53.80% | 54.20% | 2.40% | 0.40% |
| 56.30% | 55.80% | 62.10% | 0.50% | 5.80% |

The experimental analysis looked at the relationship between the source data and the recognition rates to establish a confidence factor in the ability to identify wheelchair users with the applied test data.
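The arithmetic behind the glasses and expression comparisons in this section can be reproduced with a short sketch. The values are copied from Table 2 and from the first row of Table 3, and the 50% recognition threshold is the one stated in the text; the function names and the two example confidence values passed to `survives_gap` are illustrative assumptions.

```python
RECOGNITION_THRESHOLD = 50.0  # per cent, as stated in the text

# (no glasses, glasses) average confidence values from Table 2, in per cent.
table2 = [(53.3, 62.8), (62.5, 53.1), (62.0, 54.0), (61.7, 52.6), (51.2, 56.1)]

# Absolute gap per user, rounded to one decimal as in the table.
gaps = [round(abs(no_glasses - glasses), 1) for no_glasses, glasses in table2]
mean_gap = sum(gaps) / len(gaps)
print(gaps)                # [9.5, 9.4, 8.0, 9.1, 4.9]
print(round(mean_gap, 2))  # 8.18, reported as 8% in the text


def survives_gap(confidence, gap=mean_gap, threshold=RECOGNITION_THRESHOLD):
    """True if a confidence value stays at or above the threshold after losing `gap` points."""
    return confidence - gap >= threshold


def expression_gaps(neutral, smile, mimic):
    """Return (|neutral - smile|, mimic - max(neutral, smile)), rounded to one decimal."""
    return round(abs(neutral - smile), 1), round(mimic - max(neutral, smile), 1)


print(survives_gap(62.0))                 # True: 62.0 - 8.18 is still above 50
print(survives_gap(54.6))                 # False: 54.6 - 8.18 falls below 50
print(expression_gaps(77.7, 73.8, 79.1))  # (3.9, 1.4), the first row of Table 3
```

The second `survives_gap` call shows the situation described in the text: a user who is only just above the threshold can drop below it when glasses are added or removed.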
4 Discussion and Conclusions
The results demonstrated the system to be 95% efficient in its ability to identify wheelchair users with the applied test data. The research presented here summarises the data and facts obtained from two of the six possible solutions identified in the literature survey. Future work will evaluate the relative efficiency of the other solutions and whether they are superior to the results obtained during this research. The scientific value added by this paper includes the application of existing face recognition techniques to a new area: using ID photos to identify wheelchair users from video feeds and applying the results to benefit wheelchair users, carers and practitioners. Overall, the project has successfully determined that a facial recognition system reliant on ID photos or the images on electronic record cards could be used to identify powered wheelchair users in order to configure the devices for the needs of that particular user. The Azure Face API could return additional data such as emotional values or estimated physical characteristics. That data was not used during this study, but it may prove useful in future studies.
The identity of the face detected, along with a confidence value, was returned for each test. The results suggested that the API was biased towards recognising male faces, with only one female scoring above a 70% confidence value. This discrepancy might have been due to the use of ID photos as the source data for the test: the images of female individuals featured them wearing notably different levels of makeup, which negatively affected the system’s ability to recognise them when comparing the two images. This is reinforced by the fact that 3 of the 5 lowest confidence values returned during the test came from females who wore heavy makeup in one image and no makeup in the other. Notable differences in lighting conditions could also have been a significant factor. The ID photos used in the study were selected and supplied by the users; however, in a real clinical situation these photos are likely to have been provided at the time of first use of the wheelchair, or from clinical records, where there is likely to be more consistency in the level of makeup. The benefit that this system offers to wheelchair users, support practitioners, carers and helpers is a reduced time for setting up wheelchairs for individual wheelchair users. Helpers/carers often struggled with changing wheelchair settings when changing users, especially when different users required different interfaces, sensors and input devices. Identifying users of a powered wheelchair from a video stream and automatically configuring wheelchair settings would therefore extend times for actual driving and result in less frustration. Future work will include comparing the results presented here with other toolkits and approaches and identifying the limitations of existing systems for facial recognition in this application.
Intelligent User Interface to Control a Powered Wheelchair Using Infrared Sensors
Malik Haddad1(B), David Sanders1, Giles Tewkesbury1, Martin Langner2, and Sarinova Simandjuntak1
1 University of Portsmouth, Portsmouth PO1 3DJ, UK
[email protected]
2 Chailey Heritage Foundation, North Chailey BN8 4EF, UK
Abstract. This paper presents a new system to steer a powered wheelchair using a Sharp IR sensor and a Raspberry Pi. Interviews with occupational therapists, helpers and carers at Chailey Heritage Foundation/School revealed that clicking noises generated by closing the switch contacts used to operate powered wheelchairs disturbed the attention and reduced the focus of young wheelchair users with cognitive or physical disability. Switches also often slipped away and became unreachable. The new system replaced the lever-switches used to steer powered wheelchairs with an electronic circuit. The circuit consisted of a Sharp IR sensor, an Analogue to Digital converter, relays, and a Raspberry Pi. The Sharp IR sensor detected movement in its range and the Raspberry Pi interpreted the data and generated commands to steer a powered wheelchair. Two modes were used to overcome the problem of sensors slipping from position: Click to Calibrate and Auto-Calibrate. A technical User Interface was created to modify sensitivity, user and detection settings. Practical testing showed the system behaved satisfactorily. It detected users’ voluntary movements, used them to steer a powered wheelchair and overcame the problem of switches slipping from position. Clinical trials will be conducted at Chailey Heritage Foundation.
Keywords: Disabled · Infrared sensor · User interface · Wheelchair
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 640–649, 2022. https://doi.org/10.1007/978-3-030-82193-7_43

1 Introduction
This paper presents a new system to steer a powered wheelchair using a Sharp IR sensor, a Raspberry Pi and a set of relays. The work described here is part of broader research carried out by the authors at Chailey Heritage Foundation and the University of Portsmouth, funded by the Engineering and Physical Sciences Research Council (EPSRC) [1]. The main aims of this research are to use Artificial Intelligence (AI) techniques to increase mobility and improve the quality of life of disabled powered wheelchair users by improving their self-reliance and confidence.
Recent studies showed that around one sixth of the world population were diagnosed with some sort of disability, with 2–4% of them diagnosed with mobility problems [2]. Due to recent medical advancements, population ageing and the spread of long-term health problems, these figures were increasing [2, 3], and disabled individuals usually had a lower quality of life [4]. People with disability often used powered mobility for daily activities [5] and researchers have aimed to improve powered mobility using: sensor systems [6], ultrasonics [7], Multiple Criteria Decision Making and ultrasonic sensors [9–12], Expert Systems [13, 14], Rule Based Systems [15], microcomputers and Human Machine Interfaces [16, 17], Deep Learning [2, 18] and camera modules [19–22].
Wheelchair users often controlled their speed and direction with a joystick. If a user lacked the coordination to use a joystick effectively, or if they could not use their hands or fingers, then other input devices could be used (lever-switch, foot control, sip tubes/puff switches, or head or chin controllers, etc.). Of these, lever switches were often used to control powered wheelchairs. A lever-switch was a type of electrical component used to control current to the electrical motors. Figure 1 shows lever switches used to control a powered wheelchair.
Fig. 1. Lever Switches used to steer a powered wheelchair [7].
Several switches were often used to control a wheelchair. Wheelchair controllers have normally been open-loop. Powered wheelchair users indicated their preferred direction by pressing the switch responsible for driving a wheelchair in that direction. Pressing the switch closed contacts and connected a circuit which triggered a specific relay and allowed current to flow to activate the motor responsible for driving the wheelchair in the desired direction. Closing switch contacts made clicking sounds. Recent interviews carried out by the authors with occupational therapists, helpers and carers at Chailey Heritage Foundation/School revealed that the clicking noise caused discomfort, disturbed attention and reduced the focus of some young wheelchair users having cognitive or physical disability. Also switches often slipped away and became unreachable because of wheelchair movement or because of the terrain. Helper intervention was often required to reposition a switch. That often disrupted driving sessions and user engagement.
A new system is presented in this paper to replace the lever switches and overcome the problem of switches slipping from position. The next Section presents the new system. Section 3 presents some results and discussion. Conclusions and future work are presented within Sect. 4.
2 The New System
The work presented in this paper replaced the lever switches used to operate a powered wheelchair with a new system that did not generate clicking sounds and that overcame the problem of repositioning. A replacement system was created using a Sharp IR range sensor (SHARP GP2Y0A41SK0F) and a Raspberry Pi. The prototype system is shown in Fig. 2.
Fig. 2. Prototype of the new system to replace lever-switches.
The Sharp IR range sensor was suitable for detecting objects at ranges from 4 cm to 30 cm and required 5 V DC and ground connections. The Sharp sensor communicated with the Raspberry Pi via the Serial Peripheral Interface (SPI). A program was created using the Python programming language as shown in Fig. 3. To control the function of the range sensor, a simple User Interface (UI) was created containing three buttons: Start, Stop and Calibrate. The UI is shown in Fig. 4. Users with cognitive disability often used the UI. The simple UI matched the capabilities of users with disability. The new system had a straightforward operation for steering powered wheelchairs and provided a suitable match between desired commands and user capabilities [22]. The program was installed on the Raspberry Pi.

Fig. 3. Python 3 program used to create the user interface and control the Sharp IR range sensor.

Fig. 4. Simple user interface to control the function of the Sharp IR range sensor.

The Start Button was used to activate the sensor. For safety reasons the sensor would not control the function of the wheelchair unless the Start Button was pressed. If the Start Button was pressed, the sensor would detect the nearest object in its range and the Raspberry Pi would calculate the distance between them. This distance was used as a reference to check for movement. If the sensor detected an object in its range, the sensor would send analogue data to Channel 0 of the ADC. The ADC would convert the analogue data to digital data that could be interpreted by the Raspberry Pi. Using the sensor datasheet, a fourth-order polynomial approximation was applied to find the distance (in cm) between the sensor and a detected object [24]. The distance between the sensor and a detected object was calculated using Eq. 1.

$$D = 16.2537v^{4} - 129.893v^{3} + 382.268v^{2} - 512.611v + 301.439 \tag{1}$$

where D represented the distance between the sensor and the detected object and v represented the digital value generated by the ADC.
If a sensor detected an object in its range, the Raspberry Pi would calculate the distance between the sensor and that object. The Python program would set that distance as a reference distance. If the object moved, the sensor would detect that movement and the Raspberry Pi would calculate the new distance between the sensor and the object, the distance moved and the direction of the movement. A threshold value was set to identify how much movement was required to activate the wheelchair. Each user would have a specific threshold value assigned to them. That threshold value depended on the amount of voluntary movement the user could generate based on their level of functionality and type of disability. If the object moved towards the sensor and the distance moved was greater than the threshold value, the Raspberry Pi would assign a high logic value to a designated pin identified by the program. The high logic value was used to activate the wheelchair motor. If the object moved away from the sensor and the distance moved was greater than the threshold value, the Raspberry Pi would assign a low logic value to the designated pin to deactivate the motor. The Stop Button was pressed to stop the sensor from controlling the wheelchair.
The Calibrate Button was used to overcome the problem of undesired slippage of the sensor. For example, the sensor might slip towards the user so that the wheelchair was undesirably activated, or the voluntary movement produced by the user might no longer be in the range of the sensor. In such cases, helpers could press the Calibrate Button to trigger the sensor to search for the nearest object and the Raspberry Pi would calculate a new reference distance to detect movement from the sensor’s new position. Auto-Calibrate mode was added to the new system. In this mode, the sensor automatically calculated the reference distance without the need to press the Calibrate Button. The new mode did not require a UI since it operated during the boot-up sequence of the Raspberry Pi.
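The distance calculation and the threshold logic described in this section can be sketched in a few lines of Python. This is a minimal illustration and not the authors' program: the class and method names are hypothetical, the motor state is returned as a Boolean instead of being written to a GPIO pin, and the reference distance is assumed to stay fixed until it is recalibrated. The polynomial is evaluated on v exactly as given in Eq. 1; any scaling of the ADC output before it is passed in is left to the caller.

```python
def distance_cm(v):
    """Evaluate the fourth-order polynomial of Eq. 1 for an input v."""
    return (16.2537 * v**4 - 129.893 * v**3 + 382.268 * v**2
            - 512.611 * v + 301.439)


class MovementDetector:
    """Tracks a reference distance and decides when to drive the motor.

    Hypothetical helper: names and the Boolean return value are
    assumptions made for illustration only.
    """

    def __init__(self, threshold_cm):
        self.threshold_cm = threshold_cm
        self.reference_cm = None
        self.motor_on = False

    def calibrate(self, current_cm):
        """Store the current distance as the new reference (Calibrate Button)."""
        self.reference_cm = current_cm
        self.motor_on = False

    def update(self, current_cm):
        """Return the motor state after comparing a reading with the reference."""
        if self.reference_cm is None:
            self.calibrate(current_cm)  # first reading becomes the reference
            return self.motor_on
        moved = self.reference_cm - current_cm  # positive: towards the sensor
        if moved > self.threshold_cm:
            self.motor_on = True    # corresponds to the high logic value
        elif moved < -self.threshold_cm:
            self.motor_on = False   # corresponds to the low logic value
        return self.motor_on


detector = MovementDetector(threshold_cm=3.0)
detector.calibrate(20.0)
print(detector.update(19.0))  # 1 cm of movement is below the threshold
print(detector.update(15.0))  # 5 cm towards the sensor
print(detector.update(24.0))  # 4 cm away from the reference
```

Keeping the decision logic separate from the hardware access makes it possible to test the thresholds for a particular user without a sensor attached.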
3 Testing
A student at Chailey Heritage School was considered as a case study. The student could produce voluntary movement with their head and used that movement to control a powered wheelchair. Head switches were used to transform that movement to steering direction of a powered wheelchair. The voluntary movement was used to steer a powered wheelchair left or right. The head switches were replaced by two Sharp IR sensors. The sensors were placed at the same physical location as the head switches. The sensors were used to detect the movement of the student’s head as shown in Fig. 5. A sensor was activated when the student moved their head towards it. The new system presented in this paper transformed the voluntary movement to steering directions for the powered wheelchair.
Fig. 5. IR sensors used to operate a powered wheelchair.
A new technical UI was created to allow helpers to input user settings. Advanced parameters were used to accurately translate users’ desires and filter out unwanted movement (see Fig. 6).
Fig. 6. Technical user interface to input user settings.
The new technical UI allowed helpers to input and modify different parameters using track-bars as shown in Fig. 6. The following parameters were considered:
• Threshold represented the minimum detection distance of the sensor. The larger the Threshold value, the less sensitive the sensor would be to movements of the user’s head.
• Cut off Time allowed the new system to operate in two different modes: Switch Mode and Time Delay Mode. Setting the Cut off Time track-bar to 0 would trigger the system to operate in Switch Mode, where the system would use the sensors as switches. If an object was detected, a specific relay was triggered on and would remain triggered until the object was no longer detected. Setting the Cut off Time track-bar to any value other than 0 allowed the system to operate in Time Delay Mode, where if an object was detected in a sensor range, a specific relay was activated for the time in seconds specified by the track-bar, after which the relay would be switched off if no object was detected within sensor range.
• Minimum distance represented the minimum cut off distance, where closer objects were not detected.
• Maximum distance represented the maximum cut off distance, where objects beyond it were not detected.
• Sampling Time allowed powered wheelchair users to drive their wheelchairs safely across uneven ground. If the drive terrain was unsettled, the sensors could provide readings that did not represent the users’ desires because of vibrations, head movement, or momentary sensor movements. The new system took two readings for an object detected in its range with the Sampling Time in between. That allowed the new system to overcome the problem of momentary sensor movement and head movement due to uneven driving terrain.
Detection range was calculated by subtracting Minimum distance from Maximum distance. If helpers input a Maximum distance smaller than the Minimum distance, an error message would appear asking the helper to adjust the sensor position according to the input distances, as shown in Fig. 7.
Fig. 7. Error message asking helpers to adjust sensor position.
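The parameters listed above can be collected in a small settings object. The Python sketch below is one possible reading of the description and not the authors' code: the names DriverSettings, confirmed_detection and relay_on are hypothetical, requiring both readings to lie inside the detection window is an assumption about how the two samples were combined, and Time Delay Mode is interpreted as holding the relay on until the Cut off Time has elapsed since the last detection.

```python
import time
from dataclasses import dataclass


@dataclass
class DriverSettings:
    threshold_cm: float
    cut_off_time_s: float   # 0 selects Switch Mode, otherwise Time Delay Mode
    min_distance_cm: float
    max_distance_cm: float
    sampling_time_s: float

    def validate(self):
        """Raise ValueError when Maximum distance is below Minimum distance."""
        if self.max_distance_cm < self.min_distance_cm:
            raise ValueError("Maximum distance is smaller than Minimum distance")

    @property
    def detection_range_cm(self):
        """Detection range: Maximum distance minus Minimum distance."""
        return self.max_distance_cm - self.min_distance_cm

    def in_window(self, distance_cm):
        """True when a reading lies between the two cut-off distances."""
        return self.min_distance_cm <= distance_cm <= self.max_distance_cm


def confirmed_detection(read_distance_cm, settings, sleep=time.sleep):
    """Take two readings separated by the Sampling Time; both must be in range."""
    first = read_distance_cm()
    sleep(settings.sampling_time_s)
    second = read_distance_cm()
    return settings.in_window(first) and settings.in_window(second)


def relay_on(detected, seconds_since_last_detection, settings):
    """Relay state for Switch Mode (cut-off 0) or Time Delay Mode."""
    if settings.cut_off_time_s == 0:
        return detected
    return detected or seconds_since_last_detection < settings.cut_off_time_s


settings = DriverSettings(3.0, 0.0, 5.0, 25.0, 0.1)
settings.validate()
readings = iter([12.0, 40.0])  # the second reading is a momentary glitch
print(confirmed_detection(lambda: next(readings), settings,
                          sleep=lambda seconds: None))
```

Passing the reading function and the sleep function in as arguments keeps the sketch independent of the sensor hardware, so the filtering can be exercised with recorded or simulated readings.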
Figure 8 shows the prototype of the new system connected to a touchscreen used to input users’ settings. Once Driver Settings were input and the Apply Settings Button was pressed, the program stored all the settings in a CSV file. The CSV file could be used later to install driver settings during boot-up. The Auto-Calibrate Mode would read the CSV file and install the driver’s settings during boot-up. Improvements to the new system will be made based on users’ and helpers’ feedback.
Fig. 8. Prototype of the new system showing a touch screen to input user settings.
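Storing the driver settings in a CSV file and reading them back at boot-up could look roughly like the sketch below. It is an illustration under stated assumptions: the column names, the file name and the two function names are hypothetical, and the paper does not describe the actual layout of its CSV file.

```python
import csv
import os
import tempfile

# Hypothetical column names; the real file layout is not given in the paper.
FIELDS = ["threshold", "cut_off_time", "min_distance",
          "max_distance", "sampling_time"]


def save_settings(path, settings):
    """Write one row of driver settings, with a header, to a CSV file."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerow(settings)


def load_settings(path):
    """Read the first row of a settings file and convert values to float."""
    with open(path, newline="") as handle:
        row = next(csv.DictReader(handle))
    return {name: float(row[name]) for name in FIELDS}


original = {"threshold": 3.0, "cut_off_time": 0.0, "min_distance": 5.0,
            "max_distance": 25.0, "sampling_time": 0.1}

# Round trip through a temporary file, standing in for Apply Settings
# followed by a later boot-up.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "driver_settings.csv")
    save_settings(path, original)
    restored = load_settings(path)

print(restored == original)
```

A header row makes the file readable by helpers in a spreadsheet and lets the loader ignore column order.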
4 Conclusions and Future Work
A new system was created to replace lever switches used to operate powered wheelchairs. The new system eliminated the clicking noise generated from closing switch contacts and the problem of switches becoming unreachable to users due to powered wheelchair movement or uneven terrain. This was achieved by introducing two functioning modes: Click-to-Calibrate and Auto-Calibrate. The two modes improved disabled powered wheelchair users’ mobility. The new system provided a friendly and simple User Interface and required less effort to control a powered wheelchair. Components used were relatively cheap. The system provided a straightforward and simple yet efficient solution for the problems presented in this paper. To reduce cost, the authors will upload the schematic diagram and the Python program to an open access platform. Disabled wheelchair users will be able to download them free of charge. The new system provided a faster response time and required less hardware than the lever-switches.
Clinical trials will be conducted to investigate the effectiveness of the new system and improvements will be made based on users’, helpers’ and carers’ feedback. Future work will consider creating a new UI to incorporate multiple users [16, 25]. Users could be digitally added to the new system. A specific threshold value will be assigned for each user based on their needs, level of functionality and type of disability. Future work will also consider using Artificial Neural Networks to identify which user is operating the new system and how they are performing.
Acknowledgment. This research was supported by an EPSRC EP/S005927/1 project titled “Using artificial intelligence to share control of a powered-wheelchair between a wheelchair user and an intelligent sensor system”. Investigators: Sanders, DA and Gegov, AE. Senior Researchers: Haddad, MJ and Langner, MC.
References
1. Sanders, D., Gegov, A.: Using artificial intelligence to share control of a powered-wheelchair between a wheelchair user and an intelligent sensor system. EPSRC Project 2019–2022 (2018)
2. Haddad, M.J., Sanders, D.A.: Deep Learning architecture to assist with steering a powered wheelchair. IEEE Trans. Neural Syst. Rehabil. Eng. 28(12), 2987–2994 (2020)
3. Krops, L.A., Hols, D.H., Folkertsma, N., Dijkstra, P.U., Geertzen, J.H., Dekker, R.: Requirements on a community-based intervention for stimulating physical activity in physically disabled people: a focus group study amongst experts. Disabil. Rehabil. 40(20), 2400–2407 (2018)
4. Bos, I., Wynia, K., Almansa, J., Drost, G., Kremer, B., Kuks, J.: The prevalence and severity of disease-related disabilities and their impact on quality of life in neuromuscular diseases. Disabil. Rehabil. 41(14), 1676–1681 (2019)
5. Frank, A.O., De Souza, L.H.: Clinical features of children and adults with a muscular dystrophy using powered indoor/outdoor wheelchairs: disease features, comorbidities and complications of disability. Disabil. Rehabil. 40(9), 1007–1013 (2018)
6. Sanders, D.A., Langner, M., Tewkesbury, G.E.: Improving wheelchair-driving using a sensor system to control wheelchair-veer and variable-switches as an alternative to digital-switches or joysticks. Ind. Robot Int. J. 32(2), 157–167 (2010)
7. Langner, M.: Effort reduction and collision avoidance for powered wheelchairs: SCAD Assistive Mobility System. Doctoral dissertation, University of Portsmouth (2012)
8. Sanders, D.A.: Using self-reliance factors to decide how to share control between human powered wheelchair drivers and ultrasonic sensors. IEEE Trans. Neural Syst. Rehabil. Eng. 25(8), 1221–1229 (2016)
9. Haddad, M., Sanders, D., Gegov, A., Hassan, M., Huang, Y., Al-Mosawi, M.: Combining multiple criteria decision making with vector manipulation to decide on the direction for a powered wheelchair. In: Bi, Y., Bhatia, R., Kapoor, S. (eds.) IntelliSys 2019. AISC, vol. 1037, pp. 680–693. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-29516-5_51
10. Haddad, M., Sanders, D., Langner, M., Ikwan, F., Tewkesbury, G., Gegov, A.: Steering direction for a powered-wheelchair using the analytical hierarchy process. In: Proceedings of the 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 229–234. IEEE, Bulgaria (2020)
11. Haddad, M., et al.: Use of the analytical hierarchy process to determine the steering direction for a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1252, pp. 617–630. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_46
12. Haddad, M.J., Sanders, D.A.: Selecting a best compromise direction for a powered wheelchair using PROMETHEE. IEEE Trans. Neural Syst. Rehabil. Eng. 27(2), 228–235 (2019)
13. Sanders, D.A., Bausch, N.: Improving steering of a powered wheelchair using an expert system to interpret hand tremor. In: Liu, H., Kubota, N., Zhu, X., Dillmann, R., Zhou, D. (eds.) ICIRA 2015. LNCS (LNAI), vol. 9245, pp. 460–471. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-22876-1_39
14. Sanders, D.A., Haddad, M., Tewkesbury, G.E., Thabet, M., Omoarebun, P., Barker, T.: Simple expert system for intelligent control and HCI for a wheelchair fitted with ultrasonic sensors. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 211–216. IEEE, Bulgaria (2020)
15. Sanders, D.A., Gegov, A., Haddad, M., Ikwan, F., Wiltshire, D., Tan, Y.C.: A rule-based expert system to decide on direction and speed of a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2018. AISC, vol. 868, pp. 822–838. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01054-6_57
16. Haddad, M., Sanders, D., Ikwan, F., Thabet, M., Langner, M., Gegov, A.: Intelligent HMI and control for steering a powered wheelchair using a Raspberry Pi microcomputer. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 223–228. IEEE, Bulgaria (2020)
17. Haddad, M., et al.: Intelligent control of the steering for a powered wheelchair using a microcomputer. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2020. AISC, vol. 1252, pp. 594–603. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_44
18. Haddad, M., Sanders, D., Tewkesbury, G., Langner, M.: A novel collision avoidance system for steering a powered wheelchair using deep learning architecture. In: 24th IEEE International Conference on Intelligent Transportation - ITSC2021. IEEE (2021, submitted)
19. Haddad, M., Sanders, D., Tewkesbury, G., Langner, M.: Novel approach for digitizing the scanning collision avoidance device detection range. In: 32nd IEEE Intelligent Vehicles Symposium, 2021. IEEE, Japan (2021, submitted)
20. Haddad, M., Sanders, D., Langner, M., Tewkesbury, G.: Novel approach to steer a powered wheelchair using image processing algorithm and Raspberry Pi. In: 32nd IEEE Intelligent Vehicles Symposium, 2021. IEEE, Japan (2021, submitted)
21. Haddad, M., Sanders, D., Tewkesbury, G., Langner, M.: Using open source computer vision algorithms to drive a powered wheelchair. In: 24th IEEE International Conference on Intelligent Transportation - ITSC2021. IEEE (2021, submitted)
22. Haddad, M., Sanders, D., Langner, M., Tewkesbury, G.: Steering a powered wheelchair using a camera module and Python Imaging Library. In: 24th IEEE International Conference on Intelligent Transportation - ITSC2021. IEEE, USA (2021, submitted)
23. Lewis, C.: Simplicity in cognitive assistive technology: a framework and agenda for research. Univ. Access Inf. Soc. 5(4), 351–361 (2007)
24. Spark Fun Homepage: https://www.sparkfun.com/products/8958#comment-4f864d34ce395f9161000000. Accessed 16 Nov 2020
25. Haddad, M., Sanders, D., Langner, M., Tewkesbury, G.: One shot learning approach to identify drivers. In: SAI Intelligent Systems Conference, IntelliSys. Netherlands (2021, Accepted and in Press)
A Classification Based Ensemble Pruning Framework with Multi-metric Consideration
Ya-Lin Zhang, Qitao Shi, Meng Li, Xinxing Yang, Longfei Li, and Jun Zhou(B)
Ant Group, Hangzhou, China
{lyn.zyl,qitao.sqt,lm168260,xinxing.yangxx,longyao.llf,jun.zhoujun}@antgroup.com
Abstract. Machine learning techniques are an essential driving force for Internet companies in handling daily tasks such as fraud detection. Given a pre-designed model, several measures, such as fine-tuning and ensemble techniques, can be taken to enhance its performance. However, more effective methods that take task-specific requirements into account are still in high demand. In this paper, we aim to efficiently obtain an effective framework that considers multiple metrics, based on a pre-designed model. To achieve this, we build a framework that utilizes the ensemble pruning technique and adapts a classification based optimization method to perform the model pruning procedure. On the one hand, to keep the performance of the pruned ensemble model consistent between the selection and deployment stages, a calibration strategy is introduced into the model pruning (weight optimization) procedure. On the other hand, to meet the business requirement for multi-metric consideration, multiple evaluation metrics can be plugged into this framework, and we improve the optimization method to optimize these metrics simultaneously. Experiments are first conducted on several benchmark datasets, and the results show that this framework can effectively enhance the performance. We further apply the proposed framework to large-scale fraud detection tasks, which also validates its effectiveness.
Keywords: Ensemble pruning · Classification-based optimization · Multi-metric evaluation · Fraud detection
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 650–667, 2022. https://doi.org/10.1007/978-3-030-82193-7_44
Machine learning techniques have been widely explored and play an indispensable role for almost all Internet corporations in handling diverse tasks such as recommendation [6,22,28,42], advertising [4,11,20], face recognition [29,37,38], object detection [2,15,36], and fraud detection [3,16,33,34]. Considerable improvements and significant benefits have been obtained with the help of these techniques in recent years. Designing new effective methods and fully exploiting their potential is regarded as a pivotal issue for machine learning tasks.
Given a pre-designed model, many different strategies can be employed to fully exploit its performance. For example, the hyperparameters of the model are usually fine-tuned [8,31] so that the best configuration is selected and the corresponding model is deployed, giving the best performance for a single model. Another typical strategy is to ensemble multiple diversified component models [21,35,39,40]; it has been shown to be promising and is widely applied, especially in industrial tasks and competitions.
However, these strategies still have shortcomings in large-scale industrial scenarios, where a large number of training samples is typically used. For hyperparameter tuning, a general strategy is to follow the try-and-evaluate paradigm [13,31]. This process usually needs multiple rounds to find the best hyperparameter combination even with ingeniously designed methods, and the total time and resource consumption may be unaffordable in large-scale scenarios, since the data is huge and even a single evaluation of one group of hyperparameters can be very time-consuming. As for the ensemble method [26,39], we may not need to carefully tune the parameters of every single model, and ensemble models can usually achieve state-of-the-art performance. However, simply using all available components results in computation and storage costs that may be prohibitive for many applications, while it has been shown that the performance can be further improved by properly using some instead of all of the available component learners [23,41]. In addition, some tasks are concerned with multiple metrics simultaneously, which many methods do not address. There is no doubt that more effective methods, which also consider task-specific metrics, are in demand.
In this paper, we aim to efficiently build an effective framework based on a pre-designed model, taking multiple metrics into consideration if necessary. The basic idea is to equip the ensemble pruning framework with an efficient and effective optimization method, and to make it feasible in multi-metric evaluation scenarios. Following the ensemble pruning methodology, we equip it with a classification based optimization method to construct a performance-enhancing framework. Besides, the framework is extended to take multiple metrics into consideration during the optimization procedure. The main contributions of this work can be summarized as follows:
– We utilize the ensemble pruning technique and propose to employ a classification based derivative-free optimization method to efficiently build a framework. To keep the pruned ensemble model consistent between the selection and deployment stages, a calibration strategy is proposed for the ensemble pruning procedure.
– The framework is further extended to take multiple metrics into consideration during the optimization process. Since the task becomes multi-objective, traditional techniques for single-objective optimization cannot be applied directly; we therefore improve the derivative-free optimization strategy for pruning the ensemble components in this scenario.
– Extensive experiments are conducted on various benchmark datasets, and the results show that the proposed methods can substantially improve the performance. Besides, we further employ this framework on large-scale fraud detection tasks and validate its effectiveness.
The rest of the paper is organized as follows. In Sect. 2, we briefly present the related work. In Sect. 3, we elaborate the proposed method. In Sect. 4, we present the empirical results on benchmark datasets and analyze the results from different perspectives. In Sect. 5, we show the results of employing this framework on large-scale fraud detection tasks. Finally, the conclusions are drawn in Sect. 6.
2 Related Work
Designing new effective models and fully exploiting their potential is regarded as a crucial task for the machine learning community. Once a model is designed, much effort is usually spent on fully exploiting its capabilities. In this section, we briefly review two typical strategies, i.e., hyperparameter fine-tuning and ensemble techniques.
Given a pre-designed model, a typical strategy to improve the performance is hyperparameter selection, which is an important topic of automated machine learning [13]. The optimization (selection) of the hyperparameters can naturally be regarded as a black-box (derivative-free) optimization problem. To handle this problem, several methods have been proposed, such as genetic algorithms [10], Bayesian optimization [8], cross-entropy methods [7], optimistic optimization methods [19] and classification based methods [31], and for many tasks these strategies can efficiently find a good hyperparameter combination. As a prominent approach to derivative-free optimization, the classification-based optimization method learns a classifier to model the search space and has shown impressive performance in various applications [12,30,31]. However, for large-scale tasks, evaluating the performance of a single hyperparameter combination takes much time, so a typical hyperparameter optimization process may be too time-consuming and resource-consuming, which hinders its application in industrial tasks.
Ensemble learning [26,39] is an appealing machine learning paradigm that has been widely used in many fields. Typically, multiple base learners are trained following different strategies such as bagging [1], boosting [9], stacking [27] and others [40], and all of these base learners are then combined for prediction. Some researchers [17,18,41] show that a better ensemble can be generated by selecting some instead of all of the available component learners, which has led to the study of ensemble pruning techniques. Instead of using all available learners, ensemble pruning selects a subset of the component learners, and it has been shown to achieve better performance than the whole ensemble [41]. Multiple strategies have been proposed for the ensemble pruning problem, such as ordering based methods [18] and optimization based methods [14]. However, these methods typically perform pruning by optimizing a single objective, while real-world tasks often need to take multiple metrics into consideration simultaneously.
In this work, we follow the ensemble pruning paradigm and handle the pruning procedure with an optimization based method. We propose to use a classification based optimization method in the ensemble pruning procedure, which is actually a weight optimization task. Besides, we address the multi-metric optimization problem and design the framework to optimize multiple metrics simultaneously and efficiently.
3 The Proposed Framework
3.1 Problem Statement
In this work, we consider the binary classification problem and aim to improve the performance of the whole framework with a pre-designed model. Let $\mathcal{X} = \mathbb{R}^d$ denote the feature space and $\mathcal{Y} = \{-1, +1\}$ the label space. We are given a training set with $p$ samples $D_{tr} = \{(x_1, y_1), \ldots, (x_p, y_p)\}$ and a validation set with $q$ samples $D_v = \{(x_1, y_1), \ldots, (x_q, y_q)\}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. The goal is to learn a mapping function $H : \mathcal{X} \to \mathcal{Y}$ that performs as well as possible on future samples.
Given a pre-designed model, a typical procedure to get the mapping function $H(x)$ is to fine-tune the hyperparameters of this model and select the best-performing one. However, the typical multi-round selection procedure becomes too time-consuming and resource-consuming for large-scale tasks, which hinders hyperparameter selection. Another effective choice is to ensemble multiple diversified component models to get the mapping function, i.e., $H(x) = \sum_{i=1}^{M} w_i h_i(x)$, where $h_i(x)$ is a single model and $w_i$ is its corresponding weight. However, the whole model may be too cumbersome, and the computational cost of prediction may be prohibitive in large-scale scenarios.
3.2 Overview of the Proposed Framework
In our proposed framework, we follow the ensemble approach to construct the mapping function: a set of $M$ component models $\{h_1, \ldots, h_M\}$ is first trained, and the corresponding weights $\{w_1, \ldots, w_M\}$ are learned to combine these component models, usually under the constraints $w_i \ge 0$ and $\sum_{i=1}^{M} w_i = 1$. In contrast to the typical ensemble strategy, we employ ensemble pruning to select only a small subset of the component models, i.e., the number of non-zero elements in $\{w_1, \ldots, w_M\}$ is much smaller than $M$, so that the whole model is much smaller.
To effectively learn proper weights, we regard this problem as a derivative-free optimization problem and propose to employ a classification based method, i.e., a random coordinate shrinking model, to perform the ensemble pruning procedure. Besides, a calibration strategy is applied during the ensemble pruning (i.e., weight optimization) procedure to ensure that the performance is consistent between the selection stage and the deployment stage. To meet the requirement that multiple metrics may be of concern simultaneously, we extend our method to take multiple metrics into consideration. The optimization task becomes multi-objective in this setting, and traditional optimization methods cannot be applied directly. In our framework, we further improve the derivative-free optimization method to handle the weight optimization problem in this scenario.
In the following sections, we first explain the details of handling the ensemble pruning task with the classification based optimization method. Then, we extend the problem setting to meet the multi-metric optimization requirement and elaborate the corresponding optimization strategy.
3.3 Ensemble Pruning with Classification Based Optimization
As a powerful learning paradigm, ensemble learning [26,39] is appealing in many domains, in both academia and industry [5,35]. In most ensemble applications, all of the component learners are combined into the final model, while ensemble pruning [17,39] tries to select a subset of them to construct the final model, and it has been shown that the generalization performance of the pruned ensemble can be even better than that of the ensemble of all component models [18,41]. Furthermore, pruned ensembles are smaller, which reduces the required storage space and speeds up prediction. These advantages strongly encourage us to employ an ensemble pruning strategy in our framework.
Formally speaking, in ensemble learning scenarios, given the training set $D_{tr}$, a set of $M$ base learners $\{h_1, \ldots, h_M\}$ is first trained, and the corresponding weights $\{w_1, \ldots, w_M\}$ are optimized using the validation set $D_v$ to combine these component classifiers into the final model $H$. Thus the final prediction for a sample $x$ can be obtained by
$$H(x) = \sum_{i=1}^{M} w_i h_i(x). \tag{1}$$
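As a minimal sketch of Eq. (1), the weighted combination can be computed from cached base-learner scores. The function name `ensemble_predict` and the nested-list layout of the predictions are assumptions made for illustration; the numbers are made up.

```python
from typing import List, Sequence


def ensemble_predict(weights: Sequence[float],
                     base_predictions: Sequence[Sequence[float]]) -> List[float]:
    """Return H(x_n) = sum_i w_i * h_i(x_n) for every sample n.

    base_predictions[i][n] is the cached score h_i(x_n) of base learner i
    (a hypothetical layout chosen for this sketch).
    """
    if len(weights) != len(base_predictions):
        raise ValueError("one weight per base learner is required")
    n_samples = len(base_predictions[0])
    combined = [0.0] * n_samples
    for w, preds in zip(weights, base_predictions):
        if w == 0.0:  # learners with zero weight contribute nothing
            continue
        for n, p in enumerate(preds):
            combined[n] += w * p
    return combined


# Three base learners scored on two validation samples (made-up numbers).
preds = [[0.9, 0.2], [0.6, 0.4], [0.1, 0.8]]
scores = ensemble_predict([0.5, 0.5, 0.0], preds)
```

Because the third weight is zero, the third learner's scores are never read, which is the saving that pruning is meant to provide at prediction time.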
It is often assumed that the weights are non-negative and sum to 1, i.e., $w_i \ge 0$ and $\sum_{i=1}^{M} w_i = 1$. When applying ensemble pruning, the number of selected component models is much smaller than the number $M$ of all trained models, i.e.,
$$\sum_{i=1}^{M} \mathbb{1}(w_i > 0) < M, \tag{2}$$
in which $\mathbb{1}(\cdot)$ is the indicator function, which is 1 if the inner expression is true and 0 otherwise. To learn the weights of the component models, the validation set $D_v$ is usually used, and a specific evaluation metric (such as AUC) is optimized. The metric score $\mathcal{J}$ is computed over all validation samples. As an example, the accuracy can be calculated by
$$\mathcal{J} = \frac{1}{q} \sum_{i=1}^{q} \mathbb{1}\big(\mathrm{sign}(H(x_i), \theta) = y_i\big), \tag{3}$$
in which $H(x)$ is the model prediction for sample $x$ calculated as above, $q$ is the size of the validation set, $y_i$ is the ground-truth label of sample $x_i$, and $\mathrm{sign}(H(x_i), \theta)$ is $1$ if $H(x_i) \ge \theta$ and $-1$ otherwise. Similarly, the AUC (Area Under ROC Curve) score can be calculated by
$$\mathcal{J} = \frac{1}{q^+ q^-} \sum_{x^+ \in D_v^+} \sum_{x^- \in D_v^-} \Big[ \mathbb{1}\big(H(x^+) > H(x^-)\big) + \frac{1}{2}\, \mathbb{1}\big(H(x^+) = H(x^-)\big) \Big], \tag{4}$$
in which $q^+$ and $q^-$ denote the numbers of positive and negative samples respectively, and $D_v^+$ and $D_v^-$ are the sets of positive and negative samples in the validation data. Let $\mathbf{w} = [w_1, \ldots, w_M]^T$ denote the weight vector. The optimization procedure tries to find the best $\mathbf{w}$ to minimize/maximize the objective, as shown in Eq. (5),
$$\begin{aligned}
\min_{\mathbf{w}} / \max_{\mathbf{w}} \quad & \mathcal{J}(\mathbf{w}) \\
\text{subject to} \quad & \sum_{i=1}^{M} w_i = 1, \quad \sum_{i=1}^{M} \mathbb{1}(w_i > 0) \le k, \\
& w_i \ge 0, \ \text{where } 1 \le i \le M
\end{aligned} \tag{5}$$
where $k$ is the maximal number of selected component learners, which is much smaller than $M$, and $\mathcal{J}(\mathbf{w})$ is a specific optimization metric (such as AUC) with $\mathbf{w}$ as the optimization parameters.
In this work, to efficiently learn the weights $\mathbf{w}$ of the component models, we regard the optimization problem as a derivative-free optimization problem and propose to employ a classification based method named RACOS, which we adapt so that the constraints of our problem are satisfied. RACOS is a recently proposed classification-based optimization method [30,31], and it has been shown to be effective in various optimization tasks [12,24,25]. The basic idea is to separate good solutions from bad ones and to learn a classification model that identifies the good region, so that new solutions can be sampled effectively.
One issue that needs to be addressed is that the solutions of this problem are subject to strict constraints, i.e., only a portion of the component models have non-zero weights (as shown in Eq. (5)), which is not addressed in previous studies. For example, in [41], the weights of the component models are optimized during the selection procedure, and the finally retained models are those with larger weights (which are rescaled so that they sum to 1), while the remaining models are dropped. This strategy results in an inconsistency between the selection and deployment procedures, because the weights change. In our framework, a calibration process is conducted on each generated solution before it is evaluated, so that the obtained performance scores are strictly consistent across the stages. Denoting the $k$-th largest value of all weights in $\mathbf{w}$ as $\alpha$, the Calibration procedure first sets the smaller weights to zero by Eq. (6) and then normalizes the weights by Eq. (7), so that they sum to 1.
$$w_i = \begin{cases} 0, & \text{if } w_i < \alpha \\ w_i, & \text{if } w_i \ge \alpha \end{cases} \tag{6}$$
$$w_i = \frac{w_i}{\sum_{w_j \in \mathbf{w}} w_j} \tag{7}$$
Algorithm 1. Adapted RACOS
Require:
  $\mathcal{J}$: Objective function to be optimized;
  $\mathcal{C}$: A binary classification algorithm;
  $\lambda$: Balancing parameter;
  $T \in \mathbb{N}^+$: Number of iterations;
  $m \in \mathbb{N}^+$: Number of solutions in each iteration;
  $k \in \mathbb{N}^+$ ($\le m$): Number of positive solutions;
  Sampling: Sample new solutions;
  Selection: Decide the positive/negative solutions;
  Calibration: Calibrate the solutions to satisfy the constraints;
Ensure: the best solution $\hat{\mathbf{w}}$
1: Collect $S_0 = \{\mathbf{w}^1, \cdots, \mathbf{w}^m\}$ by i.i.d. sampling $m$ solutions
2: $B_0 = \{(\mathbf{w}^1, z^1), \cdots, (\mathbf{w}^m, z^m)\}$, $\forall \mathbf{w}^i \in S_0$: $z^i = \mathcal{J}(\mathrm{Calibration}(\mathbf{w}^i))$
3: for $t = 1$ to $T$ do
4:   $(D_t^+, D_t^-) = \mathrm{Selection}(B_{t-1}, k)$
5:   $B_t = B_{t-1}$
6:   for $i = 1$ to $m$ do
7:     $f_t = \mathcal{C}(D_t^+, D_t^-)$
8:     $\mathbf{w}^i = \mathrm{Sampling}(f_t, \lambda)$
9:     $z^i = \mathcal{J}(\mathrm{Calibration}(\mathbf{w}^i))$ and let $B_t = B_t \cup (\mathbf{w}^i, z^i)$
10:   end for
11: end for
12: $(\tilde{\mathbf{w}}, \tilde{z}) = \arg\min_{(\mathbf{w}, z) \in B_T} z$, and $\hat{\mathbf{w}} = \mathrm{Calibration}(\tilde{\mathbf{w}})$
13: return $\hat{\mathbf{w}}$
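The Calibration step of Eq. (6) and Eq. (7), together with the AUC objective of Eq. (4), can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the nested-list layout of the cached predictions, and the quadratic-time pairwise AUC are choices made here for clarity.

```python
from typing import List, Sequence


def calibrate(weights: Sequence[float], k: int) -> List[float]:
    """Eq. (6)-(7): zero every weight below the k-th largest, then renormalize.

    Weights tied with the k-th largest are all kept, as Eq. (6) keeps w_i >= alpha.
    """
    alpha = sorted(weights, reverse=True)[k - 1]  # k-th largest weight
    kept = [w if w >= alpha else 0.0 for w in weights]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("at least one weight must be positive")
    return [w / total for w in kept]


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Eq. (4): pairwise AUC with ties counted as 1/2; labels are +1 / -1."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    total = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(pos) * len(neg))


def objective(weights: Sequence[float],
              base_predictions: Sequence[Sequence[float]],
              labels: Sequence[int],
              k: int) -> float:
    """J(Calibration(w)): score the calibrated ensemble on cached predictions.

    base_predictions[i][n] is an assumed cache of h_i(x_n) on the validation set.
    """
    w = calibrate(weights, k)
    combined = [sum(w[i] * base_predictions[i][n] for i in range(len(w)))
                for n in range(len(labels))]
    return auc(combined, labels)
```

Scoring the calibrated vector rather than the raw one is what keeps the value seen during selection identical to the value of the ensemble that is eventually deployed.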
Algorithm 1 presents the pseudo-code of the adapted RACOS, which is slightly modified from the original RACOS algorithm [31] and is used to optimize the weights of the base models. Roughly speaking, before the weight optimization process starts, $M$ base models $\{h_1, \ldots, h_M\}$ are trained, and their predictions on the validation set are obtained. The goal is to find the best weight combination $\hat{\mathbf{w}}$ that maximizes (or minimizes) the evaluation metric $\mathcal{J}$.
More specifically, the initial solution set $S_0$ is generated by uniform sampling. Here, a solution $\mathbf{w}^i \in \mathbb{R}^M$ represents a weight combination, whose element $w^i_j$ denotes the weight of the $j$-th base learner $h_j$. The solutions are then calibrated following Eq. (6) and Eq. (7) to satisfy the constraints in Eq. (5), and the final prediction $H(x)$ for each sample $x$ in the validation set is obtained as the weighted average of the predictions of all base models, i.e., $H(x) = \sum_{i=1}^{M} w_i h_i(x)$. Evaluation is performed on the validation set to get the metric score $z^i$ of the objective function $\mathcal{J}$ for each solution, and a set $B_0$ of solution-score tuples is obtained (line 2).
An iterative procedure is then performed to learn the classification model and to sample and evaluate new solutions. In line 4, the Selection procedure splits the solutions into a good set $D_t^+$ and a bad set $D_t^-$ according to the evaluation scores $z^i$: a solution is labeled 1 if it is among the best $k$ solutions and –1 otherwise. In lines 6 to 10, $m$ new solutions are generated and evaluated. Using classifier $\mathcal{C}$, a model that distinguishes a randomly chosen good solution from the bad ones is constructed (line 7), and a new solution (weight combination) is generated by the Sampling procedure in line 8. The parameter $\lambda$ balances the Sampling procedure: the new solution is generated from the learned model with probability $\lambda$ and sampled uniformly from the whole feasible space with probability $1 - \lambda$. The steps in lines 6 to 10 can be performed in parallel to improve efficiency. The best evaluated solution $\tilde{\mathbf{w}}$ is selected and calibrated to $\hat{\mathbf{w}}$ (line 12), which is finally returned as the selected weight combination.
With the above method, the troublesome weight optimization problem can be solved efficiently. At the same time, the performance remains consistent between the selection and deployment stages. The final model is obtained by keeping the models with non-zero weights together with their weights.
3.4 Multi-Metric Consideration and Its Optimization
In real-world tasks, we often need to optimize multiple metrics to meet business requirements, which is not addressed by typical ensemble pruning methods. In this work, we further extend the framework to take multiple metrics into consideration when performing the weight optimization procedure. Taking two metrics as an example, if we denote the original objective in Eq. (5) as $\mathcal{J}_o(\mathbf{w})$ and the other objective as $\mathcal{J}_a(\mathbf{w})$, the problem can be formulated as Eq. (8),
$$\begin{aligned}
\min_{\mathbf{w}} / \max_{\mathbf{w}} \quad & \mathcal{J}_o(\mathbf{w}),\ \mathcal{J}_a(\mathbf{w}) \\
\text{subject to} \quad & \sum_{i=1}^{M} w_i = 1, \quad \sum_{i=1}^{M} \mathbb{1}(w_i > 0) \le k, \\
& w_i \ge 0, \ \text{where } 1 \le i \le M
\end{aligned} \tag{8}$$
Although the optimization problem in Eq. (8) looks very similar to that in Eq. (5), it is much more complicated, and traditional optimization techniques cannot be applied directly. In our framework, we improve the RACOS method to handle the multi-objective optimization problem. The adapted RACOS algorithm is designed to optimize one objective, and its Selection procedure is based on sorting the solutions by their evaluation scores. However, the problem in Eq. (8) has multiple objectives $\mathcal{J}_o$ and $\mathcal{J}_a$, and no total order can be built for the solutions because of the multiple ordering criteria. In this work, we therefore propose an improved method to handle the problem in Eq. (8).
Algorithm 2. RAM
Require:
  $\mathcal{J}_o$, $\mathcal{J}_a$: Objective functions to be optimized;
  $\mathcal{C}$: A binary classification algorithm;
  $\lambda$: Balancing parameter;
  $T \in \mathbb{N}^+$: Number of iterations;
  $m \in \mathbb{N}^+$: Number of solutions in each iteration;
  Sampling: Sample new solutions;
  Selection: Decide the positive/negative solutions;
  Calibration: Calibrate the solutions to satisfy the constraints;
Ensure: the selected solution $\hat{\mathbf{w}}$
1: Collect $S_0 = \{\mathbf{w}^1, \cdots, \mathbf{w}^m\}$ by i.i.d. sampling $m$ solutions
2: $B_0 = \{(\mathbf{w}^1, z_o^1, z_a^1), \cdots, (\mathbf{w}^m, z_o^m, z_a^m)\}$, $\forall \mathbf{w}^i \in S_0$: $z_o^i = \mathcal{J}_o(\mathrm{Calibration}(\mathbf{w}^i))$, $z_a^i = \mathcal{J}_a(\mathrm{Calibration}(\mathbf{w}^i))$
3: for $t = 1$ to $T$ do
4:   $(D_t^+, D_t^-) = \mathrm{Selection}(B_{t-1})$
5:   $B_t = B_{t-1}$
6:   for $i = 1$ to $m$ do
7:     $f_t = \mathcal{C}(D_t^+, D_t^-)$
8:     $\mathbf{w}^i = \mathrm{Sampling}(f_t, \lambda)$
9:     $z_o^i = \mathcal{J}_o(\mathrm{Calibration}(\mathbf{w}^i))$, $z_a^i = \mathcal{J}_a(\mathrm{Calibration}(\mathbf{w}^i))$ and let $B_t = B_t \cup (\mathbf{w}^i, z_o^i, z_a^i)$
10:   end for
11: end for
12: $(D^+, D^-) = \mathrm{Selection}(B_T)$, sample one $\mathbf{w}$ from $D^+$ and let $\hat{\mathbf{w}} = \mathrm{Calibration}(\mathbf{w})$
13: return $\hat{\mathbf{w}}$
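The Selection step in lines 4 and 12 of Algorithm 2 uses the non-dominated solutions as the positive set. A minimal sketch of such a split is given below. It assumes that both metrics are to be maximized (as with AUC and AP) and uses a simple quadratic scan; the tuple layout and function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# (weight vector, J_o score, J_a score); an assumed layout for this sketch.
Scored = Tuple[Sequence[float], float, float]


def dominates(a: Scored, b: Scored) -> bool:
    """True if a is at least as good as b on both metrics and better on one.

    Assumes both metrics are maximized.
    """
    return a[1] >= b[1] and a[2] >= b[2] and (a[1] > b[1] or a[2] > b[2])


def selection(evaluated: Sequence[Scored]) -> Tuple[List[Scored], List[Scored]]:
    """Split evaluated solutions into (non-dominated, dominated) lists."""
    positive: List[Scored] = []
    negative: List[Scored] = []
    for cand in evaluated:
        if any(dominates(other, cand) for other in evaluated):
            negative.append(cand)
        else:
            positive.append(cand)
    return positive, negative


# Made-up scores: the third solution is dominated by the second one.
B = [((1.0, 0.0), 0.90, 0.60), ((0.0, 1.0), 0.85, 0.70), ((0.5, 0.5), 0.84, 0.65)]
pos, neg = selection(B)
```

Unlike the top-k rule of Algorithm 1, the size of the positive set is decided by the data, so no parameter for the number of positive solutions is required.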
We name the proposed method RAM, and the pseudo-code is shown in Algorithm 2. In RAM, we aim to optimize the weights $\mathbf{w}$ of the base learners as shown in Eq. (8). Two different objectives $\mathcal{J}_o$ and $\mathcal{J}_a$ are considered, and a Calibration procedure is employed so that the evaluated solutions (weight combinations) strictly satisfy the constraints. Furthermore, the number of selected positive solutions no longer needs to be specified.
Note that in Algorithm 1, the Selection procedure can be conducted simply by sorting the solutions according to their objective scores, while in Algorithm 2 no total order can be built for the solutions, since two objective scores are involved; thus we cannot just sort the solutions to get the top-performing ones. In Algorithm 2, we construct the non-dominated solution set as the positive set. Here, a solution is called non-dominated if neither of its objective scores can be improved without degrading the other. After the Selection procedure, the construction of the model $f_t$ and the Sampling procedure are performed as in Algorithm 1. Note that the auxiliary evaluation metric is only used in the Selection procedure and does not influence the learning of $f_t$. We only need to sample a positive solution from the positive (non-dominated) set and learn an axis-parallel box model that distinguishes the positive solution from the negative ones; the objective scores are not needed in this process. At the end of the algorithm, we can simply return a calibrated solution sampled from the final positive set, or manually select one solution if necessary.
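To make the axis-parallel box idea more concrete, the sketch below shrinks a box around one randomly chosen positive solution until the negative solutions fall outside, and then samples from it. It is a simplified illustration of random coordinate shrinking and not the RACOS reference implementation: the function name, the iteration cap `max_shrinks`, and the use of the unit cube as search space are assumptions.

```python
import random
from typing import List, Sequence


def sample_solution(positives: Sequence[Sequence[float]],
                    negatives: Sequence[Sequence[float]],
                    lam: float,
                    rng: random.Random,
                    max_shrinks: int = 1000) -> List[float]:
    """Draw one (uncalibrated) weight vector from [0, 1]^M.

    With probability lam, sample from an axis-parallel box that contains one
    randomly chosen positive solution and excludes the negatives; otherwise
    sample uniformly from the whole cube.
    """
    dim = len(positives[0])
    lower, upper = [0.0] * dim, [1.0] * dim
    if rng.random() < lam:
        anchor = rng.choice(list(positives))
        inside = [list(n) for n in negatives]
        for _ in range(max_shrinks):  # safety cap (an assumption of this sketch)
            if not inside:
                break
            j = rng.randrange(dim)    # random coordinate to shrink
            neg = rng.choice(inside)
            if anchor[j] > neg[j]:
                lower[j] = max(lower[j], rng.uniform(neg[j], anchor[j]))
            elif anchor[j] < neg[j]:
                upper[j] = min(upper[j], rng.uniform(anchor[j], neg[j]))
            # Keep only the negatives that still lie inside the box.
            inside = [n for n in inside
                      if all(lower[d] <= n[d] <= upper[d] for d in range(dim))]
    return [rng.uniform(lower[d], upper[d]) for d in range(dim)]
```

The returned vector is neither sparse nor normalized; in the framework described above it would be passed through the Calibration procedure before being evaluated.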
Algorithm 3. The Whole Framework of the Proposed Method
Require:
  $D_{tr}$: The training set;
  $D_v$: The validation set;
  $\mathcal{L}$: The specific learner;
  $M$: Number of all learned base learners;
  $\Theta$: Parameters for adapted RACOS or RAM;
Ensure: the selected model-weight pairs $\mathcal{M}$.
1: for $i = 1$ to $M$ do
2:   Train base model $h_i = \mathcal{L}(D_{tr})$
3:   Get prediction $p_i$ on validation set $D_v$
4: end for
5: Optimize to get the weights $\hat{\mathbf{w}}$ using Algorithm 1 (or 2), with parameters $\Theta$
6: Construct $\mathcal{M} = \{(h_i, w_i)\}_{w_i \in \hat{\mathbf{w}},\, w_i > 0}$
7: return $\mathcal{M}$
To provide an overview of the whole process, we summarize the framework in Algorithm 3. Given the training set $D_{tr}$, the validation set $D_v$ and the specific learner $\mathcal{L}$, the goal is to construct a final model that consists of at most $k$ weighted learners selected from $M$ base models, so that robust and effective performance is obtained with respect to all objectives of interest. First, $M$ base learners with different parameters are built on the training set $D_{tr}$; strategies such as bootstrap sampling can be employed to improve the diversity of the base models. After that, the predictions of these $M$ base models on the validation samples are obtained, which are used in the weight optimization procedure. Concretely, Algorithm 1 or Algorithm 2 learns a proper weight combination $\mathbf{w}$ of length $M$, in which each dimension $w_i$ denotes the weight of the $i$-th base model. To get the objective scores, we compute the weighted predictions (the weighted combination of the predictions of the base models) on the validation set and evaluate them. Since the base predictions are obtained beforehand, the evaluation can be conducted efficiently. Once the optimized weights $\hat{\mathbf{w}}$ are obtained from Algorithm 1 or Algorithm 2, we select and save the component models whose weights are non-zero (along with their weights) and return them as the final model $\mathcal{M}$. In the deployment and prediction stage, given a new sample, we get the predictions of the saved component models and take their weighted average as the final prediction.
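The last two steps of Algorithm 3 and the deployment stage can be sketched as follows. The constant-valued lambdas are toy stand-ins for trained base learners, and the helper names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Model = Callable[[Sequence[float]], float]


def build_pruned_model(models: Sequence[Model],
                       weights: Sequence[float]) -> List[Tuple[Model, float]]:
    """Line 6 of Algorithm 3: keep only the (model, weight) pairs with w_i > 0."""
    return [(h, w) for h, w in zip(models, weights) if w > 0.0]


def predict(pruned: Sequence[Tuple[Model, float]], x: Sequence[float]) -> float:
    """Weighted average of the retained models' predictions for one sample."""
    return sum(w * h(x) for h, w in pruned)


# Toy stand-ins for trained base learners (hypothetical constant scorers).
models = [lambda x: 0.9, lambda x: 0.5, lambda x: 0.1]
pruned = build_pruned_model(models, [0.75, 0.0, 0.25])
score = predict(pruned, [1.0, 2.0])
```

Only the retained models need to be stored and called, which is where the storage and prediction-time savings of the pruned ensemble come from.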
4 Empirical Results
4.1 Compared Methods
We use XGBoost as the base learner, since it is one of the best candidates for most tasks [5,32] and thus a strong baseline for comparison. A default setting of XGBoost (denoted as XGBoost-d) is used to train the baseline model: the number of rounds is 100, the subsample ratio is 0.6, the maximum tree depth is 5, and the remaining parameters keep their default values. To get the fine-tuned model, the number of rounds is selected from {80, 90, 100, 110, 120}, the learning rate from {0.1, 0.2, 0.3}, the subsample ratio from {0.6, 0.7, 0.8, 0.9, 1}, and the maximum depth from {4, 5, 6, 7}, with the other parameters at their defaults. Grid search is performed over these 300 parameter combinations, and the model with the best AUC (Area Under ROC Curve) on the validation set (denoted as XGBoost-t) is also selected for comparison.
For our methods, 10 base models with different parameters, each randomly sampled from the aforementioned lists, are first trained as the component models; then at most 3 of them are selected to construct the final model. We first employ the adapted RACOS (denoted as ARC) to perform the model selection procedure with respect to the AUC score on the validation set. Besides, to perform RAM, we use AUC and AP (Average Precision) as the objectives in Eq. (8). The average ensemble of all 10 models (denoted as Ensemble-a) is also compared.

Table 1. The information of the benchmark datasets
Dataset       |Dtr|      |Dv|       |Dte|      Dim
Magic         12172      3044       3804       10
Nomao         22057      5515       6893       118
Bank          28934      7234       9043       51
Electricity   28999      7250       9063       8
Aps           48000      12000      16000      170
Commercial    82998      20750      25937      4126
Miniboone     83240      20811      26013      50
Epsilon       300000     100000     100000     2000
Susy          3000000    1000000    1000000    18
Higgs         5000000    3000000    3000000    28
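The search space of Sect. 4.1 can be written down directly, which also confirms the count of 300 grid points. The sketch below is illustrative only: the dictionary keys are descriptive names chosen here and are not necessarily the XGBoost argument names, and the seed is arbitrary.

```python
import itertools
import random

# Candidate values as listed in Sect. 4.1; the keys are illustrative names.
SEARCH_SPACE = {
    "num_rounds": [80, 90, 100, 110, 120],
    "learning_rate": [0.1, 0.2, 0.3],
    "subsample": [0.6, 0.7, 0.8, 0.9, 1.0],
    "max_depth": [4, 5, 6, 7],
}

# Grid search (XGBoost-t) visits every combination: 5 * 3 * 5 * 4 = 300.
grid = [dict(zip(SEARCH_SPACE, values))
        for values in itertools.product(*SEARCH_SPACE.values())]


def sample_configs(n_models: int, seed: int = 0):
    """Draw each hyperparameter independently for every component model."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
            for _ in range(n_models)]


# Ten randomly configured component models, as used by ARC, RAM and Ensemble-a.
component_configs = sample_configs(10)
```

The contrast in cost is visible here: the grid requires 300 training runs to pick one model, whereas the pruning framework trains only 10 component models.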
4.2 Experiments on Benchmark Datasets
We first perform experiments on 10 benchmark datasets, which come from the UCI (https://archive.ics.uci.edu) or OpenML (https://www.openml.org) repositories. The detailed statistics of the benchmark datasets are shown in Table 1. The numbers of samples in the training, validation and test sets are denoted as $|D_{tr}|$, $|D_v|$ and $|D_{te}|$. The last column shows the dimension of each dataset. As we can see, the size of the training data ranges from about ten thousand to five million, and the dimension ranges from eight to more than four thousand.
To evaluate the performance, three metrics are calculated: AUC (Area Under ROC Curve), AP (Average Precision) and F1 (F1 Score). The results (mean ± standard deviation) are shown in Table 2. For each dataset, the experiments are repeated 5 times with different data splits, and the reported results are obtained by averaging over these runs. The best results for each metric are marked in bold. In addition, if a method performs significantly worse than the best one according to a t-test at the 5% significance level, the corresponding entry is marked with a bullet.
Evaluation Results. As we can see from Table 2, the proposed methods ARC and RAM yield substantial improvements over all other methods. Whichever metric is considered, the proposed methods achieve a better score on most of the datasets, which validates the effectiveness of the ensemble pruning schemes.
There is no doubt that the fine-tuned model can improve the performance over the default model. However, the fine-tuned model is obtained by selecting from 300 parameter combinations, which is severely time-consuming, and the improvement may be small. Even though the model is fine-tuned, it cannot beat the proposed method on almost all datasets.
One interesting phenomenon is that the average ensemble of all component models does not improve the metrics on many datasets, even compared with the default single model. For example, the Ensemble-a model performs worst on the Nomao and Higgs datasets with regard to all assessed metrics.
One possible reason is that the parameters of the component base learners are randomly sampled from the parameter lists, and a component model that performs unsatisfactorily degrades the whole ensemble when simple averaging is used. In contrast, the proposed method is quite robust to such components: since the selected models are optimized, the bad models are filtered out, and the final model is not influenced by the unsatisfactory base models.
Influence of Multi-Metric Consideration. Notably, comparing RAM with ARC, we find that RAM performs better and more robustly than ARC across the metrics in most cases, which demonstrates the effect of the introduced auxiliary metric. When RAM is not the best-performing method, its performance is rarely significantly worse than the best one.
Besides, the proposed method is pretty efficient, since all of the component models can be simultaneously trained without fine-tuning for the component
Table 2. The AUC, AP and F1 scores of the compared methods on benchmark datasets.

Dataset     Method      AUC          AP           F1
Magic       XGBoost-d   93.12±0.00•  95.28±0.00•  90.87±0.00•
            XGBoost-t   93.67±0.00•  95.68±0.00•  91.29±0.00
            Ensemble-a  93.23±0.05•  95.42±0.06•  91.14±0.05•
            ARC         93.73±0.12•  95.72±0.11•  91.34±0.09•
            RAM         93.82±0.08   95.81±0.08   91.40±0.13
Nomao       XGBoost-d   99.55±0.00•  99.82±0.00•  98.02±0.00•
            XGBoost-t   99.61±0.00•  99.84±0.00•  98.11±0.00•
            Ensemble-a  99.52±0.02•  99.81±0.01•  97.92±0.05•
            ARC         99.64±0.01   99.85±0.00   98.19±0.05
            RAM         99.63±0.01   99.85±0.00   98.21±0.06
Bank        XGBoost-d   92.72±0.00•  61.41±0.00•  62.35±0.00•
            XGBoost-t   93.37±0.00•  63.90±0.00•  64.16±0.00
            Ensemble-a  93.35±0.07•  61.39±0.34•  62.84±0.29•
            ARC         93.43±0.06•  64.14±0.25•  64.20±0.29•
            RAM         93.48±0.05   64.20±0.26   64.26±0.31
Electricity XGBoost-d   95.78±0.00•  94.39±0.00•  87.24±0.00•
            XGBoost-t   97.08±0.00   96.09±0.00•  89.70±0.00
            Ensemble-a  96.03±0.19•  94.63±0.24•  87.70±0.35•
            ARC         97.14±0.23   96.23±0.38   89.67±0.75
            RAM         97.09±0.35   96.21±0.43   89.71±0.78
Aps         XGBoost-d   99.39±0.00•  91.70±0.00•  85.53±0.00•
            XGBoost-t   99.53±0.00•  92.08±0.00•  85.62±0.00•
            Ensemble-a  99.61±0.05   91.99±0.40•  85.35±0.52•
            ARC         99.55±0.02   92.98±0.21•  87.61±0.54•
            RAM         99.55±0.04   93.05±0.22   87.66±0.52
Commercial  XGBoost-d   97.91±0.00•  98.65±0.00•  94.85±0.00•
            XGBoost-t   98.14±0.00•  99.05±0.00•  96.00±0.00
            Ensemble-a  98.04±0.16•  98.75±0.11•  94.92±0.23•
            ARC         98.35±0.18•  99.06±0.09   95.87±0.26
            RAM         98.66±0.15   99.14±0.10   96.01±0.31
Miniboone   XGBoost-d   98.29±0.00•  95.33±0.00•  89.66±0.00•
            XGBoost-t   98.36±0.00•  95.49±0.00•  89.89±0.00•
            Ensemble-a  98.44±0.02•  95.77±0.05   89.95±0.10•
            ARC         98.44±0.02•  95.77±0.04   90.21±0.09•
            RAM         98.49±0.02   95.82±0.04   90.38±0.12
Epsilon     XGBoost-d   93.10±0.00•  93.14±0.00•  85.46±0.00•
            XGBoost-t   93.57±0.00•  93.60±0.00•  86.09±0.00•
            Ensemble-a  93.86±0.12•  93.91±0.13•  86.50±0.12•
            ARC         94.17±0.11•  94.22±0.11•  86.90±0.15•
            RAM         94.21±0.09   94.25±0.08   86.93±0.14
Susy        XGBoost-d   87.57±0.00•  87.96±0.00•  77.81±0.00•
            XGBoost-t   87.65±0.00   88.03±0.00•  77.90±0.00
            Ensemble-a  87.57±0.02•  87.97±0.03•  77.84±0.02•
            ARC         87.67±0.01   88.05±0.01   77.91±0.01
            RAM         87.67±0.01   88.06±0.01   77.91±0.01
Higgs       XGBoost-d   82.24±0.00•  83.72±0.00•  77.16±0.00•
            XGBoost-t   82.51±0.00   83.95±0.00•  77.34±0.00
            Ensemble-a  81.79±0.17•  83.27±0.16•  76.88±0.11•
            ARC         82.67±0.34   84.11±0.32   77.45±0.25
            RAM         82.68±0.33   84.13±0.31   77.46±0.23
models; and performance is robustly better than that of the baseline models under evaluation metrics from different perspectives. Time and Resource Complexity. It is clear that the time and resource complexity are low when using the default configuration, since only one model is trained and stored. However, the performance of the default configuration is far from satisfactory, as shown in Table 2. Fine-tuning can improve the performance with the same resource complexity, but the time consumption may be extremely high, which hinders its use. As an example, in our experiments, 300 parameter combinations are trained and evaluated. Even if we train 10 different models in parallel in each round, 30 rounds are needed. As for Ensemble-a and our proposed framework, 10 base models are trained in parallel at once, so the time complexity does not become an obstacle. However, for the Ensemble-a method, 10 models need to be saved and used in deployment, which increases the resource complexity and computation cost of deployment, while in our proposed methods only 3 models are saved and used, which effectively alleviates this problem. Additionally, our methods provide more competitive performance. In short, with slightly more time and resource consumption, our method robustly provides better performance, which makes it a good choice.
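The round and storage counts quoted above follow from simple arithmetic. The sketch below only restates the figures given in the text (300 configurations, 10 models trained in parallel, 3 models kept after pruning); the helper name `training_rounds`, the strategy labels and the simplification that every round costs the same amount of time are illustrative assumptions, not part of the reported experiments.

```python
import math

def training_rounds(num_models: int, parallel_slots: int) -> int:
    """Sequential rounds needed to train `num_models` models when
    `parallel_slots` models can be trained at the same time."""
    return math.ceil(num_models / parallel_slots)

PARALLEL_SLOTS = 10  # models trained in parallel per round, as in the text

# Figures taken from the text; the dictionary layout is hypothetical.
strategies = {
    "XGBoost-d (default)": {"trained": 1, "stored": 1},
    "XGBoost-t (fine-tuned)": {"trained": 300, "stored": 1},
    "Ensemble-a": {"trained": 10, "stored": 10},
    "ARC / RAM (pruned)": {"trained": 10, "stored": 3},
}

# Map each strategy to (training rounds, models kept for deployment).
summary = {
    name: (training_rounds(cfg["trained"], PARALLEL_SLOTS), cfg["stored"])
    for name, cfg in strategies.items()
}

for name, (rounds, stored) in summary.items():
    print(f"{name:24s} rounds={rounds:3d} stored_models={stored}")
```

Under these assumptions, fine-tuning needs 30 rounds while both ensemble variants need one; the pruned ensemble differs from Ensemble-a only in the number of models that must be kept for deployment.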
5 Application to Fraud Detection Tasks
We further apply the proposed framework to real-world fraud detection tasks. These tasks concern detecting fraud in different scenarios, for which methods with better performance are urgently needed. Note that, due to business requirements, multiple metrics need to be considered when deploying a model, which underlines the necessity of model optimization under multi-metric consideration. Following business experience, the negative instances are down-sampled to achieve better performance. The detailed information of the preprocessed business datasets is shown in Table 3, with the ratio of the number of negative samples to the number of positive samples denoted as |D−|/|D+|.

Table 3. The information of the fraud detection datasets.

Dataset  |Dtr|     |Dv|      |Dte|     |D−|/|D+|  Dim
Task1    796332    199084    248854    344        81
Task2    5927789   1481947   3175285   191        44
Task3    7406400   1851600   2314500   698        43
The evaluation results on these business datasets are shown in Table 4, with the best scores marked in bold. As we can see, the best performance is achieved either by the proposed methods ARC and RAM or by the Ensemble-a method, and
Table 4. The AUC, AP and F1 scores of the compared methods on business tasks.

Dataset     Method      AUC          AP           F1
Task 1      XGBoost-d   98.51±0.00•  52.63±0.00•  51.58±0.00•
            XGBoost-t   98.72±0.00•  54.49±0.00•  53.75±0.00•
            Ensemble-a  99.02±0.02   57.55±0.71   55.24±0.29
            ARC         98.90±0.01   57.59±0.68   55.56±0.59
            RAM         98.92±0.02   57.62±0.70   55.74±0.66
Task 2      XGBoost-d   92.07±0.00•  9.49±0.00•   16.11±0.00•
            XGBoost-t   92.15±0.00•  9.78±0.00•   16.32±0.00•
            Ensemble-a  92.08±0.04•  10.50±0.07   17.03±0.07
            ARC         92.21±0.01   9.93±0.04•   16.69±0.07•
            RAM         92.26±0.01   9.95±0.04•   16.73±0.08•
Task 3      XGBoost-d   87.48±0.00•  7.07±0.00•   16.92±0.00•
            XGBoost-t   87.81±0.00•  7.37±0.00•   16.70±0.00•
            Ensemble-a  87.80±0.00•  7.87±0.05    17.11±0.13
            ARC         88.13±0.00   7.71±0.07    17.19±0.14
            RAM         88.22±0.00   7.73±0.09    17.26±0.18
the advantage over the default single model XGBoost-d is very significant. ARC and RAM always perform better than the default single model XGBoost-d and the fine-tuned single model XGBoost-t, while Ensemble-a may be worse than them (AUC for Task2 and Task3). For Task2, the Ensemble-a method achieves the best scores in terms of AP and F1, while it is worse than XGBoost-t in terms of AUC, so the robustness of the Ensemble-a method under different metrics may be unsatisfactory. RAM is more robust than ARC across different metrics, which verifies the effectiveness of the introduced auxiliary evaluation. One issue that needs to be pointed out is that the fine-tuning process on industrial tasks becomes much more time-consuming, and it is infeasible to perform this process for all industrial tasks. For the proposed methods ARC and RAM, in contrast, the parameters of the base models are randomly sampled from the given lists, and the performance is competitive on all of these tasks, which means that the framework is feasible for real-world large-scale tasks.
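The gap between AUC and AP on the business tasks (for example, an AUC above 92 next to an AP of about 10 on Task 2) is typical for heavily imbalanced data. As a minimal sketch of how the three reported metrics can be computed, the pure-Python functions below use standard textbook definitions; the toy labels and scores, the function names and the decision threshold used for F1 are illustrative assumptions and are not taken from the experiments.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision@k over the ranks k at which a positive appears."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits

def f1_score(labels, scores, threshold):
    """F1 of the hard predictions `score >= threshold` (threshold is assumed)."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Toy imbalanced example: 2 positives among 20 samples, ranked 3rd and 6th.
scores = list(range(20, 0, -1))                   # 20, 19, ..., 1
labels = [1 if i in (2, 5) else 0 for i in range(20)]

toy_auc = auc(labels, scores)
toy_ap = average_precision(labels, scores)
toy_f1 = f1_score(labels, scores, threshold=10)
print(f"AUC={toy_auc:.3f} AP={toy_ap:.3f} F1={toy_f1:.3f}")
```

In this toy case the ranking looks good by AUC while AP and F1 stay low, because a handful of negatives ranked above the few positives already costs a lot of precision. This is one reason why considering several metrics at once can matter for such tasks.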
6 Conclusion
Machine learning techniques have been widely used in almost all Internet companies, and significant benefits have been obtained. Given a model, strategies like fine-tuning and ensemble techniques can be employed to fully exploit its performance. However, more effective methods that take task-specific requirements into account are always in demand. In this paper, we present a performance-enhancing framework, which utilizes the ensemble pruning technique and adapts a classification based optimization method to perform the pruning procedure. A calibration strategy is introduced in the weight optimization procedure to ensure the consistency of the pruned model between the selection and deployment stages. To meet the business requirement for multi-metric consideration, auxiliary evaluation metrics are incorporated in our framework, and an improved method is proposed to handle this scenario. Experiments are conducted on various tasks, and the results show that the proposed methods can effectively improve the performance. The proposed framework is further applied to real-world fraud detection tasks, which also demonstrates its effectiveness.
Customs Risk Assessment Based on Unsupervised Anomaly Detection Using Autoencoders
Dion T. Oosterman, Wouter H. Langenkamp, and Ellen L. van Bergen
Netherlands Organisation for Applied Scientific Research (TNO), The Hague, The Netherlands
[email protected]
https://www.tno.nl/en/
Abstract. In this paper we describe our initial findings on a method for improving anomaly detection on a dataset with scarcely labeled data, based on an ongoing use-case with the Belgian Customs Administration (BCA). Data on shipping containers is used to predict the level of risk associated with a shipment, as well as the probability that the shipment is fraudulent. The absence of labeled data prevents the use of supervised machine learning techniques and calls for unsupervised analysis. We employ an autoencoder to learn the distribution of the dataset and detect anomalies, under the assumption that only a fraction of all shipments is fraudulent. The absence of labels in the dataset complicates the evaluation of the autoencoder’s performance. A qualitative approach is taken to assess the properties of the detected anomalies. The variable distributions of the anomalies differ significantly from the variable distributions in the complete dataset and are marked as interesting by domain experts. To obtain an impression of the quantitative performance in the absence of ground-truths, synthetic data is generated using a variational autoencoder. The preliminary qualitative and quantitative results suggest that autoencoders can provide value for customs risk assessment.
Keywords: Customs risk assessment · Unsupervised learning · Anomaly detection · Autoencoder · Variational autoencoder · Synthetic data generation
1 Introduction
Millions of shipments traverse the world each day, making the management of maritime transport a highly dynamic and complex domain. Besides logistical challenges, there are also concerns about various other factors, such as import duties and the prevention of illegal goods crossing the borders. Customs play a large role in intercepting such shipments and are responsible for managing which goods are allowed to come into and go out of the country. Customs authorities have limited resources available to identify and inspect potentially fraudulent shipments. The number of goods traversing borders is increasing, making proper control even more challenging. To keep up, it is therefore necessary to make an informed decision on whether to inspect an incoming shipment or not. In the European Union, modern customs follow a risk-based approach to border control, where shipments of higher risk are more likely to be inspected and lower-risk traffic is admitted more freely. To detect high-risk shipments, various methods and rules are followed; a simple example of such a rule is flagging an odd combination of information or conflicting information. In the European research program PROFILE, TNO collaborates with the Belgian Customs Administration (BCA) to research customs risk management, focusing on understanding, assessing and improving various aspects of the logistics cycle. One of the aims of the program is to further optimize the current risk assessment process using data-driven analysis. Throughout the years, data-driven models have proven to be useful, with successes in the logistics domain as well [5]. The aim of such models is to identify patterns, detect abnormal observations (‘anomalies’) in the data and assess associated risk levels. The BCA has provided data containing information regarding goods being imported into, or exported out of, the country and the logistical processes around them. For targeting analysis, one of the most important data sources is the Entry Summary Declaration (ENS), which details the shipments coming into the country and, in this case, into the European Union. The ENS data is used to get a first impression of the shipments and to assess which of those should be selected for further inspection. This decision is currently largely based on a set of rules and expert knowledge. While this performs significantly better than random controls, it remains challenging to make effective decisions.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 668–681, 2022. https://doi.org/10.1007/978-3-030-82193-7_45
The goal of the overarching PROFILE project is to explore how data-driven models can be utilized to improve this decision-making process. Large amounts of (historical) data are available that may be used to train models that can help with future assessments. The overall objective is to assess the risks associated with shipments, in order to help prioritize which ones to inspect. The data-driven models are not intended to replace the current systems; they should rather be seen as an additional source of information and an extension of the systems already in place. One of the major insights that customs is interested in is the identification of trade patterns. A data-driven model can be trained to learn such patterns from the data and to determine what can be considered ‘normal’ behavior and what cannot. So-called anomalies, or outliers, are parts of the data that deviate in behavior from the ‘normal’ data. Anomalous behavior is thought to be an indicator of fraudulent shipments, meaning that surfacing these anomalies can help target shipments that warrant additional investigation or control. Analyzing large amounts of data can be a difficult and time-consuming task. This is also the case for the ENS data, as its analysis is complicated by various factors, such as poor data quality and a large amount of unstructured free-text information. The poor quality and lack of structure make feature selection and engineering very demanding. The other main difficulty in analyzing the ENS data is the sparsity of the feedback data and thus the very low number of ground-truth values that
can be used to train and validate any model. The analysis can be conducted using an unsupervised framework, in which patterns can be sought without existing labels. Evaluation of such models can, however, be challenging, as the lack of ground-truth values also limits the possibilities for a quantitative analysis. In this paper we describe the implementation of an autoencoder, which is used to learn and identify general patterns found in the shipments. Autoencoders show promising results for anomaly detection in a variety of domains and are versatile in use [8,12,13,16]. The trained autoencoder model learns to represent and reconstruct samples from a lower dimensional space, which forces the model to learn generalized behavior. After training, the model will be able to reconstruct ‘normal’ samples relatively well and therefore with low reconstruction error. Samples deviating from standard patterns, however, will be much harder to reconstruct and will thus come with a higher error. This property can be exploited to identify anomalous samples in the data that are of interest for further analysis. Visualizations are made to analyze qualitatively whether the detected anomalies are a good indication of fraudulent shipments. The autoencoder can thereby assist customs officers by highlighting unusual shipments; the officers can then ultimately decide which shipments to investigate further. Because feedback data is almost non-existent in the ENS data, a quantitative evaluation of the performance of the model cannot be made using the original data. We explore an alternative approach in which synthetically generated data is used in order to obtain an impression of the quantitative performance. A variational autoencoder (VAE) is used to generate synthetic data that is similar to the ENS data, which is labelled and used for validation of the models. VAEs have already shown good results for generating complex data [4,9,14,15].
Both the qualitative and quantitative evaluations suggest that autoencoders have value for anomaly detection in logistics and that it is possible to gain confidence in their functioning even in the absence of labeled data. VAEs are, like autoencoders, suitable for anomaly detection, which will be explored further in the remainder of the project. The data and methodology used in this study are discussed in Sect. 2, highlighting the different approaches used with the autoencoder and VAE. Section 3 details and compares the results of the different approaches. Section 4 provides the preliminary conclusions and next steps planned for this project.
2 Data and Methodology
A large quantity of historical and recent data is available to increase understanding of the processes at customs and improve the targeting of possibly fraudulent shipments. For this research, the so-called ENS dataset is used, which contains information on shipments before they arrive in Europe. The ENS dataset provides, for example, insight into how many containers are transported and which goods are imported, and is used by customs administrations to assess whether the shipment should be inspected. Data features related to a shipment are, for
instance, the route it has traveled, the description of the goods, the country of origin, the shipping company, and the container number. There are several other data sources available that contain additional information from customs as well as the logistical processes around it. These sources may be used in future research to enrich the information contained in the ENS to ultimately help improve targeting of fraudulent shipments. Data cleaning and feature engineering are essential to make the ENS data usable for analysis. Examples of transformations that are used are one-hot encoding of categorical variables and scaling of numerical ones. Doc2Vec, a document embedding method, is used to convert textual data into numerical vectors so that they can be used as input to machine learning models [7]. This method facilitates processing of textual data by the autoencoder and makes it possible to learn relationships between different categories. The categories of the HS-code (goods category code) are extracted following the official taxonomy, after which they are one-hot encoded. The features that are used in the analysis are presented in Table 1. The ENS data contains information about both maritime and aerial shipments, but for this analysis the data is filtered such that it contains only maritime shipments. After applying all filters, approximately 835,000 data points remain for analysis. Of these, BCA estimates that about 2 to 5% of the shipments are fraudulent, although it is impossible to know for certain. The number of observations that contain the variable ‘Result of the control’ (the variable which indicates whether a controlled shipment was fraudulent or not) is only 2,081, which is less than 0.1% of the total. Out of those 2,081 samples, only 32 shipments were found to contain fraudulent behavior; these are in fact the most insightful cases.
To further complicate things, there is no absolute certainty that the control feedback values are always reliable either. This makes it difficult to train any supervised model and thus further motivates the choice for an unsupervised framework.

Table 1. Overview of features used in the autoencoder.

Feature                               Data type after processing
Consignee country                     Categorical (one-hot encoding)
Consignor country                     Categorical (one-hot encoding)
Loading location (place + country)    Categorical (one-hot encoding)
Unloading location (place + country)  Categorical (one-hot encoding)
First entry (office + postal code)    Categorical (one-hot encoding)
HS-code (goods commodity code)        Categorical (one-hot encoding)
Goods description                     Doc2Vec vectors
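The two simplest transformations mentioned above, one-hot encoding and scaling, can be sketched in a few lines. This is a minimal illustration in plain Python, not the preprocessing pipeline used in the study: the records, the field names `consignee_country` and `gross_mass`, and the choice of min-max scaling are hypothetical, and the Doc2Vec step for free-text fields is omitted.

```python
def one_hot(values):
    """Return (categories, rows): one indicator column per distinct category."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0.0] * len(categories)
        row[index[v]] = 1.0
        rows.append(row)
    return categories, rows

def min_max_scale(values):
    """Scale numerical values linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]

# Hypothetical shipment records; the field names are illustrative only.
shipments = [
    {"consignee_country": "BE", "gross_mass": 1200.0},
    {"consignee_country": "NL", "gross_mass": 300.0},
    {"consignee_country": "BE", "gross_mass": 750.0},
]

categories, country_cols = one_hot([s["consignee_country"] for s in shipments])
mass_col = min_max_scale([s["gross_mass"] for s in shipments])

# One feature vector per shipment: indicator columns followed by the scaled value.
features = [cols + [m] for cols, m in zip(country_cols, mass_col)]
print(categories)
print(features)
```

Keeping every feature in [0, 1] is convenient when the network ends in a sigmoid output layer trained with a cross-entropy loss, since inputs and reconstructions then live on the same scale.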
2.1 Autoencoder
Autoencoders are a type of neural network that can be utilized to learn an approximation of the distribution of the data. The goal of an autoencoder is to learn a lower dimensional representation of the input features and to reconstruct the original input from this lower dimensional representation [3,11]. This makes it possible to train them without the need for labelled feedback data, which also makes them suitable for our use-case. A general architecture of an autoencoder network is shown in Fig. 1. The original data is stored in the $n \times p$ matrix $X$, where $n$ denotes the number of observations and $p$ the number of variables. The observations are $x_i = (X_{i1}, X_{i2}, \ldots, X_{ip})$ for $i = 1, \ldots, n$, and form the input vectors for the neural network. The output cells are matched to the input cells of the network because the aim is to reconstruct the input. The layer in the network with the lowest dimensionality is called the ‘bottleneck layer’, containing the nodes $z_1, \ldots, z_M$ with $M < p$. Its purpose is to capture the most useful properties of the data such that the network can reconstruct the original input as well as possible. This forces the network to learn general patterns that describe the majority of the samples. Compressing the input into a lower dimensional representation is called encoding. Mapping the nodes in the hidden layer back to the high-dimensional input vectors is similarly called decoding. The model learns to reconstruct the input data from the encoded representation, which results in the estimations $\hat{X}_1, \hat{X}_2, \ldots, \hat{X}_p$. The compression in the hidden layers is typically lossy, meaning that part of the information in the data is lost during encoding and cannot be recovered during decoding. The amount of loss is highly dependent on the extent of the reduction in dimensionality, or in other words, on how many fewer nodes $M$ there are compared to the original input size $p$.
Fig. 1. Network architecture of an autoencoder.
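To make the encoding and decoding steps concrete, the following sketch runs one forward pass through a toy autoencoder with a single bottleneck layer. It is an untrained illustration in plain Python under several assumptions: the weights are random, the input size of 8 and the reduction factor of 4 are arbitrary, only one hidden layer is used, and the helper names are hypothetical. A real model would be trained with a deep learning library.

```python
import math
import random

def bottleneck_size(input_dim, encoding_factor):
    """Bottleneck width M for an input of size p, using integer division
    (the exact rounding rule is an assumption)."""
    return input_dim // encoding_factor

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(x, weights, bias, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def random_layer(n_out, n_in, rng):
    """Random (untrained) weights and zero biases, for illustration only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
p = 8                                  # toy input dimension
M = bottleneck_size(p, 4)              # toy bottleneck dimension, M < p
enc_w, enc_b = random_layer(M, p, rng)
dec_w, dec_b = random_layer(p, M, rng)

x = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]   # one encoded observation x_i
z = dense(x, enc_w, enc_b, relu)               # encoding: p -> M
x_hat = dense(z, dec_w, dec_b, sigmoid)        # decoding: M -> p

print(len(x), len(z), len(x_hat))
```

The reconstruction `x_hat` has the same length as the input while all information must pass through the `M` bottleneck values, which is what makes the compression lossy.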
A grid search is done to determine suitable hyperparameters for the autoencoder. Multiple options are implemented for the activation functions of the hidden layers and the output layer, the loss function, and the optimization algorithm. The grid search covered various possibilities for the hyperparameters, but was not exhaustive. Among others, a comparison was made between shallow networks and networks with a higher number of hidden layers. Model performance could be further improved by a more extensive hyperparameter optimization. The following settings were found to perform best in our grid search: the Rectified Linear Unit (ReLU) function to activate the hidden layers, the sigmoid function to activate the output layer, the Adam optimization algorithm and the binary cross-entropy loss function. The encoding factor is set to 16; this results in a network with an input layer of 1140 nodes that is reduced to 71 dimensions in the bottleneck layer. Deeper networks performed better than shallow networks; 5 hidden layers were found to work best among the configurations tried, although this may still be suboptimal given the non-exhaustive search. After training, the model will have learned a generalized representation of the shipments. The assumption is that the number of fraudulent data points in the ENS data is very low, meaning that the learned representation should be more representative of ‘normal’, non-fraudulent shipments. This principle can be leveraged to find anomalous shipments, which should help target fraudulent behavior. The intuition behind this is that the autoencoder will have a harder time reconstructing fraudulent shipments than non-fraudulent shipments, meaning that data points with a higher reconstruction error may be regarded as anomalies. For this case, observations with the 5% highest reconstruction errors are labeled as anomalies.
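The thresholding rule can be sketched as follows. The per-sample error is taken here to be the mean binary cross-entropy between an input and its reconstruction, which matches the loss named above but is an assumption as far as anomaly scoring is concerned; the function names, the clipping constant and the toy error values are hypothetical.

```python
import math

def bce_error(x, x_hat, eps=1e-7):
    """Mean binary cross-entropy between an input and its reconstruction."""
    total = 0.0
    for xi, xh in zip(x, x_hat):
        xh = min(max(xh, eps), 1.0 - eps)   # clip to avoid log(0)
        total -= xi * math.log(xh) + (1.0 - xi) * math.log(1.0 - xh)
    return total / len(x)

def flag_anomalies(errors, fraction=0.05):
    """Indices of the `fraction` of samples with the highest error."""
    k = max(1, round(fraction * len(errors)))
    order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return sorted(order[:k])

# A well reconstructed sample has a lower error than a poorly reconstructed one.
x = [1.0, 0.0, 0.0, 1.0]
good = bce_error(x, [0.9, 0.1, 0.1, 0.9])
bad = bce_error(x, [0.4, 0.6, 0.5, 0.3])

# Toy error values for 40 samples; samples 7 and 23 stand out.
errors = [0.1] * 40
errors[7], errors[23] = 0.9, 0.8
anomalies = flag_anomalies(errors, fraction=0.05)
print(good < bad, anomalies)
```

Changing `fraction` moves the cut-off along the ordered list of errors, which is how the share of flagged shipments could be tuned to the available inspection capacity.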
The tested samples can simply be ordered by reconstruction error, so this threshold can be adjusted according to risk appetite, which will also influence the ratio of true to false positives. In order to investigate whether the detected anomalies are a good indication of fraudulent shipments, an in-depth analysis with visualizations of the anomalies is made. These visualizations show a comparison between the feature distribution for the anomalies and that of the complete dataset. When these distributions are similar, that feature likely did not contribute to the observation being classified as anomalous. On the other hand, when the distributions are significantly different, the feature values may give insight into what constitutes suspicious behavior.
2.2 Variational Autoencoder
Feedback data is sparse in the ENS data, making it difficult to validate the models. One way to overcome this challenge is by generating synthetic feedback data, which can be used to help quantify the performance of the autoencoder. Evaluation using generated synthetic data will give further insight into how good the model actually is at classifying anomalous data points. This should not be seen as definitive proof of the model’s capability of detecting fraudulent shipments, but rather of the model’s ability to classify samples deviating from the standard patterns. In order for the quantitative results to be a representative evaluation for the real shipment data, it is essential that the generated data is
similar to the original data. Synthetic data that is similar to the original data can be simulated with a variational autoencoder (VAE), which is a type of autoencoder whose bottleneck layer is represented by distributions instead of absolute values, denoted as p(z|X) [6]. By sampling from these distributions, different outputs are constructed at the output layer. This principle can be leveraged to generate new (outlier) data that is representative of the samples already present in the data. The distribution is regularized by restricting it to be close to a standard Gaussian distribution, which makes the latent space of the autoencoder regular enough to generate new data points. A point can be taken randomly from this space and decoded to obtain new data. Since the VAE generalizes the input in the hidden layer, measures have to be taken in order to generate realistic outliers. Apart from looking at existing outliers, one way of achieving this is by adding contamination to the data points, which can then be regarded and labeled as outliers. In statistics, an outlier often refers to a row of the data matrix that represents an abnormal observation [10]. To add row-wise outliers, the classical Tukey-Huber Contamination Model (THCM) [1] can be applied. The main idea is to model the data as a mixture of two distributions: a distribution related to the nominal model and a distribution corresponding to the outliers. The THCM principle can be applied in the hidden layer of the VAE. According to the properties of a VAE, encoded values for a ‘normal’ data point are drawn from the multivariate distribution N_p(0, I). Decoding the encoded values then results in a new observation. An anomaly can be generated by adjusting the mean or the variance of the distribution in the hidden layer and decoding these values. To visualize the principle, the dimension of the hidden layer of the VAE is set to two and depicted in Fig. 2.
Fig. 2. Encoded values in the two-dimensional bottleneck layer of the variational autoencoder for a synthesized data set with row-wise outliers: (a) adjusted mean; (b) adjusted variance.
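The row-wise contamination in the latent space can be illustrated with a minimal sketch. It assumes a mean shift of 3 and a variance scale of 9 (values chosen here for illustration only, not taken from the study), and it replaces the trained VAE decoder with a hypothetical stand-in, `toy_decoder`, so that the example runs on its own.

```python
import numpy as np


def sample_latent(n, dim, outlier_frac=0.04, mean_shift=0.0, var_scale=1.0, seed=0):
    """Draw latent codes from a Tukey-Huber style mixture.

    Regular codes come from N(0, I); a fraction `outlier_frac` of the rows is
    drawn from N(mean_shift * 1, var_scale * I) instead.
    Returns (codes, labels), where label 1 marks a contaminated row.
    """
    rng = np.random.default_rng(seed)
    n_out = int(round(outlier_frac * n))
    z = rng.standard_normal((n, dim))                    # nominal part: N(0, I)
    labels = np.zeros(n, dtype=int)
    idx = rng.choice(n, size=n_out, replace=False)       # rows to contaminate
    z[idx] = mean_shift + np.sqrt(var_scale) * z[idx]    # shift and/or rescale
    labels[idx] = 1
    return z, labels


def toy_decoder(z, n_features=10, seed=1):
    """Hypothetical stand-in for a trained VAE decoder (random linear map + sigmoid)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((z.shape[1], n_features))
    return 1.0 / (1.0 + np.exp(-(z @ w)))


# Assumed settings: 2000 generated rows, 4% outliers, two latent dimensions.
z_mean, y_mean = sample_latent(2000, dim=2, mean_shift=3.0)   # adjusted mean
z_var, y_var = sample_latent(2000, dim=2, var_scale=9.0)      # adjusted variance
x_mean = toy_decoder(z_mean)                                  # decoded observations
```

Because the contamination is applied before decoding, every feature of a contaminated row is affected, which is what makes these outliers row-wise.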
A different type of anomaly is the cell-wise outlier, where most of the features in a row have regular values and one or a few features have anomalous values. The Independent Contamination Model (ICM) [2] can be used to generate such cell-wise outliers. Before applying the model, a clean data matrix X, which is similar to the original data, is simulated using the VAE. Then a part of the data is selected at random, and every column of this part is contaminated by adjusting a specified percentage of the cells. As a result, these observations have outlying values for some features and are labeled as anomalies. For our case, a low percentage of contamination in every column was chosen (2%), to limit the number of contaminated cells for the anomalous observations. This leads to observations with outlying values for some features and regular values for all other features. There are two notable differences between the two types of outliers. First, for cell-wise outliers, only specific individual features are adjusted to generate an outlier, whereas for row-wise outliers, all features in that row are altered. Second, the contamination for row-wise outliers takes place before the decoding of the data points, whereas for cell-wise outliers it is done after the decoding step. Cell-wise contamination can only be added after the decoding process because one outlying cell in the hidden layer of the VAE can influence all of the decoded values, making it a row-wise outlier instead. Exploring both types of outliers gives a better idea of the performance and stability of the autoencoder. Following the assumption that 2 to 5% of the observations are anomalous, the outlier percentage is set to 4%. This means that in the case with row-wise outliers, 4% of the new data is generated with the adjusted distribution in the hidden layer of the VAE. For cell-wise outliers, the part of the data that is contaminated with outlying cells contains 4% of the generated observations.
With the use of these synthetic data points, quantitative measures such as the number of true positives and false positives can be calculated to evaluate the model. The described simulations are done multiple times to account for the randomness in the generation process.
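The cell-wise contamination step can be sketched as follows. This is only an illustration under stated assumptions: the clean matrix is random data rather than VAE output, a cell is contaminated by adding a fixed `shift` of 5 (a hypothetical choice), and the function name `contaminate_cells` is introduced here for the example.

```python
import numpy as np


def contaminate_cells(x_clean, row_frac=0.04, cell_frac=0.02, shift=5.0, seed=0):
    """Contaminate individual cells of decoded data.

    A random `row_frac` of the rows is selected; within that part, a share
    `cell_frac` of the cells in every column is shifted by `shift`.
    Returns (contaminated data, row labels, boolean mask of changed cells).
    """
    rng = np.random.default_rng(seed)
    x = x_clean.copy()
    n, p = x.shape
    rows = rng.choice(n, size=int(round(row_frac * n)), replace=False)
    k = max(1, int(round(cell_frac * rows.size)))   # cells to change per column
    mask = np.zeros((n, p), dtype=bool)
    for j in range(p):
        hit = rng.choice(rows, size=k, replace=False)
        mask[hit, j] = True
    x[mask] += shift
    labels = np.zeros(n, dtype=int)
    labels[rows] = 1   # the whole selected part is labeled anomalous
    return x, labels, mask


# Stand-in for clean decoded VAE output: 2000 rows, 30 features.
rng = np.random.default_rng(42)
x_clean = rng.random((2000, 30))
x_cont, labels, mask = contaminate_cells(x_clean)
```

With few columns, some selected rows may receive no contaminated cell at all; with the 1140 input features mentioned earlier this becomes unlikely, but a stricter variant could enforce at least one changed cell per labeled row.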
3 Results
3.1 Autoencoder on ENS Data
The autoencoder is trained for 30 epochs, after which previously unseen test samples are fed to it. The distribution of the reconstruction errors for data points in the test set is plotted in Fig. 3, where data points with a high reconstruction error are considered anomalous. Given the assumption that 2 to 5% of all shipments are fraudulent, an error threshold (the red line in Fig. 3) is chosen such that 5% of the predictions are labeled as anomalies. This allows for a high recall, but may also lead to a higher number of false positives. In practice, this percentage should be adjusted depending on which metric (e.g. recall, precision) one would like to optimize.
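A minimal sketch of the thresholding step is given below. It assumes the reconstruction error is the per-row mean squared error (the error measure is an assumption made for this example) and uses random data in place of real inputs and reconstructions; `flag_anomalies` is a name introduced here.

```python
import numpy as np


def flag_anomalies(x, x_hat, top_frac=0.05):
    """Label the rows with the highest reconstruction errors as anomalies.

    Returns (per-row errors, error threshold, boolean anomaly flags).
    """
    errors = np.mean((x - x_hat) ** 2, axis=1)        # per-row reconstruction error
    threshold = np.quantile(errors, 1.0 - top_frac)   # e.g. the 95th percentile
    return errors, threshold, errors > threshold


# Stand-in data: 1000 rows with noisy "reconstructions".
rng = np.random.default_rng(0)
x = rng.random((1000, 20))
x_hat = x + rng.normal(scale=0.05, size=x.shape)
errors, threshold, flags = flag_anomalies(x, x_hat)
```

Changing `top_frac` moves the threshold along the ordered errors, which is how the trade-off between recall and precision would be tuned.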
Fig. 3. Distribution of the autoencoder reconstruction errors for data points in the test set.
Given the near-absence of feedback data, the anomalies are evaluated qualitatively. To evaluate whether the detected anomalies are a good indication of fraudulent shipments, variable distributions are visualized and compared between all data points and the anomalous data points. The 10 most frequent values for the loading country and the goods description are displayed in Fig. 4 in a pseudonymized manner for confidentiality reasons (each value was assigned a numerical label instead of its true value). The distribution of the loading place differs markedly between the anomalies and all data points. The most frequent value for the goods category is ‘0’ in both the anomalies and all data points, but all other values in the top 10 are different. If the distribution in the anomalies is similar to the distribution in all data points, the former can be explained by the latter, which indicates that the variable is not likely to contribute to the high reconstruction error. When the distribution in the anomalies is different from the distribution in all data points (e.g., in the loading place), however, it can be an indication that this variable is used by the autoencoder to predict anomalies. In discussions with BCA, it has been expressed that the anomalies found by the autoencoder are comparable to shipments they would flag as anomalies themselves. This suggests that anomaly detection using an autoencoder can be of value in supporting targeting officers in their task of risk assessment.
3.2 Autoencoder on Synthetic Data
To quantify the results of the autoencoder, a simulation study with synthesized data is performed. Row-wise and cell-wise outliers are generated with an adjusted mean or variance, resulting in four types of outliers. Table 2 presents the confusion matrices and F1-Score for every type of outlier. The simulations
Fig. 4. Comparison of variable distributions between anomalies and all data points. Two variables are illustrated, (a) a shipment’s loading place and (b) its goods category, and the values are pseudonymized.
are done multiple times to account for randomness and the results are taken as the average of all instances. The F1-Score measures two things, namely the precision and the recall of the outlier detection method. Precision represents the percentage of truly positive predictions out of all positive predictions:

$$\text{precision} = \frac{TP}{TP + FP} \qquad (1)$$

Recall gives the percentage of predicted positive values out of the total positive instances:

$$\text{recall} = \frac{TP}{TP + FN} \qquad (2)$$

The F1-Score is a harmonic mean of the two measurements:

$$\text{F1-Score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (3)$$
For a robust classifier, both percentages should ideally be high and thus the F1-Score close to 1. The results show that all the cell-wise outliers are detected by the model, along with a number of false positives, with an overall high F1-Score. The observations with outlying values deviate from the ‘normal’ data substantially, likely making them easy to detect. Nevertheless, this supports the statement that this is how the reconstruction errors come about. The row-wise outliers proved harder to detect, especially the outliers with an adjusted mean. For these, less than half of the anomalies are detected by the network and the F1-Score is low. The results for the row-wise outliers with adjusted variance are better (an F1-Score of 0.649 versus 0.225). It is likely that the row-wise outliers fit in better with the clean data because the decoding process is also applied to the outliers, making them harder to detect. The question remains whether the generated anomalies are representative of the anomalies in the real ENS data, as it is unknown what the real outliers look
Table 2. Confusion matrices for the four types of anomalies in the simulation study. Rows give the prediction, columns the actual value.

(a) Row-Wise Outliers with Adjusted Mean
Prediction    Actual Positive    Actual Negative
Positive      20.21              79.79
Negative      59.79              1840.21
F1-score: 0.225

(b) Row-Wise Outliers with Adjusted Variance
Prediction    Actual Positive    Actual Negative
Positive      58.39              41.61
Negative      21.61              1878.39
F1-score: 0.649

(c) Cell-Wise Outliers with Adjusted Mean
Prediction    Actual Positive    Actual Negative
Positive      80.00              20.00
Negative      0.00               1900.00
F1-score: 0.889

(d) Cell-Wise Outliers with Adjusted Variance
Prediction    Actual Positive    Actual Negative
Positive      80.00              20.00
Negative      0.00               1900.00
F1-score: 0.889
like given the sparsity of feedback data. However, by trying multiple options and simulating new data numerous times, a first quantitative evaluation can be made. It can be concluded that in two out of four cases, the autoencoder performs excellently. The row-wise anomalies are likely far more similar to a ‘normal’ observation and thus harder to detect. Although the performance for the row-wise outliers is significantly lower, this does not necessarily make the model useless. Current targeting practices struggle to obtain high accuracy, meaning that even imperfect performance can in some cases still be considered an improvement. In practice, there may be different priorities, where in some cases low precision or recall is accepted or even inevitable. Quantitative evaluation remains an open challenge; for more grounded conclusions about model performance, the availability of labelled data is key.
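As a consistency check, the reported F1-Scores can be recomputed from the confusion matrices with Eqs. (1) to (3). The sketch below assumes the reading of Table 2 in which rows are predictions and columns are actual values, and in which the first two matrices belong to the row-wise outliers; the function name `f1_from_counts` is introduced here.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) from averaged confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (TP, FP, FN) as read from Table 2; the assignment to outlier types is an assumption.
table2 = {
    "row-wise, adjusted mean": (20.21, 79.79, 59.79),
    "row-wise, adjusted variance": (58.39, 41.61, 21.61),
    "cell-wise, both variants": (80.00, 20.00, 0.00),
}

for name, (tp, fp, fn) in table2.items():
    p, r, f1 = f1_from_counts(tp, fp, fn)
    print(f"{name}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

In each case the predicted positives sum to 100 out of 2000 observations, which matches the 5% threshold, and the actual positives sum to 80, which matches the 4% outlier percentage.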
4 Future Work and Conclusions
4.1 Conclusions
The preliminary findings described in this paper demonstrate that it is possible to utilize autoencoders for unsupervised anomaly detection in customs risk assessment of shipments. Autoencoders can learn to generalize over a dataset and predict anomalies without the need for labels, on the condition that the majority of data points are non-anomalous. BCA has indicated that anomalies predicted by the autoencoder resemble data points they mark as anomalous themselves. The use of variational autoencoders makes it possible to evaluate anomaly detection performance quantitatively by generating synthetic data and labels. The quantitative results show acceptable performance on the generated data, although this is under the assumption that the generated anomalies are representative of real anomalies.
4.2 Future Work
The PROFILE program is still in progress, and so is this research. Among others, the following topics will be addressed in the remainder of the program:
• Optimizing the models: The models will be further optimized by exploring other network structures, hyperparameter configurations and feature engineering methods. We will compare the autoencoder to other types of anomaly detection (e.g. utilizing the VAE) to determine how well each is suited and whether they can complement each other. Additionally, there are other data sources available that may be utilized to enrich the information present in the ENS data. An example of such a data source is import declaration data, which provides information about the actual goods being imported, rather than the shipping container. The enriched data will be used for analysis and the results compared to determine the benefits of the approach.
• Making anomalies interpretable: This paper details various methods to evaluate the performance of the implemented anomaly detection mechanism. Especially from a domain point of view, it is interesting to analyze the anomalies in further detail, as this may provide insight into which features contribute to a shipment being anomalous. Apart from individual features, the combination of features and, for example, temporal patterns underlying the anomalies are of interest. To obtain the most insight from an anomaly analysis, it is beneficial when the autoencoder predictions are interpretable. We intend to make the outputs interpretable using SHAP (SHapley Additive exPlanations). Utilizing SHAP, we can visualize per feature what its contribution is to the model output. In other words, it answers the question of whether a feature makes a shipment more or less likely to be considered anomalous. Given the high complexity of the ENS data and the presence of categorical features in it, implementing SHAP is not a trivial task.
However, if we can provide customs authorities with insight into how certain features contribute to the risk predictions of the autoencoder, this helps to build trust in the use of the model as an assisting tool in customs risk assessment.
• Comparing the results to current practices: The quantitative analysis was done in order to quantify the absolute performance of the model in detecting anomalies. The ability to detect was good for the cell-wise outliers, but less so for the row-wise outliers. Current practices are, however, nowhere near perfect detection rates either. A fairer evaluation may therefore be a comparison with the current performance. It is also important to consider the ratio of true to false positives, which may be further optimized by adjusting the classification threshold. Since the system will be supplementing the existing systems in place, it may prove more valuable to have a high precision as opposed to a high recall.
• Mechanism for continuous updating of the model: The preliminary results show good promise for the use of unsupervised analysis. It would, however, be a waste not to use the feedback information that is or will become available over time. Even though feedback is sparse, it does provide valuable
insights for future analysis. We will therefore implement a mechanism that allows for real-time updating of the models as soon as information becomes available. By doing so, the models should become less sensitive to outliers that were marked as false positive in the past. The inspection-optimization cycle may be regarded as a real-time control problem, which may be optimized using algorithms such as reinforcement learning.

Acknowledgments. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 786748. We also acknowledge the TNO colleagues involved in PROFILE (e.g. Wout Hofman and Jok Tang) and the PROFILE consortium partners, in particular the Belgian Customs (e.g. Jonathan Migeotte and Mathieu Labare) for their valuable help and support.
References
1. Agostinelli, C., Leung, A., Yohai, V.J., Zamar, R.H.: Robust estimation of multivariate location and scatter in the presence of cellwise and casewise contamination. TEST 24(3), 441–461 (2015). https://doi.org/10.1007/s11749-015-0450-6
2. Alqallaf, F., Van Aelst, S., Yohai, V.J., Zamar, R.H.: Propagation of outliers in multivariate data. Ann. Stat. 37(1), 311–331 (2009)
3. Bourlard, H., Kamp, Y.: Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybernet. (3), 291–294 (1988). https://doi.org/10.1007/BF00332918
4. Doersch, C.: Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908 (2016)
5. Govindan, K., Edwin Cheng, T.C., Mishra, N., Shukla, N.: Big data analytics and application for logistics and supply chain management (2018)
6. Kingma, D.P., Welling, M.: Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013)
7. Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine Learning, pp. 1188–1196 (2014)
8. Oh, D.Y., Yun, I.D.: Residual error based anomaly detection using auto-encoder in SMD machine sound. Sensors 18(5), 1308 (2018)
9. Pu, Y., et al.: Variational autoencoder for deep learning of images, labels and captions. In: Advances in Neural Information Processing Systems, pp. 2352–2360 (2016)
10. Rousseeuw, P.J., Hubert, M.: Anomaly detection by robust statistics. Wiley Interdiscipl. Rev. Data Min. Knowl. Discov. 8(2), e1236 (2018)
11. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Technical Report, California Univ San Diego La Jolla Inst for Cognitive Science (1985)
12. Sabokrou, M., Fathy, M., Hoseini, M., Klette, R.: Real-time anomaly detection and localization in crowded scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 56–62 (2015)
13. Sakurada, M., Yairi, T.: Anomaly detection using autoencoders with nonlinear dimensionality reduction. In: Proceedings of the MLSDA 2014 2nd Workshop on Machine Learning for Sensory Data Analysis, pp. 4–11 (2014)
14. Santana, E., Hotz, C.: Learning a driving simulator. arXiv preprint arXiv:1608.01230 (2016)
15. Walker, J., Doersch, C., Gupta, A., Hebert, M.: An uncertain future: forecasting from static images using variational autoencoders. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol. 9911. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_51
16. Xu, D., Ricci, E., Yan, Y., Song, J., Sebe, N.: Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553 (2015)
Best Next Preference Prediction Based on LSTM and Multi-level Interactions
Ivett Fuentes1,3(B), Gonzalo Nápoles2, Leticia Arco4, and Koen Vanhoof3
1 Computer Science Department, Central University of Las Villas, Cuba, Carretera Camajuaní km 5.5, 54830 Santa Clara, Cuba
2 Department of Cognitive Science and Artificial Intelligence, Tilburg University, Tilburg, the Netherlands
3 Faculty of Business Economics, Hasselt University, Hasselt, Belgium [email protected]
4 Artificial Intelligence Lab, Vrije Universiteit Brussel, Ixelles, Belgium
Abstract. Predicting customer buying behavior is an important task for improving direct marketing campaigns, offering the best possible experiences, and providing personalization in the customer journey. Improving how models capture the sequential information in transactional data is essential to learn customer buying order and repetitive buying patterns and to generate recommendations over time. In this paper, we propose the deep neural network approach DeepCBPP, which models the sequence prediction problem as a multi-class classification problem and takes the LSTM neural network as the base of the training process. Our main contribution is a new customer sequence representation approach based on multi-level interactions of the most recent influential items, which allows predicting preferences without sophisticated feature engineering. Simulations using 12 datasets from a real-world problem achieve competitive results compared to state-of-the-art sequence prediction models, supporting the effectiveness of our proposal.
Keywords: Customer sequence representation · Sequence prediction models · LSTM · Customer buying behavior · Multi-class classification
1 Introduction
This study was supported by the Special Research Fund (BOF project BOF17BL08) of Hasselt University. The authors would like to thank the anonymous commercial partners for providing the data sources and other resources used in this research.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 682–699, 2022. https://doi.org/10.1007/978-3-030-82193-7_46
Several application contexts (retail sales, web sessions, credit card transactions, etc.) maintain large information systems for capturing records about customer buying transactions in a cost-effective manner [5]. Understanding and predicting future customer buying preferences from this huge volume of transactional data is crucial to improving customer satisfaction and direct marketing campaigns, which
eventually materializes in more efficient strategies for customer management and in business profit. These tasks also provide a set of recommended products that customers are interested in and spare customers from searching through an extensive collection of products, while allowing for personalization in the customer journey [9,18,21]. Customer Buying Preference Prediction (CBPP) is about predicting the next best preferences of a target customer over a collection in a specific application domain, based on the customer's previous preference interactions [23]. Although the problem of predicting preferences could be handled by the classical product recommender approaches (i.e., collaborative filtering, cluster-based, or content-based models), these techniques only partially cover this problem because they do not consider the temporal sequence within the historical customer buying behavior when making a prediction. As a result, they can only recommend items that a customer would like in general. In order to incorporate the temporal aspect correctly and make accurate predictions, many Sequence Prediction (SP) models have been proposed. Some of the most popular are the Dependency Graph (DG) [19] and All-K-Order-Markov (AKOM) [20] models, which are based on the Markov property. Other models include Transition Directed Acyclic Graphs (TDAG) [11] and LZ78 [30]. Specifically, TDAG is based in part on the Markov stochastic-process model because it builds Markov trees in which paths represent observed prefix subsequences. Similarly, LZ78 is a classic compression algorithm adapted for sequence prediction and founded on an information-theoretic approach [30]. However, these approaches build lossy models because they do not exploit all relevant information in the training sequences during the predictive analytic process to anticipate future behaviors.
In an attempt to address this issue, in [4] the authors presented a novel approach, named Compact Prediction Tree + (CPT+), that uses the whole information based on Frequent Subsequence Compression (FSC), Simple Branches Compression (SBC), and Prediction with improved Noise Reduction (PNR) strategies. With deep learning receiving much attention in recent years, a new approach to modeling sequential data has been explored, driving a remarkable revolution in recommender applications [1,17,28]. Specifically, Long Short-Term Memory (LSTM) architectures have been very effective for learning complex sequential patterns directly from low-level features without needing feature engineering [22]. Because an LSTM can remember long-term dependencies over time, which is required to succeed in learning problems with temporal dependencies, it has become an effective base model for several sequence prediction applications such as machine translation [23,28], image captioning [1], handwriting recognition [17], and next activity prediction [25]. Similarly, recent deep approaches focus on sequential preference recommendations [6,12,13,15,24]. However, most of them consider the time interval as an explicit component and use it for the learning process in sessions of several periods [6,12,13,24]. As a result, they do not allow capturing the influence of previous multiple levels in scenarios in which the order could be small (i.e., it has a short
length), and the interactions in large periods have a higher dependence on the previous sessions. In this paper, we propose a deep neural network approach named DeepCBPP, which models the sequence prediction problem as a multi-class classification problem while taking LSTM as the base learner. DeepCBPP automatically learns behavioral patterns from transactional buying histories and suggests the probable product categories or products that a customer could buy in the next visit. In particular, we make the following main contributions: (i) a new customer sequence representation that captures both the implicit time component and the multi-level interactions in scenarios in which the orders are short and the interactions have a higher dependence on the previous decisions; (ii) a deep learning approach that predicts customer preferences without sophisticated feature engineering by combining the new multi-level sequence representation with the LSTM model. In addition, we conduct extensive experiments evaluating the performance of our general deep approach by fine-tuning the model structure (i.e., changing the LSTM model architecture). We compare our DeepCBPP approach with five sequence prediction methods and two deep recommendation approaches that can be adapted to deal with consecutive buying preferences, revealing that our approach outperforms these baseline methods. The remainder of this paper is organized as follows. Section 2 reviews the approaches reported in the literature related to deep sequential recommendations, mainly those based on LSTM architectures. Section 3 introduces our DeepCBPP approach for the best next preference prediction. The issues related to the LSTM architectures are discussed in Sect. 4. Section 5 presents the experimental results for different LSTM architectures on 12 datasets from a real-world problem and against the five state-of-the-art prediction models mentioned before. Finally, Sect. 6 outlines conclusions and future work.
2 LSTM Based Recommendations
In recent years, several studies have started to focus on deep sequential recommendations based on the Recurrent Neural Network (RNN) model [6,8,12,13,25,26]. Among them, LSTM architectures are very attractive for preference prediction because they are effective at learning complex sequential patterns directly from low-level features [8,12,26] and are promising for addressing the challenges related to the tedious feature engineering of vector-based methods [12]. Some sequence modeling proposals use a simple LSTM model to predict the next activity or recommendation [12,25]. Specifically, in [25] the authors employ a simple RNN architecture with a single LSTM layer and the Adam optimizer with standard settings for the training process. Zalando’s team [12] has applied RNN to model the sequence of the customer
interactions in their webshop. The authors utilize the event data from the different sessions (orders) of the customer to predict the probability that the customer will place an order within the next seven days. In particular, the input is a sequence of one-hot encoding vectors which represent past customer actions such as product views, cart additions, orders, etc. The authors use a simple RNN architecture with a single LSTM layer. However, this approach attempts to understand customer behavior at the level of sessions, based on session streams. A collaborative filtering approach is treated as a sequence prediction problem using RNN in [2]. Unlike most collaborative filtering methods, this work considers the time dimension. In this case, the authors create a sequence of movie ratings to predict which movies the user will watch in the future. They encode the sequence of movies that the user has rated in the past as a sequence of one-hot encoding vectors. Jing et al. [8] proposed a multi-task learning framework to predict customers’ returning time and recommend items simultaneously. The returning time prediction is inspired by a survival analysis model designed for estimating the probability of survival of a patient [29]. The authors modified this model by using LSTM to estimate the customers’ returning time based on past session actions. Unlike the previously mentioned session-based recommendation approaches, which focus on recommending within the same session, this model aims to provide inter-session recommendations, including local time dependencies. The authors use a single LSTM layer in which the number of hidden units and the embedding size are set to the same value. Li et al. [15] presented a behavior-intensive neural model for sequential recommendation. Specifically, this model combines two components: neural item embedding and discriminative behaviors learning.
The latter part constructs two alignments of customer behaviors based on two LSTMs, for session and preference behaviors learning respectively. In [26], the authors focus on applying encoder-decoder techniques for modeling customer context from mobile device data to infer contextual user preferences. Specifically, they aim to infer unavailability for receiving recommendations (busy) and preferred categories of items (food, nightlife spot). However, the study is limited to a few preferences. The model is based on the contextual situation of the user (e.g., a mobile user eating in a fast-food restaurant, or busy and not available for any further interaction with the mobile phone). Such information could help to deliver targeted personalized advertisements based on the user's dynamic and changing context. In general, this research is based on contextual information, which differs from the problem addressed here, but it explores a different architecture with an encoder-decoder as a base for capturing the content, like the models used for language translation or speech recognition [28]. Li et al. [14] proposed an attention-based LSTM model for hashtag recommendation in microblogs. This approach takes advantage of both RNNs and the attention mechanism to capture the sequential property and recognize the informative words in posts. Loyala et al. [16] proposed an encoder-decoder architecture with attention for customer session and intent modeling.
This model incorporates two RNNs and can capture the transition regularities more expressively. However, the model is trained on sequences of items representing a customer session, without considering the whole customer buying sequence. This characteristic could limit the evaluation of the influence of previous multiple levels in scenarios in which the order length could be short and the interactions have a higher dependence on the previous sessions. Most of these existing approaches are session-based recommenders that explicitly consider the time component or short preference interactions in the training process. In particular, using the explicit time interval in the customer history to determine the influence of previous preferences would be inefficient when trying to determine the next customer preferences for a marketing campaign. Some customers could have less inactivity time than others but similar preferences, and similar preference patterns could appear over longer or shorter intervals. Wang et al. [27] presented a more generic approach whereby a deep neural network is employed to extract generic content features from any type of item. After the extraction process, these features are incorporated in a standard collaborative filtering model to enhance the recommendation performance. This approach seems to be particularly useful in settings where there is not sufficient customer-item interaction information. In [6] the user sessions are modeled with RNN. They consider the first click when the user enters a website as the initial input of the RNN. Then, each consecutive click of the user produces a recommendation that depends on all the previous clicks. In the input sequence, they use two different representations. In the first one, each element in the sequence represents a one-hot encoding vector of the actual event.
In the second alternative, each element of the sequence represents all the events in the session so far. They use a weighted sum of the past events, where events are discounted if they occurred earlier. This approach is focused on session recommendation: the input of the first mini-batch is formed from the first events of the sessions, and the output is the second events of the active sessions. The second mini-batch is formed from the second events, and so on, through session-parallel mini-batches. In [24], an RNN is used to make recommendations using only the user's interactions in the current browsing session. The authors use an embedding layer between the input click sequences and the RNN, which maps the one-hot encoded click events into a vector representation. Additionally, they create a new training sample for every prefix of the original sequence, and the model output is a predicted item embedding instead of the probabilities of the different items. The authors report that this model makes predictions using only about 60% of the time needed by the models that predict item probabilities; however, its prediction accuracy is worse. They argue that the cause of this poor performance could be that the quality of the item embeddings is not good enough. The above approaches [6,24] can be mapped to the problem of consecutive buying preferences by treating sessions as subsequent preferences. In both approaches, the embedding of the items gave slightly worse results; therefore, the authors kept the 1-of-n encoding. In Sect. 5, we compare our DeepCBPP approach with these two preference representation approaches, namely, MBS [6] and PSR [24]. In [13], the authors suggest a multi-period, session-based product recommender system, which can learn customers' buying orders and repetitive buying patterns. The authors demonstrated through experimental results that the LSTM is slightly better than the basic RNN and the GRU. As a result, the LSTM is used as the final recommendation model. However, this study is based on a single LSTM layer. Also, it limits the mapping to a set of specific periods, which can affect the learning process for customers who buy less frequently.
Inspired by these results, in Sect. 5 we investigate the following questions: Does our DeepCBPP approach achieve consistently high accuracy when compared with a range of classical sequence prediction models? Does increasing the number of layers contribute to improving the performance of our DeepCBPP approach? Do more complex LSTM architectures improve the DeepCBPP predictions? To address these questions, in the following section, we formally describe our approach for deriving the next customer buying preference based on a multi-class classification LSTM approach. To overcome the limitation in capturing multi-level interactions in scenarios where the order length could be small and the interactions depend strongly on the previous sessions, we exploit a new customer sequence representation based on the most recent items through previous subsequence interaction windows.
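The input representations discussed above for [6] and [24] can be illustrated with a minimal sketch. The function names and the `decay` factor are assumptions made for illustration only; the discount actually used in [6] is not stated in this text.

```python
def one_hot(item, vocab):
    """1-of-n encoding of a single event over an ordered vocabulary."""
    return [1.0 if v == item else 0.0 for v in vocab]


def discounted_session_vector(events, vocab, decay=0.5):
    """Weighted sum of the one-hot vectors of all events seen so far.

    The most recent event gets weight 1, the one before it `decay`,
    then `decay ** 2`, and so on, so that earlier events count less.
    The value of `decay` is a hypothetical choice.
    """
    vec = [0.0] * len(vocab)
    for age, item in enumerate(reversed(events)):
        vec[vocab.index(item)] += decay ** age
    return vec


def prefix_samples(sequence):
    """One (prefix, next item) training pair for every proper prefix."""
    return [(sequence[:k], sequence[k]) for k in range(1, len(sequence))]
```

For a session `["a", "b", "a"]` over the vocabulary `["a", "b", "c"]`, the discounted vector weights the latest click on `a` fully and the earlier clicks progressively less, while `prefix_samples` turns one sequence into several training samples.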
3 DeepCBPP for Next Preference Predictions
In this section, we introduce a new approach for deriving the next customer buying preference based on a multi-class classification LSTM approach. The memory that a recurrent neural network keeps of former computations makes it possible to deal with complex interaction patterns and to reflect the customer's preference behaviors. DeepCBPP learns the customer buying patterns from the sequences of orders over time and predicts the most probable next products or product categories to buy for a specific customer. Figure 1 shows the overall workflow of the training process of our DeepCBPP approach. The schema goes through four main components: the transactional data, the customer sequence file, the input of the LSTM training model, and the output of the training model. These components constitute the inputs and outputs of the three main stages of our scheme: (1) Customer Buying Sequence Transformation (CBST), (2) Multi-level Preference Generation (MPG), and (3) Preference Buying Learning (PBL). The structure of the transactional data corresponds to the structure shown in Table 1 [3]. The Customer Buying Sequence Transformation stage transforms the transactional data into a customer sequence file, which maps each customer's buying history to a unique sequence. In particular, we first transform the data into a semi-structured representation by representing each customer $i$ as the arrangement of orders sorted by date, $(o_{i1}, o_{i2}, o_{i3}, \ldots, o_{ik})$, such that $o_{ij}$ denotes the $j$-th order of customer $i$. Finally, we apply the concatenation operation ($+$) to the product lists $p(o_{ij})$ of the collected orders, following the date order: $\langle p_1, p_2, \ldots, p_q \rangle = \langle p(o_{i1}) + p(o_{i2}) + \cdots + p(o_{ik}) \rangle$, where $p_v \in P$, i.e., the collection of products. Accordingly, we transform the semi-structured transactional data into a structured representation, so that we can learn a sequential model over this structured data.
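The transformation described above can be sketched as follows. This is only an illustration under stated assumptions: the rows follow the layout of Table 1 (restricted to the columns needed here), the date of an order is taken as the earliest timestamp among its rows, products inside one order keep their row order, and the function name `customer_sequences` is hypothetical.

```python
from collections import defaultdict

# Rows patterned on Table 1: (customer, order, order date, product).
ROWS = [
    (1, 1, "2018/05/25 12:24:15", "a"),
    (1, 2, "2018/05/26 15:12:21", "b"),
    (2, 1, "2018/05/26 10:12:21", "b"),
    (2, 1, "2018/05/26 10:12:22", "c"),
    (2, 1, "2018/05/26 10:12:21", "d"),
]


def customer_sequences(rows):
    """Map each customer to one product sequence ordered by order date."""
    orders = defaultdict(dict)
    for customer, order, date, product in rows:
        entry = orders[customer].setdefault(order, {"date": date, "products": []})
        # Assumption: an order is dated by its earliest row. The
        # "YYYY/MM/DD hh:mm:ss" strings sort chronologically as text.
        entry["date"] = min(entry["date"], date)
        entry["products"].append(product)

    sequences = {}
    for customer, by_order in orders.items():
        ordered = sorted(by_order.values(), key=lambda o: o["date"])
        # Concatenate the product lists p(o_i1) + p(o_i2) + ... + p(o_ik).
        sequences[customer] = [p for o in ordered for p in o["products"]]
    return sequences
```

Calling `customer_sequences(ROWS)` yields one flat sequence per customer, which is the kind of structured representation that the later stages consume.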
Fig. 1. General workflow of the training process addressed by DeepCBPP approach.
The customer sequences parsed from the transactional data are used by DeepCBPP to build a training input and to construct an LSTM execution workflow. Definition 1 formally describes how to parse the customer sequence file into the input data for the training process in the Multi-level Preference Generation stage.
Definition 1. (Buying sequence input). Let $P = \{p_1, p_2, \ldots, p_n\}$ be the set of distinct product items (or product category items) from the parsed transactional data. Suppose $\hat{p}_t$ is the customer preference id of the next item to appear after a subsequence of length $w$ of the customer sequence of customer $C_i$, where each $\hat{p}_v$ is in $P$. Given $\tilde{C}_i^w$, the set of all subsequences of length $w + 1$ of the customer buying sequence $C_i$, the buying sequence input consists of $D = \{\tilde{C}_i^w\}$, $i = 1, \ldots, m$.
Table 1. Transactional dataset structure.
Customer  Order  Order date           Product  Amount  Revenue
1         1      2018/05/25 12:24:15  a        1       $2.10
1         2      2018/05/26 15:12:21  b        2       $4.50
2         1      2018/05/26 10:12:21  b        2       $4.50
2         1      2018/05/26 10:12:22  c        2       $5.50
2         1      2018/05/26 10:12:21  d        1       $7.50
Recognizing relevant patterns in long input sequences can turn out to be difficult for the human mind. Traditionally, in many e-commerce applications, customer behavior can be diverse, for instance, in the length of the orders, the purchase frequency, and the number of orders. The Customer Buying Sequence Transformation stage generates customer sequences that capture the order in which preference interactions appear in the historical buying behavior, but not the time component explicitly. Note that modeling the closest preference interactions could give a better understanding of the customer preferences, because interactions can depend strongly on the previous decisions. As a result, the historical buying behaviors can be viewed at the level of multiple consecutive interactions. To be able to analyze different parts of the buying preference history, in the Multi-level Preference Generation stage, each sequence is divided into a set of subsequences (multi-level interactions) using a sliding window approach, which allows looking back at the multi-level interactions of the most recent influential items. Note that $w$ is highly dependent on the application domain. If the average length of the orders per customer is high, setting $w$ to a large value is beneficial. However, $w$ should be set to a small value when the power-law distribution of the order lengths has its inflection point at small values (i.e., in transactional data, the number of orders and the length of the orders follow a power law, which reflects that most orders have a small length, whereas a few of them have a large one). Note that each entry of the classification process is a window of the $w$ most recent items. For a given sequence of $l$ observations, $(l - w) + 1$ subsequences of size $w$ are created. In general, the same preference value may appear several times in an entry, and $\hat{p}_i$ denotes the preference at position $i$ in a customer subsequence window of size $w$. Clearly, $\hat{p}_i$ is strongly dependent on the most recent items that appeared before it [7,24].
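The sliding window step can be sketched as below. The helper names are hypothetical; `training_pairs` follows the reading of Definition 1 in which each subsequence of length $w + 1$ supplies the $w$ most recent items as input and the item that follows them as the label.

```python
def sliding_windows(sequence, w):
    """All contiguous windows of size w.

    A sequence of l observations yields (l - w) + 1 windows,
    and none at all when l < w.
    """
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


def training_pairs(sequence, w):
    """(w most recent items, next item) pairs cut from windows of size w + 1."""
    return [(win[:-1], win[-1]) for win in sliding_windows(sequence, w + 1)]
```

With a customer sequence of five items and $w = 4$, two windows of size 4 exist, but only one of them is followed by an observed next item, so a single training pair is produced. Short sequences therefore contribute few or no samples, which is one reason why $w$ has to match the typical sequence length.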
The proposed customer sequence representation makes it possible to design more informed marketing campaigns based on previous interactions in scenarios where the length of the orders and the buying frequency are typically low. The output of the training process is a model of the conditional probability distribution $\Pr[\hat{p}_t = p_i \mid w]$ for each $p_i \in P$. The prediction process uses the trained LSTM model $M$ to make a prediction and compares the predicted output against the observed preference value that actually appears.
Definition 2. (Buying preference prediction). Let $P = \{p_1, p_2, \ldots, p_n\}$ be the set of distinct product items (or product category items) from the parsed transactional data. Given a sequence of preferences of length $w$ and the model $M$ trained from the buying sequence input $D = \{\tilde{C}_i^w\}$, $i = 1, \ldots, m$, $M$ aims to predict the most probable next buying preference $\hat{p}_{w+1}$.
The training process relies on a small fraction of the customer sequences produced by the parsing step. For each customer subsequence window of length $w$ in the training data, DeepCBPP updates its model of the probability distribution of having $p_i \in P$ as the next preference. After the training is done, we can predict the output for an input $\upsilon = \{\hat{p}_{t-w}, \ldots, \hat{p}_{t-2}, \hat{p}_{t-1}\}$. The output of a prediction (Definition 2) is a probability distribution $\Pr[\hat{p}_t = p_i \mid w] = \{p_1 : pb_1, p_2 : pb_2, \ldots, p_n : pb_n\}$ describing the probability of each item from $P$ appearing as the next item value given the buying history. Our strategy consists of ordering the possible items by their probabilities and treating a predicted preference as an accurate prediction if it is among the top $\mu$ candidates. If a sequence $\upsilon$ is never followed by a particular item $p_i$ in the training stage, then $\Pr[\hat{p}_t = p_i \mid w] = 0$. Similarly, if a sequence $\upsilon$ is always followed by $p_i$, then $\Pr[\hat{p}_t = p_i \mid w] = 1$. A more complicated case is when a sequence $\upsilon$ can be followed by an item from a group of different products; the probabilities of the preferences that appear after $\upsilon$ sum to 1.
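The limiting cases above (probability 0, probability 1, and probabilities that sum to 1) and the top-$\mu$ decision rule can be illustrated with a small sketch. Note the assumption: the distribution here is estimated by simple counting of observed continuations, which only stands in for the trained LSTM model $M$ and is not the DeepCBPP estimator. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def empirical_next_distribution(pairs):
    """Relative frequency of each next item, per observed input window."""
    counts = defaultdict(Counter)
    for window, nxt in pairs:
        counts[tuple(window)][nxt] += 1
    return {
        window: {item: c / sum(ctr.values()) for item, c in ctr.items()}
        for window, ctr in counts.items()
    }


def top_mu(distribution, mu):
    """The mu most probable items (ties broken by item name, an assumption)."""
    ranked = sorted(distribution.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:mu]]


def is_hit(distribution, observed, mu):
    """A prediction counts as accurate if the observed item is in the top mu."""
    return observed in top_mu(distribution, mu)
```

Items that never follow a window are simply absent from its dictionary, which corresponds to probability 0. Raising $\mu$ can only turn misses into hits, never the reverse.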
4 LSTM Model Architectures
This section provides a literature review of LSTM-based approaches that have been used to solve sequence prediction tasks. Likewise, we provide a brief discussion of some relevant characteristics of the LSTM architectures used. An LSTM encodes more intricate patterns and maintains a long-range state over a sequence [22]. As shown in Fig. 2, the LSTM is composed of a memory cell (a vector) and three gates (i.e., a forget gate $f$, an input gate $i$, and an output gate $o$). The gates are vectors to which the sigmoid function is applied, so that their values lie between 0 and 1. These vectors are then multiplied by another vector, so the gates decide how much of that other vector passes through. Conceptually, each gate has a different function. The first step is to decide what information will be thrown away from the previous cell state. This operation is performed by the forget gate, which is defined as:
$$f_t = \sigma(W_{if} x_t + b_{if} + W_{hf} h_{t-1} + b_{hf}) \tag{1}$$
$$i_t = \sigma(W_{ii} x_t + b_{ii} + W_{hi} h_{t-1} + b_{hi}) \tag{2}$$
where $h_{t-1}$ represents the output at time $t-1$. As a result, the forget gate defines which information of the previous hidden state is no longer needed. Then, the input gate decides which information in the current input is relevant for updating the hidden state (i.e., what new information should be added to the cell state, as shown in Eq. 2). Note that, first, a sigmoid function decides which information in the input should be updated. Second, a tanh function, as shown in Eq. 3, generates the vector $g_t$ of the candidate state values, which could be added to the cell state. The combination of the input gate $i_t$ and $g_t$ represents the current memory that can be used for updating the cell state $c_t$, as shown in Eq. 4. Significantly, the internal state represents the accumulated state that gets updated from the previous step's output. As a result, the internal state is a selective memory of the past observed preferences, while the hidden state expresses, through an internal learned representation, the overall state of the processed sequence. In other words, the hidden state keeps track of the essence of all preferences that have been observed in the sequence. The weight matrices and the bias vectors are denoted by $W$ and $b$, respectively, and $\odot$ denotes the element-wise product.
$$g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t-1} + b_{hg}) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t \tag{4}$$
Fig. 2. The stacked LSTM architecture.
Thanks to the forget gate and the input gate, an LSTM can not only store long-term memory but also filter out useless information. Finally, the output gate controls which information of the newly computed hidden state goes to the output vector of the network. More specifically, the output of the LSTM block is based on $c_t$ and is controlled by the output gate, which decides what information, $o_t$, should be exported. The process is described as:
$$o_t = \sigma(W_{io} x_t + b_{io} + W_{ho} h_{t-1} + b_{ho}) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$
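Equations 1 to 6 can be traced in a single time step with the following sketch. It is written in plain Python for readability, not efficiency; the layout of `params` (one tuple of input weights, hidden weights, and two biases per gate) and the function names are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W_x, x, W_h, h, b_x, b_h):
    """Compute W_x x + b_x + W_h h + b_h row by row (matrices as lists of rows)."""
    return [
        sum(wx * xi for wx, xi in zip(row_x, x)) + bx
        + sum(wh * hi for wh, hi in zip(row_h, h)) + bh
        for row_x, row_h, bx, bh in zip(W_x, W_h, b_x, b_h)
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params[k] = (W_ik, W_hk, b_ik, b_hk) for k in 'f', 'i', 'g', 'o'."""
    pre = {}
    for gate in "figo":
        W_x, W_h, b_x, b_h = params[gate]
        pre[gate] = affine(W_x, x, W_h, h_prev, b_x, b_h)

    f = [sigmoid(z) for z in pre["f"]]        # Eq. 1, forget gate
    i = [sigmoid(z) for z in pre["i"]]        # Eq. 2, input gate
    g = [math.tanh(z) for z in pre["g"]]      # Eq. 3, candidate values
    o = [sigmoid(z) for z in pre["o"]]        # Eq. 5, output gate

    # Eq. 4: keep part of the old cell state and add part of the candidates.
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    # Eq. 6: expose a gated, squashed view of the cell state.
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c
```

A quick sanity check: with all weights and biases set to zero, every gate equals 0.5 and the candidate vector is zero, so the step halves the previous cell state regardless of the input.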
A single LSTM layer is capable of learning time dependencies, but a chain-like LSTM module, named Stacked LSTM (S-LSTM), is more suitable for processing time sequence data. Figure 2 shows that a stacked LSTM consists of multiple LSTM layers that take signals as input in time order. Each gating function is parameterized by a set of weights to be learned. The expressive capacity of an LSTM is determined by the number of memory units (i.e., the dimensionality of the hidden state vector $h$). The training step entails finding proper assignments for the weights. After that, the final output of the sequence of LSTMs produces the desired label (output) that comes with the inputs in the training dataset. A series of LSTM blocks forms an unrolled version of the recurrent model in one layer, as shown in Fig. 2, where a stacked LSTM is composed of a series of LSTM layers. In particular, an LSTM Encoder and Decoder model, named E-LSTM-D, combines the architecture of the encoder-decoder model with the stacked LSTM, as shown in Fig. 3. The first LSTM model, named the encoder, processes a customer preference input and generates an encoded state. This state summarizes the information in the input sequence. The second LSTM model, named the decoder, uses the encoded state to produce an output.
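The data flow of the stacked and encoder-decoder arrangements described above can be sketched independently of the gate equations. To keep the example short, a toy scalar recurrent cell is used as a stand-in; it is explicitly not an LSTM. How the decoder is fed (here: no external input, driven only by the encoded state) is also an assumption, as are all names.

```python
import math


def toy_cell(weight_x, weight_h):
    """Stand-in recurrent cell (NOT an LSTM): h_t = tanh(wx * x_t + wh * h_{t-1})."""
    def step(x, h):
        return math.tanh(weight_x * x + weight_h * h)
    return step


def run_layer(cell, inputs, h0=0.0):
    """Unroll one recurrent layer over time and return every hidden state."""
    states, h = [], h0
    for x in inputs:
        h = cell(x, h)
        states.append(h)
    return states


def stacked(cells, inputs):
    """Stacked layers: the hidden sequence of one layer feeds the next layer."""
    sequence = inputs
    for cell in cells:
        sequence = run_layer(cell, sequence)
    return sequence


def encode_decode(encoder_cells, decoder_cell, inputs, steps=1):
    """The encoder's final state summarizes the input; the decoder starts from it."""
    encoded = stacked(encoder_cells, inputs)[-1]
    outputs, h = [], encoded
    for _ in range(steps):
        h = decoder_cell(0.0, h)  # assumption: decoder driven by its state only
        outputs.append(h)
    return outputs
```

In a real model, each `toy_cell` would be replaced by an LSTM layer and the decoder output would pass through a classification layer over $P$.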
Fig. 3. The E-LSTM-D architecture.
In closing, evaluating the impact of the LSTM architecture selection is essential to obtain accurate predictions with our DeepCBPP approach.
5 Performance Evaluation
In this section, the proposed DeepCBPP is evaluated on 12 datasets from a real-world problem provided by anonymous European stores [3]. Table 2 outlines the number of transactions (T), products (P), customers (CU), orders (O), product sub-categories (PC), and categories (CA) for each dataset, which follows the structure shown in Table 1. We conduct extensive experiments, evaluating the performance of our general deep approach by fine-tuning the model structure (i.e., changing the LSTM model architecture), and comparing our DeepCBPP approach with the five baseline sequence prediction (SP) methods mentioned before. The evaluated SP models learn product categories in the datasets detailed above.
Table 2. Characterization of the European store datasets.
Dataset  T       CU     P      O      CA  PC
D1       3,546   339    1,853  372    18  1,034
D2       33,407  1,353  5,061  2,800  18  1,974
D3       25,934  1,171  4,387  2,483  18  1,816
D4       27,050  1,238  4,546  2,466  18  1,850
D5       26,158  1,190  4,475  2,465  18  1,854
D6       25,858  1,218  4,512  2,487  18  1,845
D7       30,217  1,279  4,847  1,805  18  1,935
D8       29,285  1,266  4,729  2,666  18  1,913
D9       29,201  1,286  4,811  855    18  1,915
D10      29,339  1,303  4,886  900    18  1,952
D11      33,012  1,333  5,040  2,737  18  1,967
D12      28,212  1,219  4,919  2,336  18  1,938
We first investigate the impact of different parameters in DeepCBPP. By default, we use the following parameter values: $w = 4$, $h = 64$, $\mu = 1$, and we investigate the impact of the LSTM architecture in our experiments. In particular, $\mu$ decides how many of the top probabilities in the prediction output are considered. $L$ and $h$ determine the particular configuration of the LSTM model: $L$ denotes the number of layers in the S-LSTM, and $h$ denotes the number of memory units in one LSTM. Recall that $\mu$ decides when the prediction output is accurate (i.e., the next preference appears among the top $\mu$ most probable items), and $w$ is the window size used for training and prediction. The cross-entropy is adopted as the error measure. During the training process, each input/output pair incrementally updates the weights. In particular, the weights are adjusted by using an extension of stochastic gradient descent known as the Adam optimizer [10], and the number of epochs is set to 110. Notice that an input consists of a sequence $\upsilon$ of length $w$, and an output is the most probable preference that comes right after the input sequence $\upsilon$. The datasets are split into training and testing sets using the 10-fold cross-validation technique. For each fold, the training set is used to train each predictor. Accuracy is our main measure to evaluate a given predictor. It is computed as the number of successful preference predictions divided by the total number of test sequences.
The following experiments are dedicated to evaluating how the performance of our proposal is determined by the LSTM architecture selection. Firstly, we investigate the influence of the number of layers in the S-LSTM architecture. Note that $L = 1$ reflects a simple LSTM architecture. From the accuracy results in Fig. 4, we can draw some interesting conclusions: (i) the performance of our approach increases in general by using the S-LSTM architecture (i.e., more than one layer in the LSTM model), (ii) this behavior is similar for different numbers of candidates $\mu$, and (iii) increasing $\mu$ produces a significant increase in accuracy. Likewise, further increasing the number of layers from 3 to 5 yields similar performance; for the rest of the experiments, the S-LSTM architecture is therefore considered with $L = 2$ (S-LSTM(L=2)).
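The evaluation protocol above (cross-entropy as the error measure, 10-fold cross-validation, accuracy as the ratio of successful predictions) can be sketched as follows. The helper names are hypothetical, and the folds are formed by simple interleaving without shuffling, which is an assumption; the text does not say how the folds were drawn.

```python
import math


def cross_entropy(probabilities, target):
    """Negative log-probability that the model assigns to the observed item."""
    return -math.log(probabilities[target])


def k_fold_indices(n_samples, k=10):
    """Split range(n_samples) into k disjoint test folds of near-equal size.

    Returns a list of (train_indices, test_indices) pairs. If n_samples < k,
    some test folds are empty.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [j for j in range(n_samples) if j not in held_out]
        splits.append((train_idx, test_idx))
    return splits


def accuracy(predicted, observed):
    """Successful preference predictions divided by the number of test sequences."""
    hits = sum(1 for p, o in zip(predicted, observed) if p == o)
    return hits / len(observed)
```

Each sample appears in exactly one test fold, so averaging the per-fold accuracies uses every sequence once for testing.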
Fig. 4. Accuracy of the DeepCBPP approach using S-LSTM architecture for different layer settings (L) and number of candidates (µ ∈ [1 : 5]).
Secondly, we investigate the influence of the number of layers in the E-LSTM-D architecture. Note that both $L = 1$ and $L = 2$ reflect a simple LSTM architecture. From the accuracy results in Fig. 5, we can conclude that the settings $L = 1$ and $L = 2$ produce similar performance. To compare the performance of the simple LSTM (named LSTM(L = 1)), S-LSTM, and E-LSTM-D architectures, we set the number of layers in both the encoder and the decoder to 1 or 2, which gives the E-LSTM-D(L = 1) and E-LSTM-D(L = 2) architectures, respectively.
Fig. 5. Accuracy of the DeepCBPP approach using E-LSTM-D architecture for different layer settings (L) and number of candidates (µ ∈ [1 : 5]).
Fig. 6. Accuracy of the DeepCBPP approach for different LSTM architectures.
The results in Fig. 6 reveal that E-LSTM-D has the best performance. For that reason, it is set as the base LSTM architecture of our approach.
Fig. 7. Accuracy of the DeepCBPP approach over datasets by considering deep baseline approaches for SP.
Fig. 8. Accuracy of the DeepCBPP approach over datasets by considering baseline methods for SP.
The goal of the third experiment is to get an overview of the performance of the DeepCBPP approach and of the preference representation approaches MBS [6] and PSR [24]. Figure 7 shows that the proposed approach obtains the best performance. Similarly, the goal of the last experiment is to get an overview of the performance of the DeepCBPP approach against the baseline models CPT+, DG, TDAG, LZ78, and AKOM. For all prediction models, we have empirically attempted to set their parameters to optimal values. Figure 8 shows the prediction accuracy obtained by each model when $\mu = 1$. The results indicate that DeepCBPP offers significantly higher accuracy than the baseline models while also being more consistent across the various datasets.
Fig. 9. Lift of the DeepCBPP approach over datasets by considering baseline methods for SP.
Additionally, we employ the lift measure to assess the performance achieved by an SP model relative to the accuracy obtained without an SP model. Lift is defined as the ratio of the accuracy obtained with an SP model to the accuracy obtained when the next preference is estimated as the product category with the highest frequency in the considered sequence. The higher this ratio, the better. Specifically, values greater than one are desired, since they express that the model is effective. Figure 9 shows that the proposed approach obtains the best performance: DeepCBPP outperforms the baseline methods, offering higher lift values across the datasets.
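The lift measure defined above can be sketched as follows. The function names are hypothetical, and ties in the frequency baseline are broken by first occurrence in the window, which is an assumption not stated in the text.

```python
from collections import Counter


def most_frequent_baseline(window):
    """Predict the category that occurs most often in the input sequence."""
    counts = Counter(window)
    best = max(counts.values())
    # Assumption: ties are broken by first occurrence in the window.
    return next(item for item in window if counts[item] == best)


def accuracy(predictions, observed):
    return sum(p == o for p, o in zip(predictions, observed)) / len(observed)


def lift(model_predictions, windows, observed):
    """Accuracy with the SP model over accuracy of the frequency baseline."""
    baseline_predictions = [most_frequent_baseline(w) for w in windows]
    return accuracy(model_predictions, observed) / accuracy(baseline_predictions, observed)
```

A lift of 1.5, for example, means the model is right one and a half times as often as the frequency rule. The ratio is undefined when the baseline never predicts correctly, in which case this sketch raises `ZeroDivisionError`.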
6 Conclusions and Future Work
In this paper, we presented DeepCBPP, a new deep neural network approach for predicting buying preferences, based on the LSTM neural network and a multi-class classification formulation. DeepCBPP automatically learns behavioral patterns from transactional buying histories and predicts customer preferences without sophisticated feature engineering by combining the new multi-level sequence representation with the LSTM model. The newly proposed customer sequence representation, which is the basis of the data transformation process, captures multi-level interactions in scenarios where the length of the orders could be short and the interactions depend strongly on the previous sessions. As a result, the problem is reduced to a sequence prediction problem. We have illustrated the impact of the LSTM architecture selection, as the training model of our approach, on reaching the best performance. A closer inspection of the predictions over 12 real-world datasets reveals that E-LSTM-D contributes to obtaining higher accuracy values. It is also observed that DeepCBPP reaches the best performance when compared with state-of-the-art sequence prediction methods. From a business perspective, our approach can be used to design more informed marketing campaigns based on previous interactions in scenarios where the orders are typically small and the purchase frequency and the number of orders are low. For future work, we propose to extend the applications of DeepCBPP to the prediction of hierarchical preferences based on the hierarchical description of products.
References
1. Bappy, J.H., Simons, C., Nataraj, L., Manjunath, B.S., Roy-Chowdhury, A.K.: Hybrid LSTM and encoder-decoder architecture for detection of image forgeries. IEEE Trans. Image Process. 28(7), 3286–3300 (2019)
2. Devooght, R., Bersini, H.: Collaborative filtering with recurrent neural networks. arXiv preprint arXiv:1608.07400 (2016)
3. Fuentes, I., Nápoles, G., Arco, L., Vanhoof, K.: Customer interaction networks based on multiple instance similarities. In: Abramowicz, W., Klein, G. (eds.) Lecture Notes in Business Information Processing. Business Information Systems, vol. 389, pp. 279–290. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-53337-3_21
4. Gueniche, T., Fournier-Viger, P., Raman, R., Tseng, V.S.: CPT+: decreasing the time/space complexity of the compact prediction tree. In: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 625–636. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-18032-8_49
5. Guidotti, R., Monreale, A., Nanni, M., Giannotti, F., Pedreschi, D.: Clustering individual transactional data for masses of users. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 195–204 (2017)
6. Hidasi, B., Karatzoglou, A., Baltrunas, L., Tikk, D.: Session-based recommendations with recurrent neural networks. CoRR abs/1511.06939 (2015)
7. Hidasi, B., Karatzoglou, A., Baltrunas, L., Tikk, D.: Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015)
8. Jing, H., Smola, A.J.: Neural survival recommender. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pp. 515–524 (2017)
9. Kaminskas, M., Bridge, D., Foping, F., Roche, D.: Product-seeded and basket-seeded recommendations for small-scale retailers. J. Data Semant. 6(1), 3–14 (2017)
10. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR) (2015)
11. Laird, P., Saul, R.: Discrete sequence prediction and its applications. Mach. Learn. 15(1), 43–68 (1994)
12. Lang, T., Rettenmeier, M.: Understanding consumer behavior with recurrent neural networks. In: Proceedings of the 3rd Workshop on Machine Learning Methods for Recommender Systems (2017)
13. Lee, H.I., Choi, I.Y., Moon, H.S., Kim, J.K.: A multi-period product recommender system in online food market based on recurrent neural networks. Sustainability 12(3), 969 (2020)
14. Li, Y., Liu, T., Jiang, J., Zhang, L.: Hashtag recommendation with topical attention-based LSTM. In: COLING (2016)
15. Li, Z., Zhao, H., Liu, Q., Huang, Z., Mei, T., Chen, E.: Learning from history and present: next-item recommendation via discriminatively exploiting user behaviors. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1734–1743 (2018)
16. Loyola, P., Liu, C., Hirate, Y.: Modeling user session and intent with an attention-based encoder-decoder architecture. In: Proceedings of the Eleventh ACM Conference on Recommender Systems, pp. 147–151 (2017)
17. Michael, J., Labahn, R., Grüning, T., Zöllner, J.: Evaluating sequence-to-sequence models for handwritten text recognition. In: Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), pp. 1286–1293. IEEE (2019)
18. Monteserin, A., Armentano, M.G.: Influence-based approach to market basket analysis. Inf. Syst. 78, 214–224 (2018)
19. Padmanabhan, V.N., Mogul, J.C.: Using predictive prefetching to improve world wide web latency. ACM SIGCOMM Comput. Commun. Rev. 26(3), 22–36 (1996)
20. Pitkow, J., Pirolli, P.: Mining longest repeating subsequences to predict world wide web surfing. In: Proceedings of USENIX Symposium on Internet Technologies and Systems, p. 1 (1999)
21. Reutterer, T., Hornik, K., March, N., Gruber, K.: A data mining framework for targeted category promotions. J. Bus. Econ. 87(3), 337–358 (2016). https://doi.org/10.1007/s11573-016-0823-7
22. Sundermeyer, M., Schlüter, R., Ney, H.: LSTM neural networks for language modeling. In: Proceedings of the 13th Annual Conference of the International Speech Communication Association (2012)
23. Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, vol. 2, pp. 3104–3112 (2014)
24. Tan, Y.K., Xu, X., Liu, Y.: Improved recurrent neural networks for session-based recommendations. In: Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pp. 17–22 (2016)
25. Tax, N., Verenich, I., La Rosa, M., Dumas, M.: Predictive business process monitoring with LSTM neural networks. In: Dubois, E., Pohl, K. (eds.) International Conference on Advanced Information Systems Engineering, pp. 477–492. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59536-8_30
26. Unger, M., Shapira, B., Rokach, L., Livne, A.: Inferring contextual preferences using deep encoder-decoder learners. New Rev. Hypermedia Multimedia 24(3), 262–290 (2018)
27. Wang, H., Wang, N., Yeung, D.-Y.: Collaborative deep learning for recommender systems. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1235–1244 (2015)
28. Zeyer, A., Bahar, P., Irie, K., Schlüter, R., Ney, H.: A comparison of transformer and LSTM encoder decoder models for ASR. In: Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 8–15. IEEE (2019)
29. Zhang, S., Yao, L., Sun, A., Tay, Y.: Deep learning based recommender system: a survey and new perspectives. ACM Comput. Surv. (CSUR) 52(1), 1–38 (2019)
30. Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)
Achieving Trust in Future Human Interactions with Omnipresent AI: Some Postulates
Peer Sathikh, Zong Rui Dexter Fang, and Guan Yi Tan
School of Art, Design and Media, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798, Singapore
[email protected]
Abstract. Trust is a fundamental ingredient of interaction, especially in high-responsibility roles where Artificial Intelligence (AI) is predicated to use its faculties to empower a user. Present AI systems are too limited in their programmes and approaches, as they appear to lack an ability to perceive their purpose and learn the user's needs. This leaves much uncertainty when the system is engaged in processes with a wide range of possibilities, such as driving a car. With current human-computer interaction methods, users find it difficult to ascertain whether the AI is in control of the situation. It is hoped that this paper will help AI researchers who are developing a direction for such interactions to understand the basis behind a trustworthy social AI. This paper presents postulates on future human interaction with what we call omnipresent AI, based on trust. This is done by establishing elements of interaction which are present in human-to-human communication, based on Speech-Act and Communicative Action Theory. The factors which contribute to an Omnipresent AI being recognised as an intelligent being are also laid out, which together form the principles behind trustworthy Human-AI interaction. The paper then proceeds to propose a form which the omnipresent AI could take, based on earlier studies done by the authors.
Keywords: Artificial Intelligence · Human-computer interaction · Intelligent omnipresent AI · Ambient computing · Interaction design · Natural communication
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 700–718, 2022. https://doi.org/10.1007/978-3-030-82193-7_47
1 Introduction
In the future, it is feasible to imagine our everyday lives being assisted by AI agents. As the cognitive abilities of AI continue to improve, it may operate autonomously at higher levels, such as vehicle operation and customer service. Assuming such an AI agent possesses a certain set of capabilities and prerequisites to perform these tasks effectively, how would such an AI behave and operate in our physical world, what kind of communicative capabilities would it have, and, most importantly, how can we best communicate with such a being? To begin, we look into establishing the prerequisites to bring about natural interactions with an AI Being. Today's Human-Computer interaction is defined by Card [1] as, “…the user and the computer engaged in a communicative dialogue whose purpose is the accomplishment of some task.” This may be described as a dialogue between the computer/artificial intelligence (AI) and the user in which both parties exchange information with each other. This interaction takes place through an interface, which traditionally comprises devices such as keyboards, mice and buttons. However, as computers are now adopted in many previously analog devices, such as home appliances, to accomplish a vast range of tasks, interfaces have expanded to accommodate these new interactions. In this new mode, recent advancements such as cloud computing and machine learning have enabled computers to interact using several devices at once, which is ideal in service-provider roles such as self-driving cars. Card [1] explains that in this mode of interaction, the user is no longer the operator but instead works with the computer to accomplish a task. In this paper, we propose this as the first criterion of an Omnipresent AI; further work is still required to bring it to an AI Being that exudes a sense of presence, such that it possesses the ability to comprehend and communicate naturally with its user. As mentioned in the abstract, trust is required from its users for the AI to perform these high-responsibility roles. Sundar [2] suggests that a new mode of interaction has to be developed to achieve the level of trust needed from its user, which according to Jhaver et al. [3] is lacking today due to rising phenomena such as “algorithmic anxiety”, where users do not know how programmes assess their parameters to serve them. To achieve trust, human users have to be able to interact naturally with future AI.
Drawing on concepts from interaction models and from speech-act and communicative-action theories, the authors propose three postulates on which future interactions with AI can be based in order to develop trustworthy AI Beings. Finally, an example of current work by the authors is presented, in which these postulates are applied in a study involving participants in a simulated commute in an autonomous vehicle driven by an omnipresent artificial intelligence. It is hoped that these postulates may set a direction for the development of interaction between humans and AI.
2 Defining Omnipresent AI

2.1 What is an Omnipresent AI?

We propose the term “Omnipresent AI” to describe AI which complements the needs of its users with high levels of autonomy by controlling several devices in a particular area at once. Early examples of omnipresent AI would be “smart assistants” such as Amazon’s Alexa, Google’s Assistant, Apple’s Siri, and social media chat bots. These examples share one thing in common: their core functionality as a service is based on communicating with and understanding the needs of their human users, wherever they may be, validating the communicative dialogue concept proposed by Card in the introduction. Fundamentally, an AI is expected to empower users through the automation of routine tasks, thus freeing up time and resources for more demanding tasks (Maedche et al.) [4]. As automated processes increase in sophistication, AIs are required to manage the processes as well as coordinate the desired outcome with human users. Because such tasks demand greater sophistication and comprehension, more expressions and nuances are required than the level of interaction offered by present-day omnipresent AI, such as automated voice commands.
P. Sathikh et al.
An ideal task for an Omnipresent AI could be customised customer service, which comes today in the form of social robots. To elaborate, the ideal aim of developing social robots is to have an entity which exists in a complex, dynamic and social environment having the ability to behave in a manner that reconciles the goals of itself and its user (Duffy) [7]. Social robots are defined as interactive machines which communicate and interact with us to achieve its instrumental or functional purpose (Sundar; Shaw-Garlock) [2, 5]. Aarts et al. [8] define such a system as an adaptive environment made up of devices with the capability to recognise the user, identify their needs, and interact in their interest. These are considered a form of omnipresent AIs as they assume no physical form, and instead, occupy several “smart” devices located within an adaptive environment and coordinate them so that different functions may be fulfilled. This makes such AI suitable for use in smart homes, museum guides or in autonomous vehicles. Currently most social robots we see today are assigned to these tasks yet they are capable of only performing low levels of interaction as they are utilitarian in nature, such as the “smart assistants”, answering machines and ATMs. These provide a low level of interaction as they were not designed for deeper interaction with humans, for example on an emotional level (Zhao) [6]. In the present day, we observe devices such as the abovementioned “smart devices” which bring about the concept of social robotics through ambient computing. To begin addressing this gap in human-AI interactions, there is a need to understand the factors behind a human user accepting a more complex form of interaction with an AI agent. A study done by Xu et al. [9] on users’ acceptance of a vehicle driven by an AI driver illustrates this in a three-pronged structural model (Fig. 1). For passengers to consider using a self-driving car, trust is required. 
Trust is built through perceived usefulness, ease of use and safety. Abdul [10] echoes this by stating that for an AI system to gain acceptance, users need to trust the AI, be able to understand how the system works, and feel in control.
Fig. 1. Three-pronged structural model explaining acceptance of automated vehicle behavior (Xu et al.) [9]
As stated by Xu et al., effective interaction addresses the three prongs of trust because it conveys intentions between users and AI, and it leads to willing adoption of the system. From the literature by Xu et al. and Abdul, the authors propose
three conditions to be met in order for an omnipresent AI to interact effectively with its users.

Condition 1: An omnipresent AI does not require a physical form but has to be recognised as an intelligent presence. Due to the nature of today’s computing, an AI does not require a physical form, as it exists within a cloud environment. The user perceives its existence through a concept of its presence, as it interacts with them and performs tasks. The AI also functions by networking with the different devices which connect to form the environment. The AI is theoretically present everywhere at the same time in the space which envelops the user. This helps the user to trust the AI.

Condition 2: An omnipresent AI must be able to comprehend the goals and needs of its tasks and users. As these beings are speculated to be able to autonomously perform a wide range of tasks and to encounter a variety of unforeseen circumstances, it is essential that they have the ability to perceive, experience and learn so as to address their users’ needs more effectively (Yang et al.) [11]. To assist its users, there is a need to balance the relationship between human and artificial autonomy (Floridi) [12]. When the AI is able to comprehend its user’s needs and its task, this supports perceived usefulness and the user’s understanding of how the system works.

Condition 3: An omnipresent AI must be able to express its intentions effectively. As an intelligent subject with its own agency, an Omnipresent AI should be able to express itself clearly with the communication modes at its disposal such that the user is able to understand its intentions. This allows the user to feel in control, and facilitates ease of use.

To begin elaborating, we will first define the form of interaction needed. Next, we will discuss the goals to be achieved using these interaction models.

2.2 Interaction Models for Omnipresent AI

In Sect.
2.1 we describe omnipresent AI as social robots expected to communicate and interact with their users to fulfil their tasks. Dubberly [13] defines interaction as a way of framing the relationship between people and objects designed for them. The basics of such interactions may be illustrated by Verplank’s [14] “How do you… feel-know-do?” model of interaction, where how one feels or does bridges the gap between user and system. As mentioned in Sect. 2.1 (Maedche et al.) [4], an AI is a tool to empower its human user, and thus, following Dubberly, the AI is in a framed relationship with its user. This fulfills Condition 1. Maldonado [15] describes traditional AI systems as a feedback loop of information running through the human and back into the system: the human programmes the system, the system executes the tasks, and the human once again programmes the system based on the result obtained. Present AI systems like the one described by Maldonado can thus be classified as a self-regulated, first-order system (Fig. 2, self-regulating system), where the system seeks to attain a goal which is determined by a user outside the system. More advanced AIs which are capable of learning, such as Boston Dynamics’ Spot described by Yang et al., comprise a self-regulated system inside another self-regulated system. This is known as a second-order system, where it monitors the
first self-regulated learning system, measuring and setting the goal of the first order system (Fig. 2, learning system). An example of this is how Spot responds when dealing with a kick to its side where Spot’s computer, through the pursuance of its own goal, and testing of options learns what would allow it to succeed to stand upright again. In Spot’s example, the human user, through interacting with the system several times, learns how the system reacts, but the system has not learnt how to modify its approach when encountering similar events.
Fig. 2. Self-regulating systems vs learning systems (Dubberly) [13]
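The distinction between the first-order and second-order systems in Fig. 2 can be made concrete with a minimal sketch. The class names, the proportional `gain` and the goal-adjustment `rate` are illustrative assumptions and are not taken from Dubberly [13] or Maldonado [15]; the sketch only shows that the inner loop pursues a goal it does not choose, while the outer loop observes the outcome and resets that goal.

```python
from dataclasses import dataclass


@dataclass
class SelfRegulatingSystem:
    """First-order loop: acts to close the gap between a measurement and a goal."""
    goal: float          # set from outside the system (e.g. by the user)
    gain: float = 0.5    # hypothetical correction strength

    def step(self, measurement: float) -> float:
        # Compare the measurement with the goal and return a corrective action.
        return self.gain * (self.goal - measurement)


@dataclass
class LearningSystem:
    """Second-order loop: monitors the first loop and sets its goal."""
    inner: SelfRegulatingSystem
    desired_outcome: float   # the outer loop's own goal
    rate: float = 0.5        # hypothetical goal-adjustment rate

    def step(self, measurement: float, outcome: float) -> float:
        # Move the inner goal according to how far the observed outcome
        # is from the desired one, then let the inner loop act.
        self.inner.goal += self.rate * (self.desired_outcome - outcome)
        return self.inner.step(measurement)


inner = SelfRegulatingSystem(goal=20.0)
first_order_action = inner.step(measurement=18.0)

outer = LearningSystem(inner=inner, desired_outcome=10.0)
second_order_action = outer.step(measurement=18.0, outcome=8.0)
```

In this sketch the first-order system alone can never change what it is trying to achieve; only the wrapping system can, which is the sense in which it "measures and sets the goal" of the first-order system.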
No amount of simulative training is adequate to prepare the system to tackle the infinite possibilities of open-world problem solving. We believe the proposed omnipresent AI which is able to interact with its human user in the most natural way possible would use the “Conversing” model (Fig. 3) as proposed by Dubberly, Pangaro and Haque to improve its situational skills. In this model the output of each learning system becomes the input for the other, allowing the system to be adaptable.
Fig. 3. Conversing systems (Dubberly) [13]
In the model of Conversing Systems, the AI and the human user are signaling each other, asking questions or giving instructions, discovering which actions may maintain their goals while exchanging information of common interest. This model is a suitable basis for the architecture of omnipresent AIs, as the processes are built on mutual understanding between both parties in the interaction. As AI is speculated to autonomously aid humans in the fulfilment of tasks, Floridi uses the analogy of it as a partner. Wickramasinghe [16] adds to this by suggesting that future AI systems approach human needs with conversational questions such as why, when, and how. This fulfills Condition 2. In the next section we will analyse the conversing system through an example of an exchange between two learning systems trying to achieve a common goal, namely a driver and a passenger.

2.3 Interaction and Trust

An Example of Interaction to Achieve a Goal

In situations where one interacts with an unfamiliar subject, it is natural for one to doubt its trustworthiness as there is an unwillingness to be vulnerable (Corritore) [17]. Bishop [18] writes that trustworthiness in computer systems may be ascertained when there is sufficient credible evidence leading one to believe that the system meets a set of given requirements. In the application of omnipresent AI beings, the given requirements would be the being’s capability to comprehend the situation and work with the user to complement the user’s needs (Marcus and Davis) [19]. The following example (Fig. 4) is an exchange between a human driver and a passenger who made a booking on a ride-hailing app. In this instance the human driver and passenger depict a second-order learning system (Conversing):
Fig. 4. Communication between driver and ride-hailing passenger
1. The passenger wants to be picked up from a pick-up point. Driver’s goal is to pick him up.
2. Driver understands that sensors like GPS positioning have errors and may be wrong and wants to confirm that the passenger indicated the correct pick-up point.
3. In this area, there are several different landmarks with the name “Fullerton” lying close by. The driver identifies the relevant landmarks; those within the subset of buildings with pick-up points. He narrows it down to 3 possible locations for suggestion to the passenger.
4. Passenger knows the driver took his needs into account, acknowledges the driver’s knowledge of the context, and affirms the ride request.
5. Based on the concept of source interactivity as proposed by Sundar (2020) [2], trust is gained in the human driver.

The conversing system in Fig. 3 may be extrapolated as strong AI behavior as it fulfils Conditions 1 and 2. This is because the AI is able to learn from the inputs of the passenger, such as asking questions and exchanging information to comprehend the situation. Present-day AI in an autonomous vehicle is driven by the Learning System as illustrated by Maldonado [15]. In this case, the passenger by booking a ride sets the goal from outside the system. The system is programmed with assumptions to learn and perform its individual skills like reacting to traffic but would be unable to anticipate and adapt to challenges arising from the performance of its tasks. The following actions could be anticipated:

1. The passenger wants to be picked up from a pick-up point. Driver’s goal is to pick him up.
2. Sensors like GPS indicate location, routing is updated to lead to the location indicated on GPS. Car heads over immediately.
3. The vehicle is unable to understand and prepare for an event where things may go wrong (GPS positioning error, traffic jam as a result of detour).
4. If the passenger cannot be found, the AI driver proceeds to the next programmed process of contacting the passenger to re-confirm the correct pick-up point.
5. Trust is potentially jeopardised due to the inability of the vehicle to learn the passenger’s needs and adapt its response to them.

To understand how an AI system may adopt the conversing model humans use, we need to next understand how an AI could interact with its users in order to increase trust, subsequently fulfilling Condition 3 as set out in Sect. 2.1.

Increasing Trust in Interaction

Preece et al. [20] define a good interaction experience with AI as being one where the user finds the interface effective, easy and pleasurable to use. In the case of an omnipresent social AI where the interface is through voice commands or gesture interpretation, developers need to understand how humans act and react to events as well as how they interact with one another in order to derive a format for AI to follow. This fulfils Condition 3.
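As one illustration of such a format, the confirming behaviour of the human driver in Fig. 4 (steps 2 to 4 of the first exchange) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the `Landmark` type, the function names, the wording of the replies and the place names are all hypothetical and do not come from the study. The point is only the contrast with the second exchange, where the vehicle heads to the GPS position without first checking for ambiguity.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Landmark:
    name: str
    has_pickup_point: bool


def candidate_pickups(query: str, landmarks: List[Landmark], limit: int = 3) -> List[Landmark]:
    """Keep landmarks whose name contains the query and that have a pick-up point."""
    matches = [
        lm for lm in landmarks
        if query.lower() in lm.name.lower() and lm.has_pickup_point
    ]
    return matches[:limit]


def next_action(query: str, landmarks: List[Landmark]) -> str:
    """Drive if the request is unambiguous; otherwise ask the passenger."""
    candidates = candidate_pickups(query, landmarks)
    if len(candidates) == 1:
        return f"Heading to {candidates[0].name}."
    if candidates:
        names = ", ".join(lm.name for lm in candidates)
        return f"Which pick-up point do you mean: {names}?"
    return "I could not find that pick-up point. Where are you waiting?"


# Hypothetical nearby landmarks; the names are placeholders, not real data.
LANDMARKS = [
    Landmark("Fullerton A", True),
    Landmark("Fullerton B", True),
    Landmark("Fullerton C", False),   # no pick-up point, so never suggested
    Landmark("Fullerton D", True),
    Landmark("Harbour Mall", True),
]

print(next_action("Fullerton", LANDMARKS))
print(next_action("Harbour", LANDMARKS))
```

With the placeholder data above, the ambiguous request yields a question listing three candidates, while the unambiguous one yields a direct confirmation.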
To seek an answer on how we may achieve a positive user experience with this new technology, we turn to McCarthy and Wright’s [21] Technology as Experience framework, which accounts for this user experience through four core threads:

1. Sensual - The level of immersion in the interaction
2. Emotional - The emotions experienced regarding the interaction
3. Compositional - The narrative part in the interaction, i.e. the order of steps to achieve a task
4. Spatio-Temporal - The perspective of space and time complementing the experience.

To create an intelligent being which is capable of human-like interaction, the AI has to have the ability to infer meanings from given information that are implied, but not fully explicitly stated (Marcus and Davis) [19]. Humans engage in two different methods of interaction in an initial exchange where there is much uncertainty regarding both parties, namely the concepts of Prediction and Explanation (Berger and Calabrese) [22]. The first, Prediction, is based on Heider’s [23] theory of “making sense”. As there are several alternative ways in which each interactant might behave, each interactant attempts to predict what steps the other party will take and selects the most appropriate response to the predicted action. However, before selecting the most appropriate response, the interactant has to use his/her knowledge of the other and develop predictions based on inferences about their behavior to narrow down the range of alternatives. This is in line with Marcus and Davis’ views about the inadequacy of current second-order machine learning AI systems using simulated training to tackle open-world problems. The second concept, Explanation, concerns the problem of retroactively explaining the other’s behavior.
This involves the target interactant stating something which prompts the other interactants to ask him/her, or others, to redefine the statement so that they may understand him/her better. These two abilities of inference require knowledge of the context so that the right responses may be generated, and they add to the definition of Condition 2. However, there is a delicate balance for interaction to continue (Adams; Altman and Taylor; Homans) [24–26]. While it is rewarding to reduce uncertainty between interactants, if one is able to completely predict the other interactant’s response, boredom is elicited, which ends the interaction.

Combining Both Interaction Theories

Amalgamating McCarthy and Wright’s four threads on technology as experience and Berger and Calabrese’s writings on interaction, it may be concluded that an AI system which is able to elicit trust from the user through interaction has to be able to portray its omnipresence in a pleasant physical manner (Sensual), converse smoothly so as not to create boredom (Emotional), predict possible outcomes and adjust its responses to best facilitate an exchange of information (Compositional), and adjust its pace of conversation or self-portrayal to best suit the interaction (Spatio-Temporal). With these four Core Threads met, the user is then able to achieve a positive response in their experience interacting with an intelligent AI system capable of human-like interaction.
While this may be a lot to ask of AI systems at the present juncture, one may work towards fulfilling the four core threads of Sensual, Emotional, Compositional and Spatio-Temporal by establishing postulates and frameworks, which this paper works towards. The next section discusses the limitations of present-day AI with respect to the conditions set out in Sect. 2.1 with the support of McCarthy and Wright’s [21] Technology as Experience framework.

2.4 The Aspirations of Omnipresent AI

Despite AI technology progressing significantly in recent decades, AI is still far from interacting effectively with, and thus achieving trust from, its human users. One reason is the nature of the current interaction model behind present AI systems. As a result of being either a self-regulated or basic learning system, these AI systems are still only able to solve problems which their programmers have anticipated (Marcus and Davis) [19]. When met with an unexpected circumstance which they do not recognise, they are unable to react appropriately. This results in a system where the user has to feed it an anticipated input. An example is Amazon’s Alexa being entirely capable of operating switches and house locks but inadequate in the nuances of human communication: besides conveying information in a robotic tone, it has difficulty understanding people who speak with different accents, and even when being spoken to, the user has to use specific key words (Ashish) [27]. This fails Condition 3. Such systems are incapable of mastering the conversing model as proposed by Dubberly, and they lack the flexibility to process the near-infinite possibilities of real-world situations which fall outside what is anticipated and programmed by their builders, failing Condition 2. Based on McCarthy and Wright’s [21] Technology as Experience framework, this deficiency in communication leads to:

1. Potential unsettlement in the way of speaking, failing the sensual aspect
2. Frustrations in the user, failing the emotional aspect
3. Inability to predict outcomes and adjust responses to address them, failing the compositional aspect
4. No ability of self-portrayal, failing the spatio-temporal aspect.

With these limitations, current omnipresent AIs fail Condition 1 in garnering their users’ trust to take on high-responsibility service tasks such as driving (Edmonds; The Autonomous Car) [28, 29]. Driving is a high-risk task which requires many cognitive and problem-solving skills, as traffic conditions are unpredictable. Besides controlling the vehicle, the AI in the AV also has to interact with the passenger, which requires it to perform higher-order social tasks such as reciprocating social signals. To achieve the requirements of a social robot which can reliably perform more complex tasks, Marcus and Davis acknowledge that these machines have to have a higher level of comprehension. Another reason that AI systems today lack trust from their users is that current systems are still unable to operate in a decentralised way and have to rely on sharing their users’ exchanges with third-party servers in order to process the information more effectively (Google) [30]; hence users have doubts, as their data might be exploited (Corritore et al.) [31]. For an AI to gain trust with humans, it has to protect the interests and
welfare of other individuals in the team (Groom and Nass) [32]. According to research on users’ perception of migratable AI agents, participants had the highest trust and likeability towards agents who had migrated with their users in different embodiments, as that particular agent knows the user’s preferences (Tejwani et al.) [33]. Whenever information was migrated but the identity of the agent was different, users felt discomfort and distrust because other agents were aware of information shared with a previous agent in another interaction. As a result, a new method of Human-AI interaction drawing inspiration from natural human communication theories has to be developed, keeping in mind the interaction conditions laid out in Sect. 2.1, in order for humans to build trust with them.
3 Towards Postulates of Human-Omnipresent AI Interaction

3.1 Proposing a Natural Communication Method

Human communications involve more than one mode of expression. When we speak, we use facial expressions, prosody and body language to bring extra layers of meaning to what we have said (Muller et al.) [34]. In the case of an omnipresent AI, there seems to be no visible way to discern the action of an AI nor will users understand an AI’s mental language, hence the current lack of trust towards these AI systems. Towards achieving a certain level of natural communication to fulfill Condition 3 of an omnipresent AI, universal principles from existing human-to-human communication theories may be used as a starting point.

Human-Human Communicative Theories

Referring to Abdul [10], users need to be able to understand how the system works, trust it, as well as feel in control for an AI system to gain acceptance. These qualities of communication have been explored in Searle’s Speech Act Theory [35] from a human-to-human communication point of view, which requires a spoken sentence to fulfill the following:

1. Essential Condition
2. Preparatory Context
3. Propositional Content
4. Sincerity.
When a sentence is spoken, its intent is carried through 3 different forces, which are the Locutionary, Illocutionary and Perlocutionary Forces. The Locutionary Force is the words chosen to be spoken, the Illocutionary Force is the intention of the message spoken, and the Perlocutionary Force is the intended effect on the listener as a result of what is being spoken. A successful Speech Act is achieved when the perlocutionary force is achieved on the listener. However, an utterance alone cannot claim to be sincere and authentic by itself, and more discerning acts have to be understood for its intentions to be clear. Figure 5 summarises Searle’s Speech Act theory.
Fig. 5. Summary of Searle’s Speech-Act theory (Sathikh et al.) [59]
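The three forces summarised in Fig. 5 can also be written down as a small data structure, which may help when reasoning about what an AI would need to track for each utterance. This is a minimal sketch under stated assumptions: the `SpeechAct` class, its field names and the string comparison used to judge success are hypothetical simplifications, not part of Searle’s theory [35], which does not prescribe any such representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechAct:
    locution: str          # Locutionary Force: the words chosen to be spoken
    illocution: str        # Illocutionary Force: the intention of the message
    intended_effect: str   # Perlocutionary Force: the intended effect on the listener

    def is_successful(self, observed_effect: str) -> bool:
        # The act succeeds when the intended effect is actually achieved
        # on the listener (simplified here to an exact match of labels).
        return observed_effect == self.intended_effect


# Hypothetical utterance from the ride-hailing exchange in Fig. 4.
confirm = SpeechAct(
    locution="Are you waiting at the main entrance?",
    illocution="confirm the pick-up point",
    intended_effect="passenger confirms location",
)

print(confirm.is_successful("passenger confirms location"))
print(confirm.is_successful("passenger ignores message"))
```

Separating the three fields makes explicit that the same words can carry different intentions, and that success is judged on the listener’s side rather than on the utterance itself.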
In order to bring sincerity and authenticity, Habermas’ Communicative-Action Theory [36] summarised in Fig. 6 builds on this with the introduction of validity claims. These validity claims work towards mutual agreement for both speaker and hearer which are: claim to truth, claim to justice, claim to sincerity and claim to power. These combined and assessed add up towards the creation of a successful speech-act. Claim to power is an exception as the speaker, through sending a signal of authority, uses the power of veto to override any prior Speech Act put forth by the other party.
Fig. 6. Habermas’ communicative-action theory
In addition, effective communication also needs to take social interaction into account. Social interaction is described as the result of a successful parsing and sending of communication signals (Turner) [37]. In Turner’s Theoretical Model of Social Interactions (Fig. 7), he writes that these communication signals are processed by the conceptualisation of a subjective self-image (Deliberative Capacities), the collected definitions and feelings of oneself on the subject (Self-references) and an understanding of the state of the world that can either be internalised or shared with the conversational partner (Stocks of knowledge at hand). These processes drive a chain of mutual role-assignment, framing, validating, accounting, staging and ritualisation practices that form the basis of social conversations.
Fig. 7. Turner’s original theoretical model of social interactions, with groups of related functions emphasized for clarity
Non-verbal Communication In addition to speech, our natural modes of interaction to elicit a successful Speech-Act to foster trust also involve nonverbal communication signals such as body language and cues. For humans, using our physical bodies coupled with millennia of conditioning provided the means to perform non-verbal gestures with ease (Pollick and de Waal; McNeill) [38, 39]. It is also theorized that body language may have been developed before spoken language (Kendon) [40]. Turner’s [37] theoretical model as mentioned in the earlier section includes accompanying modes of expression such as gesturing which are integral to successful social signal interpretation. This is evident as humans use a combination of nonverbal facial expressions and body language to enhance the interpretation of a spoken or written message by providing more semantic data in each interaction (Wharton; Krauss et al.; Hans and Hans) [41–43]. A familiarity from evidence of our anthropomorphic tendencies necessitates the creation of the AI’s own version of body language for more nuanced self-expression (Airenti) [44]. With this idea that human users have to recognise the AI as a discrete individual, communication theories alone do not appear to be adequate in establishing a natural mode of interaction with omnipresent AI beings. To this end, we believe the development of AI needs to move towards a strong sense of presence for effective interactions.
3.2 Presence and Personality

Presence is defined as a perceptual illusion (Slater; Usoh et al.) [45, 46] from a computer-generated environment (Sheridan) [47]. Ideally, when the viewer sees the AI being, they should nominally react as if it were another person before the cognitive system kicks in to remind them that it is just an illusion. Mantovani [48] writes that in situations where a tangible operator is not present, the presence created is one of operative telepresence. He argues that reality is co-constructed in the relationship between actors and their environments through the mediation of the artifacts. Turner’s [37] social interaction theory supports this where the processes of role, stage and frame making drive social interactions. As a result, because presence is mediated by both physical and conceptual tools belonging to a given culture, it can be achieved through the construction of environments where actors may function in a valid way, not necessarily having to strictly mimic a physical presence. However, to have a social AI, its intelligence has to be recognised by its user, through social competence (Isbister) [49]. Humans recognise intelligence as the ability to learn and respond to questions, the ability to think abstractly, as well as to adapt to the environment (Lanz; Sternberg; Sternberg) [50–52]. This is in line with Marcus and Davis’ views on AI inference, and Berger and Calabrese’s views on Prediction and Explanation. To get into a more comprehensive understanding of humans recognising social competence in an AI, Lee [53] states that authenticity is needed in the interaction in order for humans to achieve a high level of intimacy and interpersonal affection with an AI being. This is illustrated in the Authenticity model of Computer-mediated communication which consists of three components:

1. Authenticity of source - is the communicator who they claim to be?
2. Authenticity of message - is it real?
3. Authenticity of interaction - is it unscripted?
To achieve authenticity of source, the AI has to adopt a stereotype of the character which is being portrayed such that the human audience may pick up on the nuances to authenticate the AI interactant. The message not only has to be in line with what the user expects to receive, but also in line with the receiver’s values. Lastly, to achieve authenticity of interaction, Schudson’s [54] conversational ideals of reciprocity and spontaneity are pertinent. It is not enough for the AI to just follow a give-and-take response; the AI’s responses also must not seem scripted. A solution to achieving authenticity in an AI interactant could be the use of personas. Personas are defined in interaction design as “a description of a fictitious user… who does not exist as a specific person but is described in a way that makes the reader believe that the person is real” (Nielsen) [55]. Referring back to Tejwani’s [33] study on migratable AI, a single persona that has the characteristics most preferred by the user could be used by the omnipresent AI to “inhabit” the different environments which come into contact with the user, very much like how a personal butler or driver will always be by the user’s side, standing by to assist them with personalised knowledge of their needs. In order to adopt a persona, the omnipresent AI perhaps needs an understanding of itself as an entity, which brings us to the concept of proprioception.
3.3 Proprioception and the Understanding of Context

Another quality needed for effective social interactions is the ability to recognise the self from the environment, a form of self-awareness (AlQallaf et al.) [56]. This may mean that for an AI to interact with its environment, it has to first differentiate, recognise and situate itself with its body. When the AI is able to come to this form of awareness, it can then view itself as an individual and understand its role in the ecosystem, and may then start creating context-aware interaction. Sundar [2] describes the understanding of context as source interactivity - the ability to not only customise information for the self but also to curate and create content for others. The first step in self-awareness perhaps is to introduce a method by which the omnipresent AI may express itself. For a being to be able to perceive its environment, it requires a mental image of the articulations it has at its disposal to perform the acts of communication (Damasio) [57]. For biological creatures such as human beings, the knowledge of one’s body’s moving parts forms part of the somatosensory system and hence enables proprioception, a phenomenon where the brain monitors the body position through careful accounting of motor commands (Tuthill and Azim) [58]. In the context of an omnipresent AI, the “body” is the set of systems working together, the “brain” is the AI, and the “moving parts” are the ways it expresses itself or uses the devices connected to it to perform its task. Drawing inspiration from human-human communications, Sathikh et al. [59] believe that the medium which the AI inhabits or uses to convey meaning can be constituted as a body, and that it is mutually beneficial to instill proprioception within the AI’s somatosensory map in order for it to communicate in a new yet familiar way. This enables Conditions 1 and 2 of omnipresent AI to be fulfilled.
In summary, the omnipresent AI has to understand the context of itself, the situation and the user, adapt human personalities such as to authenticate itself to its users, adopt human communication theories to portray intent, trustworthiness and sincerity in its speech, then use its faculties to express itself, thus generating an intelligent presence.
4 The Postulates of a Trustworthy Human-Omnipresent AI Interaction
We have proposed and discussed the conditions that an omnipresent AI has to meet in order to interact effectively with its users in a trustworthy manner. These conditions are extracted from the concepts of human-to-human communication theories, non-verbal communication, social presence, personality, proprioception and the understanding of context that were discussed earlier in Sect. 3. Together, they form the basis for the following three postulates as guidelines towards the development of an omnipresent AI for trustworthy interactions:
1. Omnipresent AI should adopt a Persona to be Perceived as a Presence. The omnipresent AI should adopt a consistent identity and persona when it serves the same human user. By utilizing a persona, the AI may begin to be recognised as an interactant despite having no physical form, fulfilling Condition 1 of an omnipresent AI interaction. This addresses the issue of trust identified by Tejwani et al. [33] when the same information is shared among several AI entities. The use of a persona facilitates
additional processes such as proprioception and use of contextual information in social interactions to comprehend its goals and the needs of its users, fulfilling Condition 2 as well.
2. Omnipresent AI should be able to Manage Information in a Manner that Leads to Appropriate Responses. This fulfils the compositional and emotional aspects of McCarthy and Wright’s [21] Technology as Experience framework. When the AI is able to understand social context and comprehend its role in the interaction, it can predict outcomes and adjust its responses to address them. This also helps it fulfil Dubberly’s conversing model [13], whereby both interactants are able to signal each other, asking questions and giving instructions but most of all working together to achieve a common goal, hence reducing frustrations from a lack of common understanding. This fulfils Condition 3 of an omnipresent AI interaction.
3. Omnipresent AI should be able to Communicate Naturally with Users Despite not having a Physical Form. This fulfils the sensual and spatio-temporal aspects of McCarthy and Wright’s [21] Technology as Experience framework. By applying communication theories that come naturally to human beings, the AI will come across as a more relatable entity which can be easily understood and trust-authenticated. The ability to express itself through the communication modes available at hand will also put human users at ease when interacting within the same social environment. This fulfils Condition 1 of an omnipresent AI interaction.
5 Speculative Application of AI-Human Interaction in an Autonomous Vehicle
The authors are currently working towards testing the postulates presented in this paper through experiments conducted in physical (Fig. 8) and virtual mockups of an autonomous vehicle. Participants are tasked with role playing as passengers in a pre-recorded road trip scenario driven by an omnipresent AI driver. The AI Driver is simulated using the Wizard-of-Oz experiment format, where the experimenter, who is hidden from the participant sitting inside the mockup vehicle, pretends to be the AI Driver and engages the participant in conversation that pertains to the context of the road trip. These conversations, which are derived from our earlier discussion in this paper, are augmented with visual and sound design interventions that function as additional modes of expression the AI Driver can use in communication to increase trust. The conversations are recorded and later analyzed both qualitatively and quantitatively to gauge the effectiveness of these design interventions in communicating the AI’s intentions. The results of these experiments are to be published in later proceedings.
Fig. 8. Physical mockup of road trip scenario simulation experiment
6 Conclusion
Omnipresent AI appears to be a new and increasingly popular iteration of human-computer interaction, as such systems are able to operate several devices at once and work with the user to achieve a goal. These AI communicate with their users through voice and gesture identification, and thus have the potential to provide customised customer service, as they are designed to interact with their human users socially to achieve a functional purpose. However, present-day omnipresent AI examples may not elicit the levels of trust required of them to carry out high-responsibility tasks. We begin addressing this issue by proposing three conditions for omnipresent AI to interact with its users. These take into account the AI’s inherent form, the need to be an interactant in a social exchange, the ability to learn and comprehend its own and its user’s tasks, and the need to express its intentions effectively. From these conditions we explored interaction theories, and identified that a social exchange requires a conversing system as proposed by Dubberly [13]. For an interaction to elicit trust, it also has to fulfil McCarthy and Wright’s [21] Technology as Experience framework and Berger & Calabrese’s [22] concepts of Prediction and Explanation. Current AI systems need to work towards meeting these three conditions, as they are as yet unable to interact in a natural manner or comprehend their tasks and users, and they require the help of third-party servers to aid their information processes. Working towards a more natural mode of interaction, we explored the concepts of human communicative theories, nonverbal communication, the use of personality to create presence, and finally proprioception. From these concepts we have developed three postulates for trustworthy interactions between humans and omnipresent AI. To test these postulates, we are conducting experiments which are intended to further strengthen their validity.
It is hoped that these postulates form a first foundation for future developments in AI research.
Acknowledgments. This paper is the result of a grant given by the National Research Foundation of Singapore to NTU-TUMCREATE, a collaboration between Nanyang Technological University and Technical University of Munich.
References
1. Card, S.: The Psychology of Human-Computer Interaction, 1st edn. CRC Press, Boca Raton (2018)
2. Sundar, S.: Rise of machine agency: a framework for studying the psychology of human–AI Interaction (HAII). J. Comput.-Med. Commun. 25(1), 74–88 (2020). https://doi.org/10.1093/jcmc/zmz026
3. Jhaver, S., Karpfen, Y., Antin, J.: Algorithmic anxiety and coping strategies of Airbnb hosts (2018)
4. Maedche, A., et al.: AI-based digital assistants. Bus. Inf. Syst. Eng. 61(4), 535–544 (2019). https://doi.org/10.1007/s12599-019-00600-8
5. Shaw-Garlock, G.: Looking forward to sociable robots. Int. J. Soc. Robot. 1(3), 249–60 (2009). https://doi.org/10.1007/s12369-009-0021-7
6. Zhao, S.: Humanoid social robots as a medium of communication. New Media Soc. 8(3), 401–19 (2006). https://doi.org/10.1177/1461444806061951
7. Duffy, B.: The social robot, Ph.D. thesis, Department of Computer Science, University College Dublin (2000)
8. Aarts, E., Wichert, R.: Ambient intelligence. In: Bullinger, H.J. (eds.) Technology Guide. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-540-88546-7_47
9. Xu, Z., Zhang, K., Min, H.: What drives people to accept automated vehicles? Findings from a field experiment. Transp. Res. Part C Emerg. Technol. 95, 320–334 (2018). https://doi.org/10.1016/j.trc.2018.07.024
10. Abdul, A., Vermeulen, J., Wang, D., Lim, B.Y., Kankanhalli, M.: Trends and trajectories for explainable, accountable and intelligible systems: an HCI research agenda. In: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–18. ACM (2018). https://doi.org/10.1145/3173574.3174156
11. Yang, C., Yuan, K., Zhu, Q., Yu, W., Li, Z.: Multi-expert learning of adaptive legged locomotion. Sci. Robot. 5(49) (2020). https://doi.org/10.1126/scirobotics.abb2174
12. Floridi, L.: Establishing the rules for building trustworthy AI. Nature Mach. Intell. 1(6), 261–262 (2019)
13. Dubberly, H., Pangaro, P., Haque, U.: ON MODELING What is interaction? Are there different types? Interactions 16(1), 69–75 (2009). https://doi.org/10.1145/1456202.1456220
14. Verplank, B.: Interaction Design Sketchbook (2019, unpublished manuscript)
15. Maldonado, T., Bonsiepe, G.: Science and design, Ulm 10/11. J. Ulm School Des. HfG Ulm (1964)
16. Wickramasinghe, C.S., et al.: Trustworthy AI development guidelines for human system interaction. In: 13th International Conference on Human System Interaction (HSI). IEEE (2020)
17. Corritore, L.C., Kracher, B., Wiedenbeck, S.: Int. J. Hum. Comput. Stud. 58(6), 738–758 (2003)
18. Bishop, M.: Computer Security: Art and Science. Addison-Wesley Press, New York (2003)
19. Marcus, G., Davis, E.: Rebooting AI: Building Artificial Intelligence We Can Trust. Vintage, New York (2019)
20. Preece, J., Sharp, H., Rogers, Y.: Interaction Design: Beyond Human-Computer Interaction. Wiley, New York (2015)
21. McCarthy, J., Wright, P.: Technology as Experience. The MIT Press, Cambridge (2019)
22. Berger, C.R., Calabrese, R.J.: Some explorations in initial interaction and beyond: toward a developmental theory of interpersonal communication. Hum. Commun. Res. 1(2), 99–112 (1975). https://doi.org/10.1111/j.1468-2958.1975.tb00258.x
23. Heider, F.: The Psychology of Interpersonal Relations. Wiley (1958). https://doi.org/10.1037/10628-000
24. Adams, J.S.: Inequity in social exchange. In: Berkowitz, L. (ed.) Advances in Experimental Social Psychology, vol. 2, pp. 267–299. Academic Press, New York (1965)
25. Altman, I., Taylor, D.A.: Social Penetration: The Development of Interpersonal Relationships. Holt, Rinehart and Winston, New York (1973)
26. Homans, G.C.: Social Behavior: Its Elementary Forms. Harcourt, Brace and World, Inc., New York (1961)
27. Ashish. https://www.wpxbox.com/current-technical-limitations-amazons-alexa/ (2020). Accessed 15 Jan 2020
28. Edmonds, E.: Three in Four Americans Remain Afraid of Fully Self-Driving Vehicles (2019)
29. The Autonomous Car A Consumer Perspective, pp. 5–7 (2019)
30. Google (2021). https://policies.google.com/privacy. Accessed 15 Jan 2020
31. Corritore, L.C., Kracher, B., Wiedenbeck, S.: Int. J. Hum Comput Stud. 58, 738–758 (2003)
32. Groom, V., Nass, C.: Can robots be teammates? Benchmarks in human-robot teams. Interact. Stud. 8, 483–500 (2007)
33. Tejwani, R., Moreno, F., Jeong, S., Park, H.W., Breazeal, C.: Migratable AI: effect of identity and information migration on users’ perception of conversational AI agents. In: 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pp. 877–884 (2020)
34. Müller, C., Cienki, A., Fricke, E., Ladewig, S., McNeill, D., Tessendorf, S. (eds.) Body-Language-Communication, vol. 1. Walter de Gruyter (2013)
35. Searle, J.R., et al.: Speech Act Theory and Pragmatics. Dordrecht, Holland (1980). Edited by John R. Searle, Ferenc Kiefer, and Manfred Bierwisch
36. Habermas, J.: The Theory of Communicative Action. Beacon Press, Boston (1981). Translated by Thomas McCarthy
37. Turner, J.: A Theory of Social Interaction. Stanford University Press, Stanford (1988)
38. Pollick, A., de Waal, F.: Ape gestures and language evolution. Proc. Natl. Acad. Sci. 104(19), 8184–8189 (2007)
39. McNeill, D.: How Language Began (2012)
40. Kendon, A.: Reflections on the “gesture-first” hypothesis of language origins. Psychon. Bull. Rev. 24(1), 163–170 (2016). https://doi.org/10.3758/s13423-016-1117-3
41. Wharton, T.: Pragmatics and Non-Verbal Communication, pp. 130–131. Cambridge University Press, Cambridge (2009)
42. Krauss, R., Chen, Y., Chawla, P.: Nonverbal behavior and nonverbal communication: what do conversational hand gestures tell us? Adv. Exp. Soc. Psychol. 28, 412–420 (1996)
43. Hans, A., Hans, E.: Kinesics, haptics and proxemics: aspects of non-verbal communication. IOSR J. Hum. Soc. Sci. (IOSR-JHSS) 20(2), 47–52 (2015)
44. Airenti, G.: The development of anthropomorphism in interaction: intersubjectivity, imagination, and theory of mind. Front. Psychol. 9, 2136 (2018)
45. Slater, M.: Immersion and the illusion of presence in virtual reality. Br. J. Psychol. 109(3), 431–433 (2018)
46. Usoh, M., Alberto, C., Slater, M.: Presence: Experiments in the Psychology of Virtual Environments, pp. 6–7 (1996)
47. Sheridan, T.B.: Musings on telepresence and virtual presence. Presence (Camb.) 1, 120–126 (1992)
48. Mantovani, G.: ‘Real’ presence: how different ontologies generate different criteria for presence, telepresence, and virtual presence. Presence 8(5), 540–550 (1999)
49. Isbister, K.: Perceived intelligence and the design of computer characters. In: Proceedings of the Lifelike Characters Conference, Snowbird, UT (1995)
50. Lanz, P.: The concept of intelligence in psychology and philosophy. In: Cruse, H., Dean, J., Ritter, H. (eds.) Prerational Intelligence: Adaptive Behavior and Intelligent Systems Without Symbols and Logic, Volume 1, Volume 2 Prerational Intelligence: Interdisciplinary Perspectives on the Behavior of Natural and Artificial Systems, vol. 3, pp. 19–30. Springer, Dordrecht (2020). https://doi.org/10.1007/978-94-010-0870-9_3
51. Sternberg, R.J.: The concept of intelligence and its role in lifelong learning and success. Am. Psychol. 52(10), 1030–1037 (1997)
52. Sternberg, R.J.: Human intelligence. Encyclopedia Britannica. https://www.britannica.com/science/human-intelligence-psychology. Accessed 01 Oct 2019
53. Lee, E.: Authenticity model of (mass-oriented) computer-mediated communication: conceptual explorations and testable propositions. J. Comput.-Med. Commun. 25(1), 60–73 (2020)
54. Schudson, M.: The ideal of conversation in the study of mass media. Commun. Res. 5, 320–329 (1978). https://doi.org/10.1177/009365027800500306
55. Nielsen, L.: Personas - User Focused Design. Springer, London (2019). https://doi.org/10.1007/978-1-4471-7427-1
56. AlQallaf, A., Aragon-Camarasa, G.: Enabling the Sense of Self in a Dual-Arm Robot. arXiv preprint arXiv:2011.07026 (2020)
57. Damasio, A.: Self Comes to Mind. Vintage, London (2012)
58. Tuthill, J.C., Azim, E.: Proprioception. Curr. Biol. 28(5), R194–R203 (2018)
59. Sathikh, P., Tan, G.Y.: Design for Tomorrow - Volume 3 Proceedings of ICORD 2021, 8th edn. Springer, Mumbai (2021)
A Decentralized Explanatory System for Intelligent Cyber-Physical Systems
Étienne Houzé1,2(B), Jean-Louis Dessalles1,2, Ada Diaconescu1,2, David Menga1,2, and Mathieu Schumann1,2
1 Télécom Paris, Palaiseau, France {etienne.houze,jean-louis.dessalles,ada.diaconescu}@telecom-paris.fr
2 EDF R&D, Palaiseau, France {etienne.houze,david.menga,mathieu.schumann}@edf.fr
Abstract. The Internet of Things (IoT) has been a prominent application for Intelligent Systems in recent years. While the increasing demand for explanations has led to many advances in Explainable Artificial Intelligence (XAI), most solutions focus on systems where a single agent takes all the decisions to be explained. However, within the IoT context, Cyber-Physical Systems (CPS) are decentralized, with multiple agents coordinating their decisions to control the overall CPS. By contrast, users expect coherent system-wide explanations, as if they were generated by a single agent. We propose a decentralized Explanation System generating such explanations while preserving the advantages of decentralized control: separation of concerns, heterogeneity, flexibility, etc. Our architecture relies on: i) decentralized XAI Component specialists for providing partial explanations based on local knowledge; ii) a central generic Spotlight composing local explanations into a global explanation. We illustrate and qualitatively evaluate our approach via a proof-of-concept implementation for a smart home system.
Keywords: Explanation · Cyber-Physical Systems · Decentralization
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 719–738, 2022. https://doi.org/10.1007/978-3-030-82193-7_48
1 Introduction
Recent years have seen an increase in the number of devices that are connected to the Internet or local networks so as to form Intelligent Systems offering a wide variety of capabilities. These include smart heaters, smart light bulbs, smart doorbells, smart doors, smart kitchen appliances and smart washing machines. Emerging communication technologies, such as 5G, pave the way towards the integration of such devices into ever more complicated intelligent Cyber-Physical Systems (CPS) and Systems-of-Systems offering more sophisticated functions to end users and stakeholders [25]. One of the most notable examples is the smart home, which integrates multiple devices for monitoring and controlling various aspects within individual houses. While the promises of these systems are many (e.g. better energy management, security and health monitoring [8]),
the adoption of smart home technologies is relatively slow compared to expectations, partly because of their apparent complexity and the opacity of their decisions [20]. Hence, the explanation of AI decisions has attracted the interest of academics and consumers. The number of publications citing keywords related to Explainable AI (XAI) has exploded as a result of the DARPA project on XAI in 2016 [1]: in 2020, around 350 papers containing “XAI”, “Explainable” or “Explanatory” in their metadata were reported on IEEE Xplore, up from fewer than 100 in 2015. This interest is motivated by various factors, including: i) concerns about the shift of decision-making responsibilities from humans to machines, which raises both trust [21] and legal concerns [7]; ii) the enforcement of stricter regulations, such as the European GDPR, regarding data use and decision making; and iii) the appeal of reducing maintenance costs by allowing systems to self-diagnose and provide relevant information to users in case of failure. However, despite the rapid growth of research in the XAI field, most Intelligent Systems do not yet benefit from its solutions. This is mainly because XAI focuses on centralized approaches which explain a single decision-making agent. To the best of our knowledge, no solution exists for adapting explanatory systems to the core characteristics of Intelligent CPS, especially within the IoT context where systems are typically decentralized organizations coordinating multiple controllers, each with its own internal logic, variables and objectives. Compared to centralized solutions, decentralized designs aim to bring more reliability, adaptability and scalability, as well as to facilitate dynamic device installation and integration into existing CPS. To preserve these advantages, explanatory systems should also adopt a decentralized design. Still, from a user’s perspective, explanation, as a cognitive process, is intrinsically sequential.
People analyze one fact after another, in a single centralized process that aims to reach some conclusion. This process seems incompatible with the decentralized CPS approach, where knowledge about the state of the system is dispersed among various components, organized in various configurations (see Fig. 1). Our approach aims to address this discrepancy between the user’s expectation of a centralized explanation and the reality of decentralized CPS knowledge. We propose to add a layer on top of an existing CPS. This layer comprises specialized (decentralized) XAI Components that observe the CPS, while their knowledge is used by a generic (centralized) Spotlight to dynamically generate explanations by following a generic algorithm. This algorithm, named D-CAS for Decentralized-Conflict-Abduction-Simulation, is an adaptation of an existing conflict-solving approach, CAN (Conflict-Abduction-Negation) [5]. The present contribution introduces the D-CAS algorithm and details the decentralized architecture enabling it. A sketch of the principles of this algorithm is as follows: Each XAI component only has access to the public variables of its associated CPS component. It can analyze this local knowledge to detect conflicts (i.e. discrepancies between local objectives and current state) and to find their possible
Fig. 1. A simple example of an intelligent cyber-physical system: a smart office is equipped with a temperature controller which monitors room temperature and can act on a heater and the window blinds to regulate it. Next to each component are shown concepts they have access to, thus illustrating how knowledge is distributed across the system. [Figure labels: Target temperature; Open/Close Blinds; Temperature; Power On/Off]
causes. XAI Components then report such local results to the central Spotlight component. When the user requires an explanation about an observed phenomenon, the Spotlight first identifies the XAI Component specialized in that phenomenon (i.e. attached to the CPS component that controls its defining variables). The Spotlight then activates this XAI Component, asking it to either: a) identify the problem and find a possible solution; or b) point to a potential cause in another XAI Component. In the latter case, the Spotlight activates the new XAI Component and the process starts over, recursively. The process continues until a solution is found, or until it runs out of options (a time limit could also be considered in future work). The Spotlight is generic in that it does not need direct access to specific system variables or the ability to analyze them. It does so indirectly, via specialized XAI Components. Hence, the Spotlight’s capabilities are minimal, serving as a generic registry and coordinator of XAI Components. The focus of this paper is to present the general reasoning of the D-CAS algorithm and the architectural features enabling it. We therefore consider the precise implementation of the XAI Components to be out of scope. For instance, although a naive implementation is used for a proof-of-concept demonstrator, we do not focus on how XAI Components infer possible causes of an observed conflict. The rest of this paper is organized as follows. Section 2 briefly reviews existing XAI approaches. Section 3 introduces two real-life examples that highlight the benefits and difficulties in explaining decisions within a smart house. Section 4 details our proposal to decentralize XAI and to provide an architecture suited to explain decisions made by a CPS. Section 5 illustrates the proposed capabilities
via a proof-of-concept implementation. Section 6 discusses the potential advantages of decentralized XAI solutions (e.g. scalability, maintainability, flexibility), as well as possible limitations and future developments.
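The coordination principle sketched in this introduction (the Spotlight looks up the XAI Component registered for a phenomenon, activates it, and follows any pointer to another component until a solution is found or options run out) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names, the `LocalResult` structure, the string-keyed registry and the example answers about the office of Fig. 1 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class LocalResult:
    """Hypothetical answer of one XAI Component: a local argument, plus an
    optional pointer to a phenomenon handled by another component."""
    argument: str
    next_phenomenon: Optional[str] = None


class XAIComponent:
    """Toy XAI Component answering only from its own local knowledge."""

    def __init__(self, name: str, answers: Dict[str, LocalResult]) -> None:
        self.name = name
        self._answers = answers  # local knowledge, never read by the Spotlight

    def explain(self, phenomenon: str) -> LocalResult:
        return self._answers[phenomenon]


class Spotlight:
    """Generic registry and coordinator; it holds no system variables."""

    def __init__(self) -> None:
        self._registry: Dict[str, XAIComponent] = {}

    def register(self, phenomenon: str, component: XAIComponent) -> None:
        self._registry[phenomenon] = component

    def explain(self, phenomenon: str) -> List[str]:
        trace: List[str] = []
        visited: Set[str] = set()
        current: Optional[str] = phenomenon
        while current is not None:
            # Stop when no component is registered or when we would loop.
            if current in visited or current not in self._registry:
                trace.append(f"no further explanation for '{current}'")
                break
            visited.add(current)
            component = self._registry[current]
            result = component.explain(current)
            trace.append(f"[{component.name}] {result.argument}")
            current = result.next_phenomenon  # None means: solution found
        return trace


# Hypothetical local knowledge for the smart office of Fig. 1.
blinds = XAIComponent("blinds", {
    "blinds closed": LocalResult(
        "the blinds were closed on request of the temperature controller",
        next_phenomenon="temperature too high"),
})
temperature = XAIComponent("temperature controller", {
    "temperature too high": LocalResult(
        "closing the blinds keeps the room below the target temperature"),
})

spotlight = Spotlight()
spotlight.register("blinds closed", blinds)
spotlight.register("temperature too high", temperature)

for line in spotlight.explain("blinds closed"):
    print(line)
```

In this sketch the Spotlight only manipulates phenomenon names and component references, which mirrors the claim that it needs no direct access to system variables; the `visited` set is one simple way to guarantee termination when components point back at each other.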
2 Background and Related Works
Cyber-Physical Systems (CPS) represent “multidisciplinary systems conducting feedback control on widely distributed embedded computer systems” [17]. As they are endowed with sensors and actuators, their computations affect and are affected by their physical environment. They often consist of numerous components, each with specific computational and/or physical capabilities. Smart houses are typical CPS examples, with various devices cooperating to meet user objectives. CPS control is usually decentralized, or hierarchical [6], as such designs improve flexibility, heterogeneity and ease of maintenance [11]. On the other hand, the XAI field [1,2] has focused on explaining decisions taken by a single agent. This area has emerged especially in recent years, following the DARPA project [1] and concerns about the application of the “right to explanation” outlined in the European GDPR [9]. As deep learning approaches provide black-box models, related XAI solutions aim to understand, visualize and motivate their decisions. Various strategies are used by XAI to explain AI decisions. Some approaches use “understandable models”, as opposed to black-box models: examples range from decision trees [23] to expert systems [26]. Others approximate advanced deep neural networks with decision trees in order to provide the user with an understandable graph explaining the motivations of the network classification [3]. Another notable XAI solution is LIME [24]. It is a model-agnostic method that estimates the black-box model’s predictions on points close to the original queried point in order to approximate the shape of the decision boundary in the neighborhood of these points. It calculates the most discriminating factors used to classify the points. Other approaches follow the same principle, providing post-hoc explanations of local variations [14,18]. All these XAI techniques share a common centralized approach to explanation.
The same approach is found in studies of explanation as a cognitive process. When looking for an explanation for an observed phenomenon, one tends to focus successively on one element after another, in a sequence [16,19]. Our mind processes this information to form a coherent explanation, which aims to resolve an apparent conflict between an observation and an expectation [19,22]. Based on this last consideration, we consider explanation as a conflict resolution process where successive hypotheses and their consequences are presented as arguments. One way to generate such dialogues is a process called Conflict-Abduction-Negation (CAN) [5]. This approach identifies four generic steps, which are applied sequentially (see Fig. 2):
1. Conflict detection – discovers a conflict and associates it with a numerical value, called necessity, to quantify its intensity.
2. Abduction – identifies the most probable causes for the identified conflict.
3. Negation – considers an alternative world where the conflict is missing and evaluates potential consequences. This step is similar to wondering what would happen if things were different. Pearl notes that this ability is unique to humans and central to their ability to reason [22]: they model their environment and play alternative possibilities to observe their consequences and thus discover or validate causal relations.
4. Solution – solves a conflict by revising knowledge (i.e. the conflict was a false positive), acting on the world (i.e. changing the conflicting state) or abandoning the conflict (i.e. resolution failure, yet avoiding blockage or infinite loops). Solving a conflict can eventually cause the emergence of another conflict, thus indirectly propagating the conflict.
Fig. 2. The different steps of CAN. When a conflict is detected in the system, CAN can trigger processes of abduction, negation or solution (Green arrows). These processes propagate the conflict to a new one (Regular red arrows) which is in turn examined in a new iteration. Note that the ‘solution’ step can also possibly propagate the conflict to a potential consequence (Dashed red arrow).
This original CAN approach is centralized: all data and rationales are examined within one process. Adding a component with these capabilities to an existing CPS is possible, yet would introduce several drawbacks. Centralizing all system data in a single component may impede scalability and hinder maintenance and reliability. It would also require updating data as components change over the lifetime of the CPS (e.g. installing new devices in a smart home).
3 Smart Home Scenarios
We illustrate the advantages of CPS being able to explain their behavior via two examples from the smart home domain. In both examples, we consider how the Conflict-Abduction-Negation (CAN) approach would handle the generation of an explanation. Since the purpose is merely to illustrate the rationale, architectural and implementation concerns are not discussed in this section. In a first example, we consider a phenomenon that occurs periodically in one of the authors’ “smart” offices: without notice or apparent reason, the control system decides to close or open the window blinds (Cf. Fig. 1). Even if this event seems meaningless, its repetition and opaque reasons leave most occupants rather
perplexed: was closing the blinds a deliberate action to control room temperature or manage lighting? Or was it a mere sensor failure? Without further indication, users cannot know which support to contact, or which settings to change to fix the problem. Providing explanations would increase user acceptance of this CPS and facilitate interventions in case of component failures. In this example, the Conflict-Abduction-Negation (CAN) approach would proceed as follows. Firstly, it detects the conflict: the window blinds closed when the user was not expecting it (e.g. daytime but no direct sunlight). Then, it attempts to find a cause for this conflict via abductive reasoning, or tries to negate the conflict and see what the consequences would be. Negation in our case implies investigating what would happen if the blinds remained open. Indeed, another conflict would occur in such a case: the room temperature would become too high. This temperature conflict is more intense to the user than the blinds being closed, so CAN reverses its previous negation (i.e. the blinds are closed again) and goes back to the initial conflict about the blinds. However, having already seen that opening the blinds would result in a more severe conflict, CAN now discards this solution and, having nothing better to do, abandons the blind conflict. This line of reasoning can be presented as an explanation to users, as it is comparable to what rational humans would understand when faced with this situation: “The blinds were closed to avoid heating the room, which would be a bigger problem than poor natural lighting.” A second example illustrates a smart home scenario that is more difficult to explain. Two controllers operate within a room to achieve different objectives; they are unaware of each other. This case may occur when the user installs a new smart device in an existing smart home environment.
Concretely, a smart home equipped with a temperature control system integrates a new ventilation controller operating the window. During a cold day, this extended installation may lead to conflicts and potentially surprising decisions (e.g. the two controllers opening and closing the window sequentially to achieve their respective objectives). When the window is open and the user needs an explanation for the cold indoor temperature, we can once again employ the CAN process. Here, CAN identifies the cold temperature as a conflict with the user’s warm temperature objective. Then, by abduction, it propagates the conflict to the potential cause of the cold temperature – the window. It consequently raises a new question on why the window is open, wondering what would happen if the conflict were negated, and observing that the room would be inadequately ventilated. If, this time, the user considers the lack of ventilation conflict to be more bearable than the cold, then CAN would stop there, having found a solution to the initial conflict. The window can be closed and the problem solved. Its explanatory trace, when translated into natural language, would be “I infer that the cold comes from the fact that the window is open. If I close it, it puts an end to the question of the cold, but raises a minor problem concerning the ventilation of the room.” Presented with this explanation, the user is able to make an informed decision, and decide whether she wants to close the window.
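To make the two traces above concrete, the following Python sketch replays them with a deliberately simplified, centralized version of the CAN reasoning. It is an illustration under stated assumptions, not the algorithm of [5]: the `Knowledge` structure, the single round of negation, the rule "undo the cause only if the resulting conflict has a lower necessity" and all numerical necessity values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Knowledge:
    """Hypothetical centralized knowledge base used by the sketch."""
    necessity: Dict[str, float]  # intensity of each conflict for the user
    cause_of: Dict[str, str] = field(default_factory=dict)  # abduction
    conflict_if_negated: Dict[str, str] = field(default_factory=dict)  # negation


def can_explain(conflict: str, kb: Knowledge) -> List[str]:
    """Return an explanatory trace for `conflict` (simplified CAN)."""
    trace = [f"conflict: {conflict}"]
    target = conflict
    seen = {conflict}
    # Abduction: propagate the conflict to its most probable cause.
    while target in kb.cause_of and kb.cause_of[target] not in seen:
        target = kb.cause_of[target]
        seen.add(target)
        trace.append(f"abduction: probably caused by '{target}'")
    # Negation: what new conflict would appear if `target` were undone?
    side_effect = kb.conflict_if_negated.get(target)
    if side_effect is None:
        trace.append(f"solution: undo '{target}'")
        return trace
    trace.append(f"negation: without '{target}' we would get '{side_effect}'")
    # Solution: act only if the new conflict is less intense than the old one.
    if kb.necessity[side_effect] < kb.necessity[conflict]:
        trace.append(f"solution: undo '{target}', accepting '{side_effect}'")
    else:
        trace.append(f"abandon: '{conflict}' is the lesser problem")
    return trace


# Second scenario: cold room, open window, ventilation (values are invented).
window_kb = Knowledge(
    necessity={"room is cold": 3.0, "poor ventilation": 1.0},
    cause_of={"room is cold": "window is open"},
    conflict_if_negated={"window is open": "poor ventilation"},
)
# First scenario: closed blinds versus overheating (values are invented).
blinds_kb = Knowledge(
    necessity={"blinds are closed": 1.0, "room is too hot": 3.0},
    conflict_if_negated={"blinds are closed": "room is too hot"},
)

window_trace = can_explain("room is cold", window_kb)
blinds_trace = can_explain("blinds are closed", blinds_kb)
for line in window_trace + blinds_trace:
    print(line)
```

With these invented necessity values, the window scenario ends with a solution (close the window and accept weaker ventilation) while the blinds scenario ends with the conflict being abandoned, matching the two narratives above. Reversing the relative necessities would flip both outcomes, which is why the user's preferences matter in the explanation.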
In these two examples, we put aside any architectural concern and assumed that CAN had access to all the knowledge about the system components and their physical environment (i.e. centralizing all knowledge within a single component). In the following section, we detail how our decentralized architecture allows local knowledge to remain within local control components, while being able to generate the same global rationale as the centralized CAN presented above.
4 Decentralizing Explanatory Reasoning
4.1 Solution Overview
In this paper, we assume that CAN’s four basic steps, i.e. conflict detection, abduction, solution and negation, can be performed on local components. Based on this assumption, an overall explanation can be composed from local argumentation. The proposed composition process relies on a coordinator that triggers local explanatory components and assembles their results. Hence, this central component is, in terms of knowledge, minimalist and generic, as it does not have access to the knowledge possessed by the local components, but rather only points to them. A global overview of our architecture is presented in Fig. 3.
[Figure 3 diagram. Layers, top to bottom: Spotlight, XAI Components, Cyber-Physical Components, Physical Environment, with a Simulator alongside. Arrow labels: calls for explanation, reports, calls, observations, actions, monitoring.]
Fig. 3. The layer organization of our architecture, and communications between layers.
While we are aware that more complicated cases exist and may require system-level coordination, we consider that our assumption holds in numerous situations, which can hence be explained by composing local arguments. Our intention here is to show that a decentralized version of CAN [5] can be suitable for CPS configurations. We only focus on the architectural and coordination aspects at the system level which ensure the advantages of a decentralized architecture while offering a centralized view to the user. Questions about how local steps are implemented (e.g. conflict detection, abduction, simulation), are beyond the scope of this paper. Indeed, numerous research proposals focus on these aspects precisely and can be adopted within our general framework in future work (Cf. Sect. 6).
É. Houzé et al.
The generic smart home setup in which our architecture takes place is as follows. A single physical environment is equipped with various sensors, actuators and controllers. The organization of these components can be purely decentralized, or hierarchical [6]. Since our approach aims to be organizationally agnostic, this distinction is not further discussed here. We call physical variables the set of variables that describe the current state of the physical environment (e.g. temperature, luminosity, time and energy consumption). We also include here internal variables describing the current state of the CPS components. In Fig. 3, we show how these physical variables are monitored by our added Explanatory System. The Physical Environment is monitored and controlled by Cyber-Physical Components, which can perform actions to modify it; these components are in turn observed by XAI Components (XAICs). XAICs are then able to report conflicts to the coordinator of the Explanatory System, which we call the Spotlight. In addition, a simulator makes it possible to play out counterfactual scenarios during the explanation generation [22]. A concrete example implementation of the system in a room equipped as such is presented in Fig. 4.
[Figure 4 diagram. Labels: User (questions, explanations), Spotlight, Simulator; CPS components: Temperature Controller, Heater Controller, Thermo Controller, Window Controller; XAICs: XAI TempCtrl, XAI Heater, XAI Thermometer, XAI Window.]
Fig. 4. Our architecture implemented for the window blinds example discussed in Sect. 3. Each CPS component (Green) is paired-up with an XAIC (Red). Each XAIC is triggered by and reports to the spotlight. The simulator can be accessed by any XAIC, if need be (Communications not shown for clarity).
4.2 A Decentralized Knowledge
Each XAIC is coupled to a CPS component, be it a physical device or a software controller. Then, each XAIC can observe the state of its attached CPS component via an access to its exposed variables while having no access to its internal logic. Hence, XAICs are specialized in particular kinds of CPS Components (i.e. depending on the types of objectives they manage) without depending on their actual implementation.
From observing its attached CPS component, an XAIC is able to form and evaluate Boolean predicates, which can be thought of as words describing the state of that component. For example, an XAIC connected to a room temperature controller can formulate and evaluate predicates such as hot (room), cold (room), failure (temperature controller). It is therefore important to note that a given XAIC only has access to part of the description of the entire system, i.e. the part directly concerning its attached CPS component. Thus, knowledge of the system state is dispersed across the set of XAICs, following the same organization as that of the CPS components.
To be able to detect conflicts, which is one of the necessary steps for CAN, XAICs need to compute a prior opinion about their predicates. Indeed, we define a conflict similarly to [5], i.e. as a situation in which the observation of a predicate’s value contradicts a prior opinion regarding this predicate. This opinion is encoded into a number, called the necessity, whose sign indicates whether the prior opinion is that the predicate is true or false. In addition, the absolute value of the necessity, which we call the intensity, encodes the strength of this prior opinion. A conflict then occurs when the prior opinion about a predicate contradicts its value observed by the XAIC. For instance, if cold (room) is valued true while its necessity is evaluated to be −20, a conflict of intensity 20 is detected.
4.3 A Unifying Algorithm: D-CAS
To supervise the XAICs and ensure the necessary coordination between them, the Spotlight acts as a minimalist generic coordinator. It implements a variant of CAN, which we call D-CAS, for Decentralized Conflict-Abduction-Simulation, to highlight the two main differences with CAN: 1) decentralizing knowledge and reasoning, and 2) introducing a CPS simulator to play-out negation in alternative worlds (i.e. before actuating it in the real CPS). The role of the Spotlight is to ensure that the relevant XAICs are called during an explanation process for building a coherent system-wide explanation. To integrate within the explanatory system, each XAIC registers with the Spotlight; and then, via a background process, reports any detected conflicts and their necessity. To achieve decentralized reasoning, we keep this Spotlight as simple and as generic as possible. Notably, the Spotlight has no access to predicates or to their meanings; it only holds pointers to the XAICs and keeps track of which ones are reporting conflicts. The D-CAS algorithm is detailed in Algorithm 1. It periodically examines whether XAICs signal any conflicts in the system (line 4). If a conflict is reported, the Spotlight asks the responsible XAIC to handle it. To do this, the responsible XAIC is called successively with the methods doAction, abduction and giveUp. Each of these methods corresponds to a basic step that we identified in the original CAN process, and is executed by the XAIC on its local knowledge:
Algorithm 1: D-CAS (spotlight)
Result: A Tree object representing the explanation reasoning
 1  Initialization
 2  Tree ← RootNode()
 3  while not interrupted by user do
 4      currentAI ← findAIBiggestConflict();
 5      while currentAI ≠ null do
 6          Tree.append(Node(currentAI));
 7          decisionNode ← currentAI.doAction();
 8          nextAI ← null;
 9          if decisionNode = null then
10              nextAI, decisionNode ← currentAI.abduction();
11              if nextAI = null then
12                  decisionNode ← currentAI.giveUp();
13          if nextAI ≠ null then
14              currentAI ← nextAI;
15          Tree.append(decisionNode);
16  return Tree;
1. DoAction: covers the negation aspect of CAN. The XAIC checks whether it knows how to reverse the current conflict (i.e. perform an action that removes the conflicting state). If a solution is possible, it tries to simulate it. A signal is then sent to all other XAICs so that they consider the simulation results and detect potential conflicts within their control scope. If one of them detects a conflict, it reports it to the Spotlight, which will eventually give it the floor. If the currently activated XAIC does not know any solution to resolve the conflict, it does nothing.
2. Abduction: the XAIC tries to infer a potential cause for the current conflict. To find a suitable cause, it can look into its own local knowledge, or listen to broadcasts of other XAICs which would report events or abnormal states. Once such a cause is found, it sends a signal to the XAIC handling it, so that it considers the cause as its own conflict. To do so, the XAIC responsible for the cause will set its necessity to a value which will cause a conflict. Finally, the method returns a pointer to the XAIC responsible for that cause, or a null pointer if no cause was found.
3. GiveUp: in case nothing better was found, either because no solution exists for now, or because the conflict necessity was too small to consider causes of greater necessities, the XAIC can simply discard the conflict by changing its necessity to the opposite sign.
To check the consequences of alternative scenarios (Cf. doAction step), we introduce a simulator that models the physical environment and its interactions with the CPS sensors and actuators. This simulator is supervised by the Spotlight, is based on digital twins of the CPS components, and is updated based on the knowledge of the system [4].
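The coordination loop of Algorithm 1 and the three local methods can be sketched in Python as follows. This is only an illustrative reading, not the authors' implementation: `ToyXAIC`, its fields, and the shortcuts inside its methods are hypothetical stand-ins for the local steps, the loop stops when no conflict remains (rather than on a user interrupt), and the inner loop is assumed to end as soon as abduction designates no further XAIC.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """One node of the explanation tree."""
    label: str
    children: list["Node"] = field(default_factory=list)

    def append(self, child: "Node") -> "Node":
        self.children.append(child)
        return child


@dataclass
class ToyXAIC:
    """Hypothetical stand-in for an XAIC; real local steps are out of scope."""
    name: str
    intensity: float = 0.0               # intensity currently assigned to its predicate
    conflict: bool = False               # whether it currently reports a conflict
    action: Optional[str] = None         # action known to reverse the conflict, if any
    cause: Optional["ToyXAIC"] = None    # XAIC owning a known potential cause, if any

    def do_action(self) -> Optional[Node]:
        if self.action is None:
            return None
        self.conflict = False            # toy shortcut: the simulated action removes the conflict
        return Node(f"action: {self.action}")

    def abduction(self) -> tuple[Optional["ToyXAIC"], Optional[Node]]:
        # Propagate only towards a cause holding a strictly lower intensity.
        if self.cause is None or self.cause.intensity >= self.intensity:
            return None, None
        self.cause.intensity = self.intensity
        self.cause.conflict = True
        self.conflict = False            # toy shortcut: the conflict is handed over
        return self.cause, Node(f"abduction: {self.cause.name}")

    def give_up(self) -> Node:
        self.conflict = False            # toy shortcut for discarding the conflict
        return Node("give up")


class Spotlight:
    """Minimalist coordinator: it only holds references to registered XAICs."""

    def __init__(self) -> None:
        self.xaics: list[ToyXAIC] = []

    def register(self, xaic: ToyXAIC) -> None:
        self.xaics.append(xaic)

    def find_biggest_conflict(self) -> Optional[ToyXAIC]:
        reporting = [x for x in self.xaics if x.conflict]
        return max(reporting, key=lambda x: x.intensity, default=None)

    def explain(self) -> Node:
        tree = Node("root")
        # Assumption: stop when no conflict remains instead of waiting for the user.
        while (current := self.find_biggest_conflict()) is not None:
            while current is not None:
                xaic_node = tree.append(Node(current.name))
                next_ai = None
                decision = current.do_action()
                if decision is None:
                    next_ai, decision = current.abduction()
                    if next_ai is None:
                        decision = current.give_up()
                xaic_node.append(decision)
                current = next_ai        # assumption: no next XAIC ends the chain
        return tree


# Toy run: a temperature conflict is traced back to a window that knows an action.
window = ToyXAIC("XAI Window", action="close blinds")
temp = ToyXAIC("XAI TempCtrl", intensity=30.0, conflict=True, cause=window)
spotlight = Spotlight()
spotlight.register(window)
spotlight.register(temp)
tree = spotlight.explain()
```

Note that the Spotlight in this sketch never inspects predicates; it only ranks reported intensities and forwards calls, which mirrors the minimalist role described above.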
4.4 Generating an Explanation
Throughout the D-CAS algorithm, necessity is used to avoid loops and prioritize the importance of conflicts. When a conflict propagates through abduction, necessity is also propagated towards the cause. This reflects the following consideration: “If I don’t want A with an intensity 10, and I know that B is a cause of A, then I don’t want B with an intensity 10”. In the process, if a predicate has already been assigned a necessity, it can only be affected by considerations with a necessity of a strictly larger absolute value. Restricting conflict propagation to predicates whose assigned necessity has a strictly lower intensity guarantees that D-CAS cannot loop indefinitely and always terminates by reaching a state where no conflict is present in the system. This termination, however, does not guarantee that D-CAS always succeeds in finding an explanation: as we will see in Sect. 6, it is very possible, and even desirable, that the system terminates by giving up on conflicts.
The output of the D-CAS algorithm is not its final conflict-free state. Rather, it is the process that led to this state that is interesting. To best represent this process, D-CAS dynamically builds a tree at every step. Each of the aforementioned doAction, abduction and giveUp methods returns a node or a leaf of the tree describing the option explored by the algorithm (line 6). Then, this node or leaf is appended to a node representing the XAIC active at this time (line 15). From this tree output, a simple parser is then able to reconstruct a “readable” explanation comprising the causal links from the observed phenomena to their causes.
Figure 5 shows how D-CAS processes the window blinds example from Sect. 3. Figure 5a details the first calls that occur during the explanation process of the window blinds example. In step (01), during a periodic call, XAI window-1 informs the Spotlight that it detected a conflict of a given intensity. In step (02), the Spotlight calls the XAIC to try to find a solution via an action, or to propagate the conflict via abduction. An action can directly revert the conflict, so XAI window-1 simulates this action (03), and simultaneously alerts the other XAICs to listen to simulation results (04). Feedback is returned to the Spotlight, informing it that an action was simulated. In step (08), after reading values from the simulation (06-07), XAI TemperatureController reports a conflict to the Spotlight. The Spotlight thus calls on XAI TemperatureController to handle this new conflict (09). Abduction finds that the cause might be in XAI window-1, and the conflict is propagated (10). The Spotlight now interrogates XAI window-1 (12). The process then goes on and ends up generating the lower branch of the tree graph of Fig. 5d.
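The necessity rule that bounds propagation in this section can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the function name `propagate` and the dictionary-based storage are hypothetical, and only the strict comparison of intensities comes from the text.

```python
def propagate(necessities: dict[str, float], cause: str, incoming: float) -> bool:
    """Assign `incoming` to `cause` if allowed; return True when the assignment happened."""
    current = necessities.get(cause)
    if current is not None and abs(incoming) <= abs(current):
        return False                 # not strictly stronger: the predicate is left untouched
    necessities[cause] = incoming
    return True


# Window blinds example: the hot(room) conflict (necessity -30) reaches a
# predicate previously held with necessity 20, so the propagation is accepted.
blinds = {"openBlinds(window-1)": 20.0}
blinds_updated = propagate(blinds, "openBlinds(window-1)", -30.0)

# Ventilation example: a consideration of intensity 10 cannot re-examine a
# predicate already held at -30 (the sign plays no role in this check).
window = {"open(window-1)": -30.0}
window_updated = propagate(window, "open(window-1)", 10.0)
```

Because an assignment is accepted only when the intensity strictly increases, each predicate can be reassigned a bounded number of times, which is the intuition behind the termination argument above.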
Fig. 5. A typical explanation rationale and its output. (a) shows the calls between components of the explanation layer during an explanation process. Dashed arrows indicate periodic calls in background processes; regular arrows indicate punctual calls during an explanation process. For readability reasons, some steps have been omitted from this figure. Outputs corresponding to different steps of the same rationale are depicted in (b), (c), (d). Blue nodes denote XAICs interrogated in the process, with the examined conflict, its current value and necessity written below. These nodes can be followed by either an attempt to end the conflict (via an action or by giving up), depicted by green nodes, or by propagating the conflict (via abduction) to another conflict (yellow node). Figure 5(b) corresponds to the state of the reasoning after call (02) of Fig. 5(a), Fig. 5(c) comes after call (09) and Fig. 5(d) represents the final output.
5 Implementation and Results
We implemented the proposed Explanatory Engine on top of a CPS simulator for smart homes – iCASA [15], which is a Java-based service-oriented component platform. iCASA models a physical home environment including temperature and luminosity, which can be controlled via specific devices, such as heaters and lamps. It allows adding and removing devices at runtime, without stopping the platform. iCASA simulates time via discrete steps (of configurable sizes).
With respect to our layered architecture (Cf. Fig. 3), iCASA implements the two lowermost layers, i.e. physical environment and CPS control. We added the two upper XAI layers (XAI components and Spotlight), coded in Python. To integrate new devices and/or controllers into the XAI system, the Spotlight is notified whenever such new components are added to iCASA. Then, the Spotlight creates a new XAI component (XAIC) and attaches it to the new CPS component. The newly created XAIC is initially generic. It then specializes by considering the values that it reads from its attached CPS component. From these values, the XAIC creates a vocabulary of predicates that it will be able to evaluate. In the current implementation, this specialization relies on the XAIC recognizing the CPS component’s type, then fetching the standard predicates for that type from a common registry. This basic implementation could later be replaced by more sophisticated procedures, e.g. learning predicates dynamically by observing trends and patterns in the CPS component’s values. Conversely, the XAICs are removed when the corresponding CPS components leave iCASA. This seamless integration is possible thanks to the autonomic manager of iCASA, which exposes all relevant information to our components and dynamically handles the publication of services.
Since the aim of this prototype is only to serve as a proof-of-concept for the decentralization of explanatory reasoning via our D-CAS algorithm, the local steps required in the XAICs are implemented naively. More precisely, the predicates known to XAICs correspond to simple predefined threshold comparisons of certain variables, and their necessities are either predefined or can be directly updated by the user at runtime. Similarly, the implemented XAIC’s abduction process employs a simple event-based method. When a conflict occurs, all previous events are considered as hypotheses, sequentially, in their inverse order of occurrence (limited by a maximal time length). An associated score is computed based on the event’s origin component and time of occurrence.
The simulator component is implemented by using another instance of iCASA, controlled by the Spotlight. Upon request from one of the XAICs, the Spotlight initializes this second instance with the current observations of the house, then runs it with a faster time step setting than the original iCASA instance. All constants in our implementation, mainly time constants, were given arbitrary values so as to provide a system that is both easy to use and to run. For instance, the base iCASA simulation runs 60 times faster than real-time, while the simulator instance runs 10 times faster. Similarly, call periods for XAICs are all set to 1 s, so as to provide frequent updates without risking demanding too much of the machine, and explanations occur at a constant 1 s per step, so as to give the user some time to apprehend each step.
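The naive local steps described above (threshold-based predicates and the sign-based conflict test) might look as follows in Python. This is a sketch for illustration only: the `Predicate` class, its field names and the 18 °C threshold are assumptions, not taken from the prototype's code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Predicate:
    """A Boolean predicate evaluated by thresholding one exposed variable."""
    name: str
    test: Callable[[float], bool]   # predefined threshold comparison
    necessity: float = 0.0          # sign: prior opinion; absolute value: intensity

    def conflict_intensity(self, reading: float) -> float:
        """Return the intensity of the conflict for this reading, or 0.0 if none."""
        if self.necessity == 0:
            return 0.0              # no prior opinion, hence no possible conflict
        observed = self.test(reading)
        wanted = self.necessity > 0
        return abs(self.necessity) if observed != wanted else 0.0


# Hypothetical threshold: the house counts as cold below 18 degrees Celsius.
cold_house = Predicate("cold(house)", test=lambda t: t < 18.0, necessity=-30.0)

at_12_degrees = cold_house.conflict_intensity(12.0)   # cold is true but unwanted
at_21_degrees = cold_house.conflict_intensity(21.0)   # cold is false, as wanted
```

Since the necessity is a plain attribute here, updating it at runtime (as the demonstrator lets the user do) amounts to assigning a new value before the next periodic check.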
Fig. 6. The setup for the ventilation scenario in our demonstrator. The left-hand side shows the house view, including the CPS components. The right-hand side shows a list of the corresponding XAICs, with their known predicates. Some of the predicates have predefined necessities, accounting for user preferences.
Our demonstrator runs on a machine with standard laptop hardware: a CPU equivalent to an Intel i7-8565U and a RAM capacity of 8 GB allow running 5 parallel complete instances on a server. This server implementation is available through a web interface at https://explainableai.fr/smarthome/ (demonstrator access identifiers: login: reader, password: xai; see Fig. 6). The rest of this section details how our proof-of-concept implementation plays the two motivating examples presented in Sect. 3.
5.1 The Window Blinds
As introduced in Sect. 3, this first example aims to answer a typical question arising in the authors’ office: why have the window blinds just shut? We simulate in our demonstrator a room with a basic temperature model which takes into account whether or not the sun shines onto the building. We connect our decentralized architecture implementation on top of this CPS, by pairing up one XAIC with each CPS controller and adding a Spotlight. Figure 4 depicts the resulting organization.
In this situation, the window blinds close to prevent the room from overheating. Despite the knowledge of this situation being scattered across different CPS components (Fig. 1), the system generates an explanation. The calls and trees depicted in Fig. 5 show the explanatory process operated by D-CAS in this configuration. These outputs provide the following insights into the decentralized reasoning process.
In a first step (Fig. 5b), the Spotlight calls the XAIC of the window (‘XAIC-window-1’, depicted via a blue node in the graph) to explain the conflict detected on the window blinds (‘openBlinds(window-1):False,20’, shown in red to signify conflict). This conflict is represented via the ‘openBlinds’ predicate, applied to window-1, which has the value False with necessity 20. The fact that this predicate is false, while being desirable to the user, as indicated by its positive necessity, is perceived as a conflict.
In the second step (Fig. 5c), the window XAIC attempted to solve the conflict by applying the doAction method of the D-CAS procedure: it found an action for reversing the conflictual state (i.e. opening the blinds) and played out the consequences in the simulator (green node in the graph). The result of the simulation shows that while the predicate ‘openBlinds (window-1)’ becomes true, another conflict is raised as the ‘hot (room)’ predicate, which has a necessity of −30, also turns true.
In the third and final step (Fig. 5d), the XAIC of the Temperature Controller follows the D-CAS procedure itself. doAction is not applicable here, as the room temperature has no direct actuator (i.e. temperature cannot directly be switched from hot to cold). Hence, the XAIC temperature controller proceeds with the second step, abduction (yellow node). It finds a possible cause for the temperature being hot: the window blinds are open. Once again, applying the D-CAS procedure to this new conflict implies executing the first action step, hence attempting to close the blinds (making ‘openBlinds’ False again). As the absolute necessity for the ‘hot’ room predicate (30) is higher than the necessity of the ‘openBlinds’ predicate (20), XAIC temperature controller can proceed with this action and override the previous action of the XAIC window-1. The ‘openBlinds’ predicate becomes False, with necessity −30 (which, when negated, is equivalent to having the blinds closed with necessity 30). As no more causes and actions can be found at this point, the reasoning process ends in this compromise state, giving priority to the user’s preferred objective (temperature).
Importantly, all XAIC actions were performed in simulation, without further disturbing the home occupants. While the system does not directly output a textual description, one can read the final tree graph output as: “The blinds were closed to prevent the room temperature from becoming too hot”. This output is similar to the expected explanation for this scenario, as presented in Sect. 3, and is relevant to the user in the sense that it enables her to make an informed decision regarding the window blinds.
5.2 The Ventilation Monitoring System
This example answers the user’s question ‘why is it cold?’. This occurs after the user has installed a new ventilation system in the smart house, which was already equipped with a thermostat (Cf. Sect. 3). Figure 6 depicts the implementation setup for the ventilation scenario, as provided by our demonstrator’s graphical interface. On the left-hand side, it shows the iCASA home simulation, with deployed smart devices: thermometer, thermostat (or temperature controller), heater, window and ventilation controller. On the right-hand side, the demonstrator shows the available XAICs with their associated predicates. For instance, the XAI ventilation controller has a predicate ‘ventilated’ applied to the ‘house’, which is currently True, with necessity 20. This indicates that the house is currently ventilated, which is a desirable state for the user (positive necessity).
The new Ventilation Controller can open and close the window to maintain air freshness in the room at a desired level. Since the outdoor temperature is cold (12 °C), opening the window for ventilation triggers a conflict with the Temperature Controller’s objective. This conflict can be seen in Fig. 6, as the predicate cold (house) is true, while having a negative necessity of −30 (implying an undesirable state for the user). The Temperature Controller reports this conflict to the Spotlight. The user can launch an explanation request via the EXPLAIN button of the graphical interface. Currently, this triggers the explanation process for the conflict with the highest intensity, ‘cold (house) : true, −30’ in this case. Still, the user may also change predicate necessities manually via the interface, so as to request explanations about any other conflict. Consequently, the XAI system starts the D-CAS algorithm to inquire into the selected conflict. The resulting tree graph, depicted in Fig. 7, shows the successive steps that the D-CAS process went through.
Fig. 7. The output of the ventilation scenario. The black node indicates that the XAI component realized its mistake, and tries another approach. (An interactive version is available at https://explainableai.fr/trees/tree ventilation.html)
The reasoning of the XAI system goes as follows. Firstly, as in example 1, the XAIC of the Temperature Controller cannot apply the action step of D-CAS, as there are no direct actuators on temperature. Hence, it proceeds with the abduction step and finds a possible cause in the window being open. It tests this hypothesis by closing the window in the XAI Simulator. This, in turn, creates a new conflict in the Ventilation Controller, on its ventilated predicate. Hence, the Spotlight activates the Ventilation Controller to investigate this new conflict. Again, as ventilation provides no actuators, the abduction step is executed next. The heater (in ‘on’ state) is proposed as a potential cause first, yet this hypothesis is invalidated via the XAI Simulator, which shows that switching the heater off does not solve the ventilation conflict. This makes XAI ventilation realize that this lead was wrong (depicted as a black node in the output graph) and triggers the search for another cause. This outcome was a consequence of the naive abduction implementation within this demonstrator, yet it did show how the D-CAS algorithm can cope with wrong leads, and can provide feedback information to help abduction perform better on the next inquiry.
The next potential cause identified via abduction concerns the window being closed. However, the necessity of the ventilation conflict, −10, is insufficient for allowing the re-examination of the open (window-1) predicate, which was given necessity −30 by the previous abduction of the Temperature Controller. Hence, as the XAIC ventilation-controller fails to find any other satisfying cause, it gives up the ventilation conflict.
The final rationale output can be literally translated into a textual explanation as follows: “It is cold because the window is open. However, closing the window would cause a conflict in the ventilation controller, regarding room ventilation. I do not find a convincing solution for this latter conflict, with its current intensity. Giving it a higher intensity might enable another solution.”. This output matches the answer we expected when this example was first discussed in Sect. 3. Even though the abduction module provided a wrong answer, the D-CAS algorithm was able to make up for it and still provide the expected explanation.
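To illustrate how a tree output could be turned into sentences such as the ones quoted above, the following Python sketch walks a tree depth-first and applies one phrase template per node type. Everything here is hypothetical (the `TreeNode` class, the node kinds and the templates); the paper does not specify the parser beyond calling it simple.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    kind: str                     # assumed kinds: "root", "conflict", "abduction", "action", "give_up"
    text: str = ""
    children: list["TreeNode"] = field(default_factory=list)


# Hypothetical phrase templates, one per node kind.
TEMPLATES = {
    "conflict": "There is a conflict on {text}.",
    "abduction": "A possible cause is {text}.",
    "action": "Simulated action: {text}.",
    "give_up": "No convincing solution was found for {text}.",
}


def render(node: TreeNode) -> list[str]:
    """Depth-first traversal producing one sentence per recognized node."""
    lines = []
    template = TEMPLATES.get(node.kind)
    if template is not None:
        lines.append(template.format(text=node.text))
    for child in node.children:
        lines.extend(render(child))
    return lines


# A hand-built tree loosely following the ventilation scenario.
ventilation_tree = TreeNode("root", children=[
    TreeNode("conflict", "cold(house)", [TreeNode("abduction", "open(window-1)")]),
    TreeNode("conflict", "ventilated(house)", [TreeNode("give_up", "ventilated(house)")]),
])
sentences = render(ventilation_tree)
```

Keeping the templates in a dictionary makes the phrasing easy to adapt without touching the traversal.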
6 Discussions and Future Works
Our proof-of-concept implementation showed that the system is able to provide explanations relevant to the user for the discussed scenarios occurring in a smart home setup. It also runs on easily available hardware, which suggests that an implementation on lighter, embedded devices can be considered. This validates our initial approach of decentralizing the explanatory rationale, but also raises several future challenges.
The first major concern is the quality and existence of the provided explanation. While D-CAS is guaranteed to terminate, we have no guarantee regarding the quality of the generated explanation. For this proof-of-concept, we only evaluated generated explanations against subjective expectations for the implemented scenarios. Future developments will aim to use more objective measures, although defining such measures has itself been found to be a difficult task [27].
Explanations generated by D-CAS rely heavily on how the local steps (i.e. abduction, conflict detection, and simulation) are implemented on the different XAICs. Future research will therefore focus on further defining requirements for these steps, and providing possible implementations. For instance, abduction is at the core of our rationale, being the main conflict propagation mechanism. While abduction is a complex subject, we provided only a simple abduction reasoning in our implementation. The ventilation example showed the limit of this abduction method, which wrongly inferred a cause in the heater. For more realistic approaches, we consider refining this method: events and their importance relative to the current conflict can be extracted from the time series from the components using a co-compression measure [13]. We can further expand this by allowing the broadcast of aggregated variables by XAICs to notify neighbors of the most important changes.
Similarly, the evaluation of predicates and of their corresponding necessity can be implemented using more advanced methods. For instance, necessity could be inferred from a statistical analysis of previous measures and user’s choices, while predicates could be dynamically generated at runtime by observing the CPS component’s values. Using better local steps will improve the quality of the generated explanations, as well as expand the range of situations handled by the system.
A typical topic for the design of Intelligent Systems is their ability to scale. Thanks to the one-to-one pairing between XAI and CPS components, the actual size of the XAI system grows linearly with the size of the CPS. Moreover, the number of steps in our explanation process is not impacted by the number of XAI components that are not relevant to the conflict under investigation. This was the case in the examples we illustrated. The main scalability challenge thus concerns the abduction and simulation processes, which are left for further research, and are no worse in the decentralized process than in a central one [10,12].
Another typical difficulty for XAI solutions targeting CPS is dealing with the dynamicity of devices and controllers, which may join and leave the platform at runtime. Our architecture supports the dynamic discovery and integration of new CPS components, by adding and configuring generic XAI counterparts during runtime. Namely, to fit our architecture, XAICs must implement the three CAN steps: conflict detection, abduction and simulation. They must also be configurable with local knowledge (predicates), specific to each CPS component. Our current XAI demonstrator supports the dynamic extension of the XAI system following the addition of new CPS devices. This feature was successfully tested without requiring additional intervention. We address this topic in further detail in another paper, currently under review.
Importantly, our decentralized XAI solution relies on the assumption that CAN steps (conflict detection, abduction and simulation) can be performed locally. This was the case in the illustrated examples and most likely holds for numerous other scenarios. Still, for more complicated cases, our solution will need to be extended so as to enable XAI components to coordinate when performing these steps, as needed. As a further assumption, we considered that coherent system-wide explanations can be composed from partial explanations provided by local XAICs. While this is certainly the case for many situations, such as the ones we tested, further investigation is needed to identify potential cases where this assumption may not hold and to offer suitable XAI extensions accordingly.
7 Conclusion
We proposed the architecture of a decentralized Explanatory System which can be added to a generic CPS. This architecture features a minimalist coordinator which sequentially triggers local argumentation and observations, then collects the results in order to form a system-wide explanation. Playing out different smart home scenarios in our proof-of-concept implementation showed that the system is able to generate explanations as intended, which could prove useful for users to improve their decision making.
While these first observations are promising, many further refinements could be made to the Explanatory System, mostly by refining the implementation of the different modules composing it. Since we designed our architecture with the concern of genericity in mind, future work will seek to develop such improved modules and integrate them into the architecture.
Construction Control Organization with Use of Computer and Information Technologies in Context of Sustainable Development Providing
Zalina Ruslanovna Tuskaeva and Zaurbek Valerievich Albegov
FSBEI of HE “North Caucasian Institute of Mining and Metallurgy”, 44, Nikolaeva Street, Vladikavkaz 362021, Russian Federation
Abstract. Organizing construction control with computer and information technologies comes down to two areas: quality control and design control. Quality control is the more extensive area: it includes the design control process and addresses control tasks across the entire construction production process. Uncertainties are especially common in the construction industry, which makes effective management control necessary. The quality control process measures the quality characteristics of the work performed, compares them with the agreed project standards, and then analyzes any differences between the results obtained and the desired results to determine whether corrections are needed. Digital technologies are increasingly used in construction, and there is a gradual transition to organizing construction control with computer and information technologies. The article substantiates the use of digital technologies in construction and studies the features of computer and information technologies as an effective tool for improving the quality of construction production organization.
Keywords: Information technologies · Organization of construction processes · BIM models · Artificial intelligence · Construction control
1 Introduction
Construction production is a large and complex process, which is why a wide range of issues must be regulated and complex problems solved. New technologies make it possible to study and plan the elements of a construction project more accurately and to perform various construction tasks. The construction industry is on the cusp of digitalization, as digital technologies take an increasing place in our lives. Modern computer and information technologies make it possible not only to design construction projects of any complexity with high accuracy and in a short time, but also to model alternative project options and visualize them. Over time, their use will become an almost mandatory requirement for any specialist in the field of construction.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 739–743, 2022. https://doi.org/10.1007/978-3-030-82193-7_49
Z. R. Tuskaeva and Z. V. Albegov
2 Materials and Methods
The subject of the research is the organization of construction control and the consideration of possible effective scenarios for using computer and information technologies, in particular BIM (Building Information Model or Modeling) technologies, at the stages of renovation. Empirical methods were used to form the main directions of research and to determine the object and subject of the article. The article is based on the analysis and systematization of knowledge about BIM modeling and the implementation of computer-aided design (CAD) systems. The development of mobile computing means that computing and information technology is becoming a fundamental part of not only the design office, but also the construction industry.
3 Results
Implementing advanced building technologies can improve quality, efficiency, safety, sustainability and value for money. However, there is often a conflict between traditional industry practices and innovative new practices, and this is often blamed on the relatively low rate of technology transfer within the industry. Thus, the introduction of advanced building technologies and the use of computer and information technology require appropriate design, commitment from the entire project team, suitable procurement strategies, good quality control, appropriate training and careful commissioning.
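The quality control step named above, characterized in the abstract as measuring quality characteristics, comparing them with the agreed project standards and analyzing the differences, can be illustrated with a minimal sketch. The characteristic names, target values and tolerances below are hypothetical and are not taken from the article; the sketch only shows the compare-and-flag logic under those assumptions.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    name: str         # measured characteristic (hypothetical example names)
    target: float     # value agreed in the project standard
    tolerance: float  # allowed absolute deviation from the target


def check_quality(requirements, measurements):
    """Return (name, deviation) pairs for characteristics that need correction."""
    deviations = []
    for req in requirements:
        deviation = measurements[req.name] - req.target
        if abs(deviation) > req.tolerance:
            deviations.append((req.name, deviation))
    return deviations


# Assumed project standards and site measurements, for illustration only.
requirements = [
    Requirement("slab_thickness_mm", 200.0, 5.0),
    Requirement("column_spacing_mm", 6000.0, 10.0),
]
measurements = {"slab_thickness_mm": 192.0, "column_spacing_mm": 6004.0}

corrections = check_quality(requirements, measurements)
print(corrections)
```

In this assumed data set only the slab thickness falls outside its tolerance, so it is the only characteristic reported for correction.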
4 Discussion
The processes of organizing construction control, design and planning are the most important parts of the life cycle of a construction project. Recently, about 90% of megaprojects in the construction industry have been significantly delayed, over budget, or otherwise off plan. All these factors lead to unjustified costs for the companies involved in the construction process [1]. Computer and information technology and artificial intelligence solutions can help the construction process in many ways. Computer and information technology can improve control, construction scheduling and task management while keeping all stakeholders informed. These technologies also contribute to increased productivity in construction work. Artificial intelligence tools can detect potential collisions, delays and changes during construction.
Quality control in the construction industry is a mandatory process due to contract requirements, and contractors use it to ensure that their finished work meets the standards and specifications of the designers and other stakeholders involved in the work. For example, planners, engineers and architects regularly spend many hours designing a building. Control is one of the main project management tools in the context of construction design. The process of creating design options and organizing construction control of architectural statics and other parameters of a building (for example, compliance with building codes and whether the building meets all functional requirements) is especially laborious. This is where generative design, a design research process based on computer and information technology, comes into play [2]. As construction projects become more complex and the desired time frames for completion become more stringent, a comprehensive means of display for all aspects of the project is required [3].
Design and construction is an information-intensive activity in which a large number of people collaborate to produce complex, one-time developments. While information has historically been managed and communicated using paper systems and verbal instructions, supply chain integration, computer-aided design (CAD), building information modeling (BIM), and the advancement of mobile computing (MC) mean that information and communication technologies (ICT) are becoming a fundamental part of not only the design office, but also the construction site [4]. There is also a growing potential for automating building processes using ICTA (Information and Communication Technology and Automation) [5]. The solutions applied must meet today’s requirements and the criteria of efficiency at the project design stage, and traditional methods are obsolete. Modern building information modeling (BIM) systems give all project stakeholders, including architects, investors and construction teams, the opportunity to take a close-up look at the workflow of a construction project in order to optimize planning and project coordination [6]. Climbing the computer-aided design (CAD) technological ladder, BIM uses tools such as robotic tool stations and 3D (three-dimensional) laser scanners to give designers a deeper understanding of designs.
The transition to 3D models makes it possible to use AR (Augmented Reality) and VR (Virtual Reality) technologies at the design and control stages. These tools make BIM models available in the field, which allows faster and more accurate work in real time on site, without delays while waiting for updated designs [7]. The use of AR and VR technologies looks especially promising because it helps solve a considerable number of tasks, in particular:
• minimization of design, construction and implementation time;
• improvement of the quality of construction production;
• increase of productivity;
• simplification of the operation of engineering systems, etc.
A computer and information technology-based system with access to a database of many previously built building plans can develop design alternatives based on knowledge gained from those plans. The software examines all possible permutations of the solution and generates design alternatives that meet all previously specified requirements. It then learns at each iteration which design choice is the most appropriate, and so improves with each new project [8].
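The enumeration step described above, in which the software examines the permutations of a solution and keeps those that meet the specified requirements, can be sketched as follows. This is a minimal illustration, not the system discussed in [8]: the design parameters, their candidate values and the two requirements are hypothetical, and the learning step across projects is not modeled.

```python
from itertools import product


def generate_alternatives(options, constraints):
    """Enumerate every combination of options and keep those meeting all constraints."""
    names = list(options)
    alternatives = []
    for values in product(*(options[name] for name in names)):
        design = dict(zip(names, values))
        if all(check(design) for check in constraints):
            alternatives.append(design)
    return alternatives


# Hypothetical design space: each key is a parameter, each list its candidate values.
options = {
    "floors": [2, 3, 4],
    "floor_area_m2": [300, 400],
    "floor_height_m": [3.0, 3.5],
}

# Hypothetical requirements: minimum total area and maximum building height.
constraints = [
    lambda d: d["floors"] * d["floor_area_m2"] >= 1000,
    lambda d: d["floors"] * d["floor_height_m"] <= 12.0,
]

alternatives = generate_alternatives(options, constraints)

# Rank the feasible alternatives by total built area, used here as a stand-in for cost.
best = min(alternatives, key=lambda d: d["floors"] * d["floor_area_m2"])
print(len(alternatives), best)
```

Exhaustive enumeration is only practical for small design spaces; for realistic numbers of parameters, a search or optimization method would have to replace the full product.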
Thus, compared with the traditional scenario, the use of computer and information technology significantly improves the construction process because it allows many more parameters and permutations to be taken into account. Besides these improvements, generative design can also enhance creativity. First, it makes it possible to construct shapes and curves that architects could previously only dream of. Second, generative design sometimes provides design solutions that designers and engineers would never have considered. Generative design is thus a new design technology, based directly on computer and information technologies, that can independently generate three-dimensional models meeting specified conditions. Within the AEC (architecture, engineering, construction) industry, designers and engineers are starting to implement generative design. Recent studies and polls show that about a third of architects and engineers at least experiment with generative design. Given the potential benefits, it can be expected that generative design will be adopted further by the AEC industry [9].
Delays in construction projects are often due to problems and delays in the planning stage, which are especially common in large projects. One of the reasons is the need to prevent collisions with utilities, which is particularly important for the construction of urban infrastructure projects. Until now, engineering teams have spent weeks comparing project documentation with the utilities around a construction site. This requires numerous alterations and reevaluations and takes a lot of time and money. Computer and information technologies can greatly simplify this process. Artificial neural networks can perform collision detection within 24 h without the need for engineering teams. This is especially useful given the lack of design and construction engineers in many countries.
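To make the notion of a collision with utilities concrete, the following sketch checks planned elements against existing utilities using axis-aligned bounding boxes. It is a plain geometric overlap test, not the neural-network approach mentioned above, and all element names and coordinates are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Box:
    name: str
    lo: tuple  # (x, y, z) minimum corner in metres
    hi: tuple  # (x, y, z) maximum corner in metres


def overlaps(a, b):
    """Two axis-aligned boxes intersect if their extents overlap on every axis."""
    return all(a.lo[i] < b.hi[i] and b.lo[i] < a.hi[i] for i in range(3))


def find_clashes(planned, utilities):
    """Return (planned element, utility) name pairs whose boxes intersect."""
    return [(p.name, u.name) for p, u in product(planned, utilities) if overlaps(p, u)]


# Hypothetical planned foundations and existing underground utilities.
planned = [
    Box("foundation_A", (0, 0, -3), (10, 10, 0)),
    Box("foundation_B", (20, 0, -2), (30, 10, 0)),
]
utilities = [
    Box("water_main", (8, -5, -2.5), (9, 15, -1.5)),
    Box("gas_line", (15, -5, -1.5), (16, 15, -1.0)),
]

clashes = find_clashes(planned, utilities)
print(clashes)
```

The pairwise comparison is quadratic in the number of elements; a real model would typically use a spatial index and exact geometry rather than bounding boxes.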
The use of computer and information technology can not only detect potential collisions with utilities, but also find solutions and change plans accordingly [10]. Beyond their use throughout the entire life cycle of a construction project, information technology and artificial intelligence can also become an incentive for the next big step in the digitalization of manufacturers and/or distributors of building materials. This would increase efficiency throughout the value chain in functions such as purchasing, marketing and sales, manufacturing, logistics, customer service and after-sales services for building material manufacturers, vendors, and construction companies [11]. In procurement, information technology and artificial intelligence can help companies predict raw material prices and other input factors. For this task, price development charts and other factors affecting the price of raw materials are analyzed, after which the price is predicted and the optimal time for purchase is determined. With the help of information technology and artificial intelligence, it will thus be possible to automate more and more stages.
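The procurement idea above, predicting a raw material price and choosing a purchase time, can be illustrated with a deliberately simple sketch. It assumes a linear trend fitted by least squares to a hypothetical price history and ignores the other price factors the text mentions; it is not a method proposed in the article.

```python
def fit_line(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ..."""
    n = len(values)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(values, horizon):
    """Extrapolate the fitted trend for the next `horizon` periods."""
    slope, intercept = fit_line(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(horizon)]


# Hypothetical price history of a raw material, one value per period.
prices = [100.0, 102.0, 104.0, 106.0, 108.0]

predicted = forecast(prices, 3)
# Index of the cheapest forecast period, i.e. the suggested purchase time.
best_offset = min(range(len(predicted)), key=predicted.__getitem__)
print(predicted, best_offset)
```

With a steadily rising history the cheapest forecast period is the first one, so the sketch suggests buying as early as possible; a falling trend would shift the suggestion to the end of the horizon.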
5 Conclusion
To sum up, the use of information technology and artificial intelligence could be a lever for the next big leap towards better quality and greater efficiency in the construction industry. It also improves the utilization of production capacity, reduces production downtime, and makes it possible to detect production problems at an early stage. In addition, it is predicted to provide a clearer forecast of construction costs. However, although many companies have already identified the potential of information technology and artificial intelligence for their activities, the implementation of specific plans still faces certain problems. In light of the foregoing, it can be concluded that introducing and applying the latest information technologies, as well as automating planning processes with modern information and computer modeling tools, directly ensures the rational use of the means of production of a construction organization.
References
1. Baiburin, A., Kocharin, N.: Digital Technologies Application in Construction. A. Miller Library, Chelyabinsk (2020)
2. Jones, S.: National BIM standard – United States. National Institute of building sciences, BuildingSMART alliance (Version 2), 676 (2012)
3. Baiburin, A., Kocharin, N., Baibulin, D., Vaisman, S.: Reliability of Organizational and Technological Systems. SUSU Publishing Center, Chelyabinsk (2018)
4. The AEC (UK) committee: AEC (CAN) CanBIM designer’s committee: Implementing Canadian BIM Standards for the Architectural, Engineering and Construction industry based on international collaboration. AEC (CAN) BIM protocol. (Version 1), 54 (2012)
5. Karppinen, A., Lennox, M., Lehto, M.: Use of models in construction. Common BIM Requir. (13), 21 (2012)
6. Building and Construction Productivity Partnership: A Guide to Enabling BIM on Build Projects. New Zealand BIM Handbook, New Zealand (2014)
7. Talapov, V.: BIM Technology: The Essence and Features of the Buildings Information Modeling Implementation. DMK-Press, Moscow (2015)
8. Sharipov, R., Baiburin, A.: Russia Needs TIPS: Problems of Technical Creativity. Issue 2. New Time, Cheboksary (2009)
9. Lenkovskaya, R.: On the issue of certain problems arising in the course of construction activities in the Russian Federation. Probl. Econ. Legal Pract. 15(1), 121–124 (2019)
10. Ashby, W.: An Introduction to Cybernetics. Chapman & Hall ltd, London (1956)
11. Schwab, K.: The Fourth Industrial Revolution. Eksmo, Moscow (2016)
Computational Rational Engineering and Development: Synergies and Opportunities
Ramses Sala
Department of Mechanical and Process Engineering, Technische Universität Kaiserslautern, 67663 Kaiserslautern, Germany
[email protected]
Abstract. Research and development in computer technology and computational methods have resulted in a wide variety of valuable tools for Computer-Aided Engineering (CAE) and Industrial Engineering. However, despite the exponential increase in computational capabilities and Artificial Intelligence (AI) methods, many of the visionary perspectives on cybernetic automation of design, engineering, and development have not been successfully pursued or realized yet. While contemporary research trends and movements such as Industry 4.0 primarily target progress by connected automation in manufacturing and production, the objective of this paper is to survey progress and formulate perspectives targeted at the automation and autonomization of engineering development processes. Based on an interdisciplinary mini-review, this work identifies open challenges, synergies, and research opportunities towards the realization of resource-efficient cooperative engineering and development systems. In order to go beyond conventional human-centered, tool-based CAE approaches and realize Computational Intelligence Driven Engineering and Development processes, it is suggested to extend the framework of Computational Rationality to challenges in design, engineering and development.
Keywords: Computational Intelligence · Artificial Intelligence · Computer-Aided Engineering · Computational Rationality · CAE · CIDD · CRD · CRE
1 Introduction and Motivation
Advances in computer technology and computational science have provided crucial tools to aid the engineering and realization of a wide variety of mechanical structures and systems [1–3]. Examples of influential tools are the geometrical modeling by means of Computer-Aided Design (CAD) [4, 5], the simulation and analysis of virtual prototypes using Computer-Aided Engineering (CAE) tools [6], and automated machining using Computer-Aided Manufacturing (CAM) [1, 7]. The increase in computational engineering capabilities, however, also led to a progressive increase in the complexity of processes and products, which poses a massive challenge for modern industrial engineering [8]. After the 1990s, a paradigm shift in engineering design was expected due to the developments in the fields of Computational Intelligence (CI), Soft Computing (SC), Machine Learning (ML), and AI [9]. However, although the capabilities of computational tools for specific tasks in the engineering process have improved exponentially, the structure, organization, and paradigms of the overall design, engineering and development processes have been adapted only modestly [10]. Many of the visions and expectations on automated engineering and development systems formulated in the early literature [11, 12] have not yet been realized [13]. To address the challenges of increasing complexity in product design, engineering and development [8], new paradigms and research frameworks might be needed [10]. How can AI technologies and intelligent systems be realized effectively so that they enable the improvement and automation of industrial design, engineering, and development processes? To analyze and discuss the many aspects of this quest, seminal classical works as well as recent results from the fields of Systems Engineering, Computer Science, Computational Mechanics, Uncertainty Quantification, Operations Research, and Cognitive Science and Neuroscience are reviewed, with a focus on intersections related to the understanding and automation of problem-solving and decision-making in design, engineering, and development. Based on the presented interdisciplinary mini-review, progress, open challenges, synergies, and perspectives on directions for further research are identified, formulated, and discussed in the following sections.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 744–763, 2022. https://doi.org/10.1007/978-3-030-82193-7_50
2 Recent and Past Perspectives on Computer Systems for Automation of Engineering and Development
Relatively soon after computing machines or early computers became available for non-military purposes, they were applied in the development process of various engineering products [14]. Besides the obvious applications of computers for calculations, more revolutionary ideas, concepts, and theories for computer-aided design systems were established in the 1960s [4]. In particular, the development of graphical human-machine communication interfaces enabled new possibilities for Computer-Aided Design (CAD). Around the same time, new computational methods to model, simulate and optimize the response of complex systems and structures were also developed [15, 16]. Other early developments relevant to engineering automation were general problem-solving programs [17] and expert systems [18]. These and other seminal works initiated the research and development that eventually resulted in the wide variety of Computer-Aided technologies (CAX) [19] that provide today’s state-of-the-art tools for engineering development [5, 20–24].
In 1960, early visionary perspectives related to automation of the engineering process were presented in [11]. The paper described expected developments towards intuitive man-machine cooperation and interaction technologies that would enable computers to facilitate the problem formulation and decision-making processes for complex engineering endeavors. The described targets aimed to go beyond mechanical extensions and mere automation of prescribed tasks, resulting in a man-computer symbiosis that would enable thinking as no human brain has ever thought. “One of the main aims of man-computer symbiosis is to bring the computing machine effectively into the formulative parts of technical problems” [11].
R. Sala
Other visionary concepts of Intelligent Computer-Aided Engineering (ICAE) were described in [12]. The conceptual ideas were presented as a roadmap towards long-term targets for the development of computer programs or partners that could capture engineering knowledge to assist engineers in the design, realization, maintenance, and operation of engineering products. Some of the concepts identified as required to achieve ICAE were broad domain models, layered domain models, routine design, functional descriptions, qualitative simulation, and communication [12]. Furthermore, the importance of developing methods for the hierarchical decomposition of physical problems, and qualitative physics models to approximate the responses of the systems and subsystems, was highlighted. The necessity for long-term research commitments to go beyond incremental progress was also emphasized.
A perspective paper [25] identifying general open challenges in the field of AI and CI also identified aspects and challenges of importance to the automation of engineering and design. Human intelligence, and the type of intelligence measured by the Turing test, is highly multidimensional. These dimensions of intelligence are often considered separately, and many systems can only be considered “partially intelligent”. An important observation was that no adequate tests or performance measures are available to quantify the utility and integration of partially intelligent agents in AI systems [25]. Furthermore, it was highlighted that: “For an artifact, a computational intelligence, to be able to behave with high levels of performance on complex intellectual tasks, perhaps surpassing human level, it must have extensive knowledge of the domain” [25].
The perspective on intelligent machines in the context of engineering design [10] identified that most modern computer-aided design tools are still essentially extensions of engineering practices going back more than two centuries.
It was also highlighted that: “Today’s innovations in robotics, advanced materials and additive manufacturing require newer and more creative design processes, enabling an entirely new kind of Arsenale—an Arsenale in which computers work as our creative partners” [10]. From that perspective, computer-augmented design was identified as a next step beyond merely computer-aided design.
The article in [26] on cognitive AI systems discussed important bottlenecks and topics for further research targeting AI with human-level functionality. Many AI or CI systems that intend to aid humans cognitively can be categorized as (i) cognitive prostheses or (ii) cognitive orthotics. The aim of cognitive prosthetic systems is to operate independently before human supervision. An example of a prosthetic system is machine translation, such as Google Translate. Although its output generally needs human modification, it is considered a cognitive prosthetic because it operates fully independently before human interaction is needed. Cognitive orthotic systems are characterized by the intent to enhance human capabilities and require human-machine interactions. An important built-in quality ceiling of such systems is the communication with humans [26]. The work pointed out that “In order to burst through the quality ceiling and move toward comprehensive applications that are more like intelligent agents than mechanistic automata, the field must readdress newly available theories and methods, the development of systems featuring human-inspired computational models” (p. 7, [26]).
Relatively recent strategic research initiatives and trends such as “Industry 4.0” [27] and “Made in China 2025” have a strong emphasis on manufacturing and focus less on the engineering design and product development processes [28]. Although these new perspectives and projections on the future of industrial automation lean towards cyber-physical systems and advanced human-machine interactions, those visionary concepts still paint a rather human-centered picture of the execution (see also [29]). Why have intelligent systems as envisioned in [11, 12], with capabilities beyond the current CAX tools, not been realized yet [10, 30], despite all progress and advances in computation, simulation, ML, CI, and AI? Based on the articles discussed in this section, several trends, open issues, and obstacles towards automation in engineering and development can be identified and summarized:
1. There seems to have been a trend to focus on human-centered engineering development paradigms and automation approaches such as tool-based systems, cognitive orthotics, and man-computer symbiosis [26, 29].
2. Human-machine communication is still a bottleneck in current intelligent systems for automation in engineering and design [26, 29, 31].
3. Improved domain descriptions and models of the various agent tasks, environments, and resources in engineering and development processes are required [25, 26, 31].
4. The progress and success of AI for narrow tasks seems to have diverted attention from long-term, high-level goals in the automation of complex design and engineering processes towards the many lower-hanging fruits in the field of AI and automation [10, 12]. To break the quality ceiling, research that targets intelligent systems with higher-level capabilities is necessary [26].
In the following sections, general and domain-specific aspects central to the research and development of intelligent systems for design, engineering, and development are reviewed and discussed in order to highlight promising directions and areas for future research on automation and decision-making in engineering development.
3 Computational Rationality in Engineering Development
3.1 Domain Characteristics: Problem-Solving and Decision-Making in the Context of Industrial Design, Engineering, and Development
Industrial engineering and development are often associated with the resulting technological products and their impact on our environment. The resulting technological products are, however, only the tip of the iceberg of the engineering development process. Industrial engineering involves not only the design and engineering of a technological product, but also the planning and development of the processes and facilities involved in material extraction, manufacturing, control, maintenance, and recycling during the product life-cycle stages. Furthermore, not only the final product and the involved production processes, but also the product development process itself (the organization and structuring of all the involved activities) needs to be established and realized in a way that satisfies requirements on performance, quality, cost, sustainability, and other operational aspects. The following sections review aspects related to intelligence and rationality in the operational and executive processes of engineering development. For interesting aspects beyond rationality, related to the sustainability and ethics of the product and process objectives and requirements defined and set by humans, the reader is referred to [32–34].
Process and Problem Complexity. The development of engineering products and systems can involve thousands of people over several years. Increasing complexity is one of the biggest challenges in engineering design and modern product development [8]. Many products are becoming increasingly complex due to the integration and blending of various state-of-the-art technologies, such as composite materials, smart materials [35], and distributed control systems. Large-scale concurrent engineering on complex projects involves many tasks, sub-problems, various types of uncertainties [36, 37], decision-making based on incomplete information, and a dense web of information flows and interdependencies [38]. The engineering and product realization process of complex products has itself become a complex system, one that could be described as “organized complexity” [39].
Hierarchical Bounded Rationality. From an industrial engineering perspective, the development of a product generally involves a composition of many interdependent decisions and tasks in a complex hierarchical structure, all of which need to be solved using a common resource budget that must be allocated over all activities to achieve a common objective. The core challenge in industrial engineering is to organize and address the many sub-tasks in order to realize the overall objectives using only limited information, knowledge, and other resources. The industrial engineering context thus poses a scenario of Bounded Rationality (BR) [40] at the level of individual tasks as well as at the level of the organization [41].
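The budget aspect of this scenario, a common resource budget allocated over many activities, can be illustrated with a small sketch. It is only an illustration of resource-bounded allocation and not a method from [40–42]: the task names and utility values are hypothetical, and it assumes each task has diminishing marginal utility per additional resource unit, in which case greedy allocation is a reasonable strategy.

```python
def allocate_budget(tasks, budget):
    """Greedily assign resource units to the task with the highest next marginal utility.

    tasks maps a task name to a list of marginal utilities, one per extra unit,
    assumed to be non-increasing (diminishing returns).
    """
    allocation = {name: 0 for name in tasks}
    for _ in range(budget):
        candidates = [
            (gains[allocation[name]], name)
            for name, gains in tasks.items()
            if allocation[name] < len(gains)
        ]
        if not candidates:
            break  # every task is saturated; the remaining budget stays unused
        _, name = max(candidates)
        allocation[name] += 1
    return allocation


# Hypothetical sub-tasks of a development project and their marginal utilities.
tasks = {
    "simulation": [5.0, 3.0, 1.0],
    "testing": [4.0, 2.0, 0.5],
    "design_review": [2.5, 1.5, 0.2],
}

allocation = allocate_budget(tasks, 5)
print(allocation)
```

In a hierarchical setting, the same routine could be applied recursively, with each sub-task dividing its share among its own sub-tasks; that extension is not shown here.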
Although the environment and policies of agents dealing with technical decision problems and organizational problems might be very different, the general concepts from the framework of Computational Rationality (CR) [42] could be used to target further progress in the understanding and automation of engineering activities. Although there has been research on hierarchical decision-making [43–45], synergies with concepts from BR and CR for hierarchical decision problems in engineering development seem rather underexplored.
Uncertainties in Expected Utility and Resource Use. Challenging aspects in design, engineering, and development processes are the errors and uncertainties involved in the estimation of the system response behavior before its realization [36]. Although the response of physical systems can be approximated by means of virtual prototyping and simulations, reliable estimations for the simulation accuracy and effort are still difficult to obtain, especially for nonlinear systems. While there has been substantial progress in the areas of error estimation [46], uncertainty quantification [47–49], Global Sensitivity Analysis [50], and related areas [51] in academic settings, the application of these methods in industrial settings is still relatively rare. Therefore, further work targeting deeper integration of uncertainty quantification in industrial engineering and development processes would be beneficial.
Non-rational Design Criteria and Problem Specification. Although the postulation of the design objectives, requirements, and targets is often considered as non-rational
Computational Rational Engineering and Development
[52], many evaluation criteria used in engineering are heuristics in disguise. The high-level, truly non-rational designer preferences often require a translation or reformulation into lower-level technical goals, requirements, and objectives. The activities related to formulating and specifying technical objectives and requirements at various levels of detail are related to the value alignment problem [53] and reward specification in reinforcement learning (RL). Hierarchical (heuristic) sub-problem approximations and approximate rewards or utilities could play a role in problem-solving [54, 55]. Further development of approaches that combine data-mining and simulation workflows (e.g. [56, 57]) could also improve the formulation and specification of partial approximate design evaluation criteria, utility, and reward functions. Besides data and information mining to extract useful design specifications, effective languages are also required. Although several modeling methods and languages have been presented, there are still deficiencies in generality for requirement specification [58]. The work in [59] indicated that it is not even clear how to evaluate and compare the different modeling methods and languages. Relatively recently, reward modeling techniques for RL have also been developed, which can efficiently learn from (interactively communicated) human preferences for those decision problems where the evaluation criteria are difficult to specify in formal languages [60]. Since in engineering and development, not only the physical implementation of the systems but also the specification of goals, requirements, and targets can be complex, further work in these directions is required.
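As an illustration of learning an approximate utility from human preferences, the following minimal sketch fits a linear utility to pairwise comparisons with a logistic (Bradley-Terry style) model. It is not the method of [60]; the feature names (mass, stiffness), the preference pairs, and the functions `utility` and `fit_preferences` are hypothetical and chosen only for illustration.

```python
import math

def utility(weights, features):
    """Linear utility of a design described by a feature vector."""
    return sum(w * x for w, x in zip(weights, features))

def fit_preferences(pairs, n_features, learning_rate=0.1, epochs=200):
    """Fit weights so that preferred designs receive a higher utility.

    pairs: list of (preferred_features, rejected_features).
    Assumed model: P(preferred beats rejected) = sigmoid(u(preferred) - u(rejected)).
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in pairs:
            diff = [p - r for p, r in zip(preferred, rejected)]
            prob = 1.0 / (1.0 + math.exp(-utility(weights, diff)))
            # Gradient of the log-likelihood for this pair is (1 - prob) * diff.
            weights = [w + learning_rate * (1.0 - prob) * d
                       for w, d in zip(weights, diff)]
    return weights

# Hypothetical designs as (mass, stiffness); the assumed "human" prefers light and stiff.
pairs = [
    ((1.0, 2.0), (2.0, 2.0)),
    ((1.0, 3.0), (1.0, 1.0)),
    ((0.5, 2.5), (1.5, 1.5)),
]
weights = fit_preferences(pairs, n_features=2)
```

The learned weights penalize mass and reward stiffness, so the fitted utility can rank designs that were never compared directly. The sketch uses no regularization, so on perfectly consistent preferences the weights keep growing with the number of epochs; a practical reward model would need to address this.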
3.2 Interdisciplinary Opportunities and Synergies
Computational Rationality. “A rational agent is one that acts so as to achieve the best outcome or, when there is uncertainty, the best expected outcome” [53]. Rational agents thus seem the ideal candidates for many activities, including decision-making and problem-solving in engineering and development. Because knowledge, time, and other resources are limited in an industrial engineering development setting, while there are many tasks and decision problems, agents must decide and act under conditions of Computational Rationality (CR). In a nutshell: the challenge is not only what to decide, but also how to decide, given the available resources. The meta-level decisions about resource allocation and method or policy selection in agent-based bounded rational decision-making can be based on metareasoning using meta-level models or on heuristic decision policies [42]. The framework of CR [42, 61] aims to unify the fields of AI, cognitive science, and neuroscience in order to exploit synergies between the fields. The goal of CR is: “Identifying decisions with the highest expected utility, while taking into consideration the cost of computation in complex real-world problems in which most relevant calculations can only be approximated” [42]. This is also relevant in the context of understanding, formalizing, improving, and eventually automating engineering development processes. The perspective of understanding intelligence as computational rationality is in principle domain agnostic and open to consider human, natural, as well as artificial systems and activities.
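The meta-level question of how to decide can be sketched as a simple selection rule: among the solution methods that fit the available budget, pick the one whose estimated utility, net of its computation cost, is highest. The sketch below is an illustration under strong assumptions and not a method from [42]: the method names, the numbers, and the premise that utility and cost are expressed in the same units are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Method:
    name: str
    expected_utility: float  # estimated value of the decision the method would produce
    compute_cost: float      # estimated resource use, assumed to be in utility units

def select_method(methods, budget):
    """Return the affordable method with the highest net expected utility, or None."""
    affordable = [m for m in methods if m.compute_cost <= budget]
    if not affordable:
        return None
    return max(affordable, key=lambda m: m.expected_utility - m.compute_cost)

# Hypothetical candidates for one engineering decision task.
candidates = [
    Method("rule_of_thumb", expected_utility=5.0, compute_cost=0.1),
    Method("surrogate_model", expected_utility=8.0, compute_cost=2.0),
    Method("full_simulation", expected_utility=9.0, compute_cost=6.0),
]

best_small = select_method(candidates, budget=1.0)   # only the heuristic is affordable
best_large = select_method(candidates, budget=10.0)  # all methods are affordable
```

With a small budget only the heuristic qualifies. With a large budget the surrogate model is selected rather than the full simulation, because the extra accuracy of the simulation does not pay for its extra cost under these assumed numbers.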
Neuroscience and Cognition. Engineering and development involve decision-making and problem-solving under limited knowledge, time, and other resources. In the framework of CR [42], two directions to address such problems are model-based metareasoning and the application of heuristic methods. Limited resources can make detailed metareasoning or formal methods infeasible and can justify the use of heuristics for artificial as well as human agents [42, 62]. In [63, 64] systematic errors and biases in common human heuristics and interesting insights on fast heuristics and slow reasoning were identified. The work in [65, 66] highlighted the importance, for human agents, of matching decision heuristics with patterns in the environment. In [67] various Bayesian-based approaches to build intelligent systems using reverse-engineering of human cognitive functionalities and development were reviewed. This work emphasized the importance of language and hierarchical flexible structured data representations for cognitive capabilities such as abstraction and generalization [67]. In [68], concepts of BR are combined with set-based design, meta-modelling, and multi-objective optimization to improve decentralized design problems. Investigations in [69] on a human grandmaster chess player indicated the importance of recognition compared to look-ahead search. The theory of Ecological Rationality formalizes that the rationality of a decision policy depends on the circumstances [70]. This conclusion matches in spirit with the results of the No Free Lunch (NFL) theorems [71, 72]. Improved understanding of decisions and meta-decisions in human cognitive processes and other aspects of psychology could contribute to insight and development of computational methods in AI, engineering, and science [62, 73–75], and maybe also vice versa.
Design and Engineering Science. Design and Engineering can benefit from strategic, systematic, and scientific approaches [76, 77]. In order to use computers and computational methods to solve design and engineering problems, it could help to establish formal (mathematical) descriptions of the problems or tasks of interest [78, 79]. Aspects related to creative design and problem-solving in the development process can be transformed into constraint satisfaction, optimization, and search problems using Formal Design Theory (FDT) [80]. The use and extension of FDT and other formal design approaches (see also [81]) could support the frontiers of research on the automation of engineering design. Surveys on various theories and process models of engineering design have concluded that presently no single model can address all issues and that different models may be useful for different situations [38, 82]. There are still many aspects of design and engineering which have not yet been rigorously formalized and which thus still pose open challenges and opportunities. Education and further research on general formal design theories and engineering science therefore seem of crucial importance for the automation of engineering design and development.
Computational Physics and Uncertainty Quantification. To make predictions and inferences on systems and processes, numerical models and simulations can be used. Computational Physics and Mechanics based models are commonly used in robotics, control, and computational engineering of physical systems. In [83], a differentiable physics simulation was presented, which enabled the use of gradient-based methods in the control and optimization of physical/mechanical systems. A new approach to use
physics simulations combined with multi-level path planning in the context of robotics was described in [84]. Conversely, methods to learn and infer physical principles from data have been presented in [85, 86]. The accuracy of physical models and simulations in general is limited due to errors and uncertainties and requires tradeoffs between accuracy and computational effort. Important approaches to address and investigate these accuracy limitations are Validation and Verification (VV) [87], Uncertainty Quantification (UQ) [48, 88], and Global Sensitivity Analysis [50].
Optimization and Control. Many sub-tasks and design problems in engineering can be formulated as optimization and control problems. In combination with physics engines or numerical models and simulations, the approximate representation of the properties and behavior of physical systems or processes can be optimized with respect to specified design objectives and constraints. The simulations and responses involved are, however, often relatively complex and computationally non-trivial, such that the selection and tuning of effective optimization algorithms is difficult. Optimization and automated design approaches and workflows have been developed and investigated for applications such as topology optimization and generative design of structures [89, 90], circuit design [91, 92], elevator systems [93], bioelectrochemical systems [94], automotive control actuators [95], and electric vehicle transmissions [96]. These examples demonstrate the use and potential of automated Modeling, Simulation, and Optimization (MSO) workflows for specific applications of industrial relevance. General frameworks for MSO workflows that include automated agents for decisions regarding modeling accuracy, model parameterization, algorithm selection, and computational resources are, however, still lacking, and seem a promising direction for further research.
In the context of massive complex software systems, the use of Bayesian Optimization was proposed relatively recently in [97]. In [97] Bayesian Optimization was recognized as a powerful tool to address the many distributed design choices, and a key ingredient to take humans out of the loop in the development of complex software systems. The Bayesian perspective also highlights the importance of model selection, the consideration of uncertainty, and learning or model updating. When design problems are formulated as true black-box optimization or search problems over finite search spaces, the NFL theorems [71, 72] apply. These theorems imply that no universally superior algorithms exist when performance is averaged over all possible problems. Thus, the remaining quest is to match specific problem classes of task-environment-resource combinations with specific efficient policies or algorithms. This, in turn, highlights the importance of: a) problem characterization and categorization (or fitness landscape analysis) [98–101]; b) systematic and generalizable optimization algorithm benchmarking [102–104]; c) algorithm performance analysis and selection [99, 105–107]. While there has been increasing interest in algorithm selection for black-box optimization problems in a general context [99, 108] as well as for simulation-based engineering applications [100, 109], there are still many open challenges of scientific and practical relevance related to optimization algorithm benchmarking, selection, and analysis [104, 106]. The extended process perspective of optimization to the meta-level also highlights the need for optimization algorithm performance measures that go beyond fixed-budget and fixed-target performance evaluation criteria, to also include
measures that can be used in dynamic hierarchical settings. Such settings involve decisions regarding method selection and resource allocation, which require more complex performance measures involving estimations of the expected utility per resource use, also considering the uncertainties.
Operations Research and Systems Engineering. Although not always directly targeted at computer-based automation, interesting methods and strategies to manage the design of complex systems have been developed in the fields of Engineering Management, Operations Research, and Systems Engineering, which could also benefit the automation of engineering and development processes [110–112]. One research direction towards a general approach to manage complexity in systems engineering is Model-Based Systems Engineering [113]. There are, however, still many open challenges, and further work is needed to close the gaps between theory and implementation [114]. One of these challenges is to establish models that not only estimate the expected results but also quantify the uncertainties. One interesting contribution in this research direction is the concept of Experimentable Digital Twins (EDT) [115]. The idea is to establish communication between virtual twin models, which represent the data, functions, and capabilities of real objects or processes, in networks of communicating EDTs on a system level, in order to realize complex control systems. In [116] the potential applicability of RL and ML in the domain of Systems Engineering was discussed, and further work in this promising direction was recommended. AI and CI cover many areas of high relevance to intelligent systems in general [53, 117]. The following sub-sections highlight recent progress from various sub-fields of specific importance for automation in design and engineering processes.
Automated Software Development. Interesting automated software testing and design approaches have been presented that could contribute to the automation of engineering and development of physical products [118–121].
Agent and Multi-Agent Models, Systems and Control. Complex processes can be modeled and controlled by means of agent-based and multi-agent models and systems [122–125]. Multi-agent based models and systems can be combined with systematic management and systems engineering approaches [123, 126].
Knowledge-Based Systems for applications in Engineering, often referred to as Knowledge-Based Engineering (KBE), is another approach to capture, store, and reuse information that could be used in engineering and development [127, 128]. A review of developments and open challenges for KBE systems is presented in [22].
Robotics and Control. In the research field of evolutionary robotics, several methods have been presented that enable the design of the morphology and control of interesting virtual creatures/robots [129, 130]. In [131] aspects of the development and production have also been considered.
Machine Learning. Deep artificial neural network-based approaches have been developed and used for generative design and analysis of materials and biomechanical products
[132, 133]. In [89, 90] deep neural networks have been combined with topology optimization in the design and optimization of mechanical structures. Reinforcement Learning (RL) approaches have been developed to achieve impressive performance in many applications such as games, control, and simulation-based optimization [60, 134, 135]. Recently, RL methods have also been applied in the field of design and engineering, such as drug and circuit design [136, 137]. Reviews of advances in reinforcement learning are provided in [135, 138, 139].
Evolutionary Computing and Nature-Inspired Algorithms have been used in the design and optimization of software and mechanical systems [90, 91, 119, 140].
Fuzzy Logic approaches enable the consideration of uncertainties in decision-making and have been used in safety engineering and inference systems [51, 141].
4 Discussion and Perspectives
4.1 Mind the Gap: Intelligent Systems for Design, Engineering, and Development
Contemporary design, engineering, and development paradigms are still rather human-centered in the execution stage. In a nutshell: engineering development processes are generally executed by a collective network of human agents that drive and control a wide variety of computational tools and automated workflows. In conventional tool-based engineering development paradigms, the involved “narrow” AI agents are rather passive, and require well-defined problems as well as pre- and postprocessing by human agents. Many of the essential activities in design, engineering, and development processes involve aspects of intelligence (e.g., flexibility, adaptivity, problem decomposition, learning, planning, and resource allocation) that are currently still performed and provided to the process by the human agents in the loop using intuition, experience, reasoning, heuristics, and creativity. Improved understanding and automation of these and similar qualities and capabilities require further interdisciplinary research and progress.
Towards Computational Rational Processes: Interdisciplinary Paradigms
The framework of computational rationality [42, 61] aims to unify the fields of AI, cognitive science, and neuroscience with the goal of exploiting synergies in improving the understanding of decision-making and problem-solving under conditions with limited resources for reasoning. In the context of design, engineering, and development processes, problem-solving and decision-making not only involve CR but also intersect with fields such as Design, Engineering Science, Operations Research, Systems Engineering, AI, Computational Physics, Uncertainty Quantification, Optimization, and Control.
A joint framework of Computational Rational Design (CRd), Engineering (CRE), and Development (CRD) could bring insightful and rewarding synergies in research and development among all of the involved fields. Besides the economic and technological incentives, there is an abundance of possibilities to collect data and feedback from trained and experienced human agents in the respective fields. The central research goal
of CRX is to understand and improve how the decisions, policies, agents, and organizational structures with the highest expected utility for the overall process X can be identified and realized, given the available resources. The objectives can go beyond increased understanding and automation of individual human-level capabilities and include aspects related to collective human intelligence and AI-human hybrid intelligence. Human intelligence has been described using agent-based models as a “Society of Mind” in [125]. Improved understanding of complex (engineering) processes involving collective intelligence over cooperative agents requires an inter- or even a transdisciplinary approach and a common vocabulary [142].
Technical Goals and Perspectives: Computational Intelligent Driven Development
Computationally intelligent systems with higher-level competencies could increase the overall capability and efficiency of design, engineering, and development processes. Besides the current trends in the development of a diversity of AI agents for specific narrow tasks, it could be rewarding to set goals towards the realization of composite intelligent systems that have the capabilities to perform higher-level tasks and which could eventually drive complex design, engineering, and development processes. Computational Intelligence-Driven Engineering (CIDE) and Development (CIDD) could serve as technical goals towards the automation of engineering and development beyond the current state-of-the-art tool-based “computer-aided” approaches. CIDE could be an initial mid-to-long-term milestone with a focus on automated and autonomous engineering design. CIDD could be the next long-term milestone, additionally including further consideration of a wider range of realization aspects such as the engineering of the manufacturing process and extended product life-cycle impact factors.
The development of intelligent systems that are able to “drive” engineering and development processes requires more than just connecting the many narrow-capability agents together in a workflow. Although much can be learned from automated manufacturing systems developed using the Industry 4.0 paradigm, the processes and tasks in design, engineering, and development are more complicated and complex and require the collective of agents to work as an integrated hierarchical system to handle demanding interactive higher-level cognitive tasks. Agents or systems that are more flexible, with increasing capabilities in areas such as problem recognition, problem decomposition or disentanglement, adaptivity, planning and resource allocation, method selection, cooperation, self-reflectivity, and learning, are therefore needed.
Scientific Goals and Perspectives: Computational Rational Development
Computational systems and agents that can address higher-level and complex engineering development tasks are still an open challenge in science, research, and technology. History indicates that systems are generally realized with increasing levels of complexity when considered functionally and chronologically. Therefore, targets and progress in the direction of systems and agents for gradually increasing levels of complexity and generality are not only of technological importance but could also contribute towards Artificial General Intelligence (AGI). The scientific challenge of CRD goes beyond the technical goal of establishing programmed or trained learning systems that can deal with specific types of complex engineering development tasks: the overall aim is to establish the frameworks, theories, and methods that enable the realization of
intelligent systems that are capable of higher-level tasks of increasing complexity that feature aspects of development. Understanding and realizing intelligent systems that are capable of causal inference (concluding how things are and how they will be) [143] is an important step towards systems that can grasp features of development (realizing how desirable things that have never been could be achieved). Besides the challenges of realizing such systems, aspects of safety and ethics also require research consideration [144, 145]. Both inductive research with reasoning and generalization from the specific, as well as deductive research with reasoning from general theories, can be valuable to understand and create the next generation of intelligent systems. It could therefore be beneficial to establish transdisciplinary research frameworks and programs with the goal of increasing the understanding of computationally rational decision-making and problem-solving for complex engineering development tasks and processes by intelligent systems with bounded resources.
4.2 Open Challenges and Prospective Research Directions
Improved understanding and the realization of intelligent systems for design, engineering, and development involve a variety of open multidisciplinary challenges at different process levels:
1. Domain Knowledge, Problem Specification, and Description: Improved methods to formalize and describe the various decision-tasks, activities, environments, and resources that typically occur in engineering development are necessary.
2. Task and Problem Decomposition and Recognition: Research on methods for the characterization, decomposition, categorization, and recognition of tasks and decision problems into sub-tasks/problems.
3. Policy Modeling and Evaluation: Development of methods for the estimation and description of the expected performance, resource requirements, and costs for the different available solution procedures and strategies for the overall and sub-problems, under consideration of the available resources and the involved uncertainties.
4. Policy Selection, Planning, and Resource Allocation: The endowment of agents or agent-based systems with capabilities for meta-level reasoning regarding policy selection, planning, and resource allocation based on systematic evaluation of the sub-problems.
5. Adaptive Reflective Agents: Improvement of methods to enable agents to reflect on their true performance after execution w.r.t. their estimated performance, in order to update and learn performance estimates for policy selection.
6. Organizing the Society of Mind: Development of improved methods to link, combine, and organize “narrow” AI systems together, in ways such that the efficiency or capabilities of the integrated system exceed those of the separate systems.
7. Information Representation and Communication: Investigation and development of effective representations and/or languages to store and communicate problems, solution procedures, and results in ways that enable recognition, generalization, and adaptation for future tasks and problems.
8. Language, Interaction, and Communication: Development of ways to improve human-machine and machine-human interactions. This includes not only the communication interfaces but also the information, structure, language, and context being communicated.
9. Education: Cross-disciplinary education and training in AI, design, engineering, and related fields to empower the capabilities of human agents to develop and improve automation systems.
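Challenges 3 to 5 in the list above describe a loop of estimating policy performance, selecting a policy, and reflecting on the observed outcome. A minimal sketch of such a loop is given below. The class `ReflectiveSelector`, the policy names, and all numbers are hypothetical, and the update rule (an exponential moving average with greedy selection) is only one simple choice among many.

```python
class ReflectiveSelector:
    """Keeps a performance estimate per policy and updates it after execution."""

    def __init__(self, prior_estimates, learning_rate=0.5):
        self.estimates = dict(prior_estimates)
        self.learning_rate = learning_rate

    def select(self):
        # Policy selection: pick the policy with the highest current estimate.
        return max(self.estimates, key=self.estimates.get)

    def reflect(self, policy, observed):
        # Reflection: move the estimate towards the performance actually observed.
        old = self.estimates[policy]
        self.estimates[policy] = old + self.learning_rate * (observed - old)

# Assumed true utility per unit of resource; unknown to the agent.
true_performance = {"gradient_based": 0.2, "evolutionary": 0.6}

# The agent starts with an overly optimistic prior for the gradient-based policy.
agent = ReflectiveSelector({"gradient_based": 1.0, "evolutionary": 0.5})

history = []
for _ in range(5):
    policy = agent.select()
    agent.reflect(policy, true_performance[policy])  # deterministic outcome for clarity
    history.append(policy)
```

After two disappointing executions the optimistic estimate drops below that of the alternative, and the agent switches policies. The selection is purely greedy, so a pessimistic prior could permanently hide a good policy; handling that would require some form of exploration or explicit uncertainty estimates.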
5 Concluding Remarks
In order to make progress towards intelligent systems which are able to efficiently realize high-level design, engineering, and development processes, it is necessary to increase the understanding of computational rationality in the context of the complex hierarchically structured task and decision environments occurring in these application domains. To effectively increase the required understanding of the many involved factors, the knowledge and research from various disciplines could be exploited and explored in the scope of transdisciplinary research frameworks such as CRD. This paper highlighted important contributions from various research disciplines, focusing on their intersections related to problem-solving and decision-making processes in design, engineering, and development. Based on the presented mini-review, specific open challenges have been identified, and a road map of future research directions through an interdisciplinary research framework is presented. The overall objective of this contribution was, however, not to restrict future research to specific directions but to motivate and stir up an interdisciplinary discussion and movement to set challenging targets and initiate innovative research. The presented perspectives could extend Herbert Simon’s “science of design” [79], towards a science of systems that purposefully design, engineer, and develop.
References 1. Matta, A.K., Raju, D.R., Suman, K.N.S.: The integration of CAD/CAM and rapid prototyping in product development: a review. Mat. Today: Proc. 2, 3438–3445 (2015) 2. Harish, V., Kumar, A.: A review on modeling and simulation of building energy systems. Renew. Sustain. Energy Rev. 56, 1272–1292 (2016) 3. O’Brien, J.M., Young, T.M., O’Mahoney, D.C., Griffin, P.C.: Horizontal axis wind turbine research: a review of commercial CFD, FE codes and experimental practices. Prog. Aerosp. Sci. 92, 1–24 (2017) 4. Coons, S.A.: An outline of the requirements for a computer-aided design system. In: Proceedings of the Spring Joint Computer Conference, May 21–23, pp. 299–304 (1963) 5. Hirz, M., Rossbacher, P., Gulanová, J.: Future trends in CAD–from the perspective of automotive industry. Comput. Aided Des. Appl. 14, 734–741 (2017) 6. Park, H.-S., Dang, X.-P.: Structural optimization based on CAD–CAE integration and metamodeling techniques. Comput. Aided Des. 42, 889–902 (2010) 7. Crowley, T.H.: The computer as an aid to the design and manufacture of systems. Proc. IEEE 51, 513 (1963) 8. ElMaraghy, W., ElMaraghy, H., Tomiyama, T., Monostori, L.: Complexity in engineering design and manufacturing. CIRP Ann. 61, 793–814 (2012)
9. Saridakis, K.M., Dentsoras, A.J.: Integration of computational intelligence applications in engineering design. In: Darzentas, J., Vouros, G.A., Vosinakis, S., Arnellos, A. (eds.) SETN 2008. LNCS (LNAI), vol. 5138, pp. 276–287. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87881-0_25 10. Regli, W.C.: Design and intelligent machines. AI Mag. 38, 63–65 (2017) 11. Licklider, J.C.: Man-computer symbiosis. In: IRE Transactions on Human Factors in Electronics, pp. 4–11 (1960) 12. Forbus, K.D.: Intelligent computer-aided engineering. AI Mag. 9, 23 (1988) 13. Hehenberger, P., et al.: Design, modelling, simulation and integration of cyber physical systems: methods and applications. Comput. Ind. 82, 273–289 (2016) 14. Strang, C.R.: Computing machines in aircraft engineering. In: 1951 International Workshop on Managing Requirements Knowledge, p. 94. IEEE (1951) 15. Clough, R.W.: The finite element method in plane stress analysis. In: Proceedings of 2nd ASCE Conference on Electronic Computation, Pittsburgh, 8–9 September 1960 16. Schmit, L.A.: Structural design by systematic synthesis. In: Proceedings of the Second National Conference on Electronic Computation, ASCE, September 1960 17. Newell, A., Shaw, J.C., Simon, H.A.: Report on a general problem solving program. In: IFIP Congress, Pittsburgh, PA, p. 64 (1959) 18. Feigenbaum, E.A., Lederberg, J.: Mechanization of inductive inference in organic chemistry. In: Kleinmuntz, B., Cattell, R.B. (eds.) Formal Representation of Human Judgment. Wiley, New York (1968) 19. Dankwort, C.W., Weidlich, R., Guenther, B., Blaurock, J.E.: Engineers’ CAx education–it’s not only CAD. Comput. Aided Des. 36, 1439–1450 (2004) 20. Antonietti, P.F., et al.: Review of discontinuous Galerkin finite element methods for partial differential equations on complicated domains. In: Barrenechea, G.R., Brezzi, F., Cangiani, A., Georgoulis, E.H. (eds.)
Building bridges: connections and challenges in modern approaches to numerical partial differential equations. LNCSE, vol. 114, pp. 279–308. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-41640-3_9 21. Zawawi, M.H., et al.: A review: fundamentals of computational fluid dynamics (CFD). In: AIP Conference Proceedings, p. 020252. AIP Publishing LLC (2018) 22. Łukaszewicz, A., Szafran, K., Józwik, J.: CAx techniques used in UAV design process. In: 2020 IEEE 7th International Workshop on Metrology for AeroSpace (MetroAeroSpace), pp. 95–98. IEEE (2020) 23. Plappert, S., Gembarski, P.C., Lachmayer, R.: The use of knowledge-based engineering systems and artificial intelligence in product development: a snapshot. In: Świątek, J., Borzemski, L., Wilimowska, Z. (eds.) ISAT 2019. AISC, vol. 1051, pp. 62–73. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-30604-5_6 24. Leondes, C.T.: Computer-Aided Design, Engineering, and Manufacturing: Systems Techniques and Applications the Design of Manufacturing Systems, vol. V. CRC Press (2019) 25. Feigenbaum, E.A.: Some challenges and grand challenges for computational intelligence. Journal of the ACM (JACM) 50, 32–40 (2003) 26. Nirenburg, S.: Cognitive systems: toward human-level functionality. AI Mag. 38, 5–12 (2017) 27. Kagermann, H., Lukas, W.-D., Wahlster, W.: Industrie 4.0: Mit dem Internet der Dinge auf dem Weg zur 4. industriellen Revolution. VDI nachrichten 13, 2 (2011) 28. Lu, Y.: Industry 4.0: a survey on technologies, applications and open research issues. J. Ind. Inf. Integr. 6, 1–10 (2017) 29. Romero, D., Stahre, J., Wuest, T.: Towards an operator 4.0 typology: a human-centric perspective on the fourth industrial revolution technologies. In: Proceedings of the International
758
30.
31. 32. 33.
34. 35.
36.
37. 38. 39. 40. 41. 42. 43. 44.
45. 46. 47.
48. 49.
50.
R. Sala Conference on Computers and Industrial Engineering (CIE46), pp. 29–31. Tianjin, China (2016) Lesh, N., Marks, J., Rich, C., Sidner, C.L.: “ Man-Computer symbiosis” revisited: achieving natural communication and collaboration with computers. IEICE Trans. Inf. Syst. 87, 1290– 1298 (2004) Regli, W.C., Hu, X., Atwood, M., Sun, W.: A survey of design rationale systems: approaches, representation, capture and retrieval. Eng. Comput. 16, 209–235 (2000) Mulvenna, M., Boger, J., Bond, R.: Ethical by design: a manifesto. In: Proceedings of the European Conference on Cognitive Ergonomics 2017, pp. 51–54 (2017) da Luz, L.M., de Francisco, A.C., Piekarski, C.M., Salvador, R.: Integrating life cycle assessment in the product development process: a methodological approach. J. Clean. Prod. 193, 28–42 (2018) Klotz, L., Weber, E., Johnson, E., et al.: Beyond rationality in engineering design for sustainability. Nat. Sustain. 1, 225–233 (2018) Vyas, G.M., Andre, A., Sala, R.: Toward lightweight smart automotive hood structures for head impact mitigation: integration of active stiffness control composites. J. Intell. Mater. Syst. Struct. 31, 71–83 (2020) De Weck, O., Eckert, C.M., Clarkson, P.J.: A classification of uncertainty for early product and system design. In: DS 42: Proceedings of ICED 2007, the 16th International Conference on Engineering Design, Paris, France, 28–31 July 2007, pp. 159–160 (exec. Summ.), full paper no. DS42_P_480 (2007) Eckert, C.M., Clarkson, P.J.: Planning development processes for complex products. Res. Eng. Des. 21, 153–171 (2010) Wynn, D.C., Clarkson, P.J.: Process models in design and development. Res. Eng. Des. 29(2), 161–202 (2017). https://doi.org/10.1007/s00163-017-0262-7 Weaver, W.: Science and complexity. Am. Sci. 36, 536–544 (1948) Simon, H.A.: Models of Bounded Rationality. MIT Press, Cambridge (1982) Simon, H.A.: Bounded rationality and organizational learning. Organ. Sci. 
2, 125–134 (1991) Gershman, S.J., Horvitz, E.J., Tenenbaum, J.B.: Computational rationality: a converging paradigm for intelligence in brains, minds, and machines. Science 349, 273–278 (2015) Sethi, S.P., Zhang, Q.: Hierarchical Decision Making in Stochastic Manufacturing Systems. Springer, New York (2012)https://doi.org/10.1007/978-1-4612-0285-1 Feyzabadi, S., Carpin, S.: Planning using hierarchical constrained Markov decision processes. Auton. Robot. 41(8), 1589–1607 (2017). https://doi.org/10.1007/s10514-0179630-4 Hu, H., et al.: Hierarchical decision making by generating and following natural language instructions. arXiv preprint arXiv:190600744 (2019) Ainsworth, M., Oden, J.T.: A posteriori error estimation in finite element analysis. Comput. Methods Appl. Mech. Eng. 142, 1–88 (1997) Marelli, S., Sudret, B.: UQLab: a framework for uncertainty quantification in Matlab. In: Vulnerability, Uncertainty, and Risk: Quantification, Mitigation, and Management, pp. 2554–2563 (2014) Ghanem, R., Higdon, D., Owhadi, H.: Handbook of Uncertainty Quantification. Springer, Cham (2017) Sudret, B., et al.: Recent developments in surrogate modelling for uncertainty quantification. In: 3rd International Conference on Vulnerability and Risk Analysis and Management (ICVRAM 2018). ETH Zurich, Risk, Safety and Uncertainty Quantification (2018) Iooss, B., Lemaître, P.: A review on global sensitivity analysis methods. In: Dellino, G., Meloni, C. (eds.) Uncertainty management in simulation-optimization of complex systems. ORSIS, vol. 59, pp. 101–122. Springer, Boston (2015). https://doi.org/10.1007/978-1-48997547-8_5
Computational Rational Engineering and Development
759
51. Kabir, S., Papadopoulos, Y.: A review of applications of fuzzy sets to safety and reliability engineering. Int. J. Appro. Reason. 100, 29–55 (2018) 52. Simon, H.A.: Decision making: rational, nonrational, and irrational. Educ. Adm. Q. 29, 392–411 (1993) 53. Russel, S., Norvig, P.: Artificial Intelligence: A Modern Approach, 4th edn. Pearson Education Inc., Boston (2021) 54. Marthi, B.: Automatic shaping and decomposition of reward functions. In: Proceedings of the 24th International Conference on Machine Learning, pp. 601–608 (2007) 55. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cyber. 4, 100–107 (1968) 56. Heese, R., Walczak, M.ł, Morand, L., Helm, D., Bortz, M.: The good, the bad and the ugly: augmenting a black-box model with expert knowledge. In: Tetko, I.V., K˚urková, V., Karpov, P., Theis, F. (eds.) ICANN 2019. LNCS, vol. 11731, pp. 391–395. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30493-5_38 57. Asprion, N., et al.: Gray-box modeling for the optimization of chemical processes. Chem. Ing. Tech. 91, 305–313 (2019) 58. Glinz, M.: Problems and deficiencies of UML as a requirements specification language. In: Tenth International Workshop on Software Specification and Design. IWSSD-10 2000, pp. 11–22. IEEE (2000) 59. Siau, K., Rossi, M.: Evaluation techniques for systems analysis and design modelling methods – a review and comparative analysis. Inf. Syst. J. 21, 249–268 (2011). https://doi.org/ 10.1111/j.1365-2575.2007.00255.x 60. Christiano, P., et al.: Deep reinforcement learning from human preferences. arXiv preprint arXiv:170603741 (2017) 61. Lewis, R.L., Howes, A., Singh, S.: Computational rationality: linking mechanism and behavior through bounded utility maximization. Top. Cogn. Sci. 6, 279–311 (2014) 62. Griffiths, T.L., Lieder, F., Goodman, N.D.: Rational use of cognitive resources: levels of analysis between the computational and the algorithmic. 
Top. Cogn. Sci. 7, 217–229 (2015) 63. Tversky, A., Kahneman, D.: Judgment under uncertainty: heuristics and biases. Science 185, 1124–1131 (1974) 64. Kahneman, D.: Thinking, Fast and Slow. Macmillan, New York (2011) 65. Todd, P.M., Gigerenzer, G.: Précis of “simple heuristics that make us smart.” Behav. Brain Sci. 23, 727–741 (2000) 66. Goldstein, D.G., Gigerenzer, G.: Models of ecological rationality: the recognition heuristic. Psychol. Rev. 109, 75 (2002) 67. Tenenbaum, J.B., Kemp, C., Griffiths, T.L., Goodman, N.D.: How to grow a mind: statistics, structure, and abstraction. Science 331, 1279–1285 (2011) 68. Gurnani, A.P., Lewis, K.: Using bounded rationality to improve decentralized design. AIAA J. 46, 3049–3059 (2008) 69. Gobet, F., Simon, H.A.: The roles of recognition processes and look-ahead search in timeconstrained expert problem solving: evidence from grand-master-level chess. Psychol. Sci. 7, 52–55 (1996) 70. Todd, P.M., Gigerenzer, G.: Environments that make us smart: ecological rationality. Curr. Dir. Psychol. Sci. 16, 167–171 (2007) 71. Wolpert, D.H., Macready, W.G: No free lunch theorems for search. Technical Report SFITR-95-02-010, Santa Fe Institute (1995) 72. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Trans. Evol. Comput. 1, 67–82 (1997) 73. Griffiths, T.L., Callaway, F., Chang, M.B.: Doing more with less: meta-reasoning and metalearning in humans and machines. Curr. Opin. Behav. Sci. 29, 24–30 (2019)
760
R. Sala
74. Elms, D.G., Brown, C.B.: Intuitive decisions and heuristics–an alternative rationality. Civ. Eng. Environ. Syst. 30, 274–284 (2013) 75. Young, M.T.: Heuristics and human judgment: what we can learn about scientific discovery from the study of engineering design. Topoi 39(4), 987–995 (2018). https://doi.org/10.1007/ s11245-018-9550-8 76. Cross, N.: Engineering Design Methods: Strategies for Product Design. John Wiley & Sons, New York (2021) 77. Cash, P.J.: Developing theory-driven design research. Des. Stud. 56, 84–119 (2018) 78. Dixon, J.R.: On research methodology towards a scientific theory of engineering design. Ai Edam 1, 145–157 (1987) 79. Simon, H.A.: The Sciences of the Artificial. MIT Press, Cambridge (2019) 80. Braha, D., Maimon, O.: A Mathematical Theory of Design: Foundations, Algorithms and Applications. Springer, Berlin (1998). https://doi.org/10.1007/978-1-4757-2872-9 81. Antonsson, E.K., Cagan, J.: Formal Engineering Design Synthesis. Cambridge University Press, Cambridge (2005) 82. Bahrami, A., Dagli, C.H.: Models of design processes. In: Parsaei, H.R., Sullivan, W.G. (eds.) Concurrent Engineering, pp. 113–126. Springer, Boston (1993). https://doi.org/10. 1007/978-1-4615-3062-6_7 83. Degrave, J., Hermans, M., Dambre, J.: A differentiable physics engine for deep learning in robotics. Front. Neurorobot. 13, 6 (2019) 84. Sebastian, B., Ben-Tzvi, P.: Physics based path planning for autonomous tracked vehicle in challenging terrain. J. Intell. Robot. Syst. 95, 511–526 (2019) 85. Raissi, M., Karniadakis, G.E.: Hidden physics models: machine learning of nonlinear partial differential equations. J. Comput. Phys. 357, 125–141 (2018) 86. Wu, T., Tegmark, M.: Toward an artificial intelligence physicist for unsupervised learning. Phys. Rev. E 100, 033311 (2019) 87. Oberkampf, W.L., Trucano, T.G., Hirsch, C.: Verification, validation, and predictive capability in computational engineering and physics. Appl. Mech. Rev. 57, 345–384 (2004) 88. 
Schefzik, R., Thorarinsdottir, T.L., Gneiting, T.: Uncertainty quantification in complex simulation models using ensemble copula coupling. Stat. Sci. 28, 616–640 (2013) 89. Oh, S., Jung, Y., Kim, S.: Deep generative design: integration of topology optimization and generative models. ASME. J. Mech. Des. 141(11), 111405 (2019). https://doi.org/10.1115/ 1.4044229 90. Bujny, M., Aulig, N., Olhofer, M., Duddeck, F.: Learning-based topology variation in evolutionary level set topology optimization. In: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 825–832 (2018) 91. Koza, J.R., Bennett, F.H., Andre, D., Kean, M.A.: Automated design of both the topology and sizing of analog electrical circuits using genetic programming. In: Gero, J.S., Sudweeks, F. (eds.) Artificial Intelligence in Design 1996, pp. 151–170. Springer, Dordrecht (1996). https://doi.org/10.1007/978-94-009-0279-4_9 92. Javaheripi, M., Samragh, M., Koushanfar, F.: Peeking into the black box: a tutorial on automated design optimization and parameter search. IEEE Solid-State Circuits Mag. 11, 23–28 (2019) 93. Annunziata, L., Menapace, M., Tacchella, A.: Computer Intensive vs. Heuristic Methods. In: Automated Design of Elevator Systems. In: ECMS, pp. 543–549 (2017) 94. Gadkari, S., Gu, S., Sadhukhan, J.: Towards automated design of bioelectrochemical systems: a comprehensive review of mathematical models. Chem. Eng. J. 343, 303–316 (2018) 95. Picard, C., Schiffmann, J.: Automated design tool for automotive control actuators. In: International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, p. V11BT11A027. American Society of Mechanical Engineerss (2020)
Computational Rational Engineering and Development
761
96. Kieninger, D., Hemsen, J., Köller, S., Uerlich, R.: Automated design and optimization of transmissions for electric vehicles. MTZ Worldwide 80, 88–93 (2019) 97. Shahriari, B., et al.: Taking the human out of the loop: a review of Bayesian optimization. Proc. IEEE 104, 148–175 (2015) 98. Mersmann, O., et al.: Exploratory landscape analysis. In: Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation, pp. 829–836 (2011) 99. Kerschke, P., Trautmann, H.: Automated algorithm selection on continuous black-box problems by combining exploratory landscape analysis and machine learning. Evol. Comput. 27, 99–127 (2019) 100. Sala, R., Baldanzini, N., Pierini, M.: Representative surrogate problems as test functions for expensive simulators in multidisciplinary design optimization of vehicle structures. Struct. Multidiscip. Optim. 54(3), 449–468 (2016). https://doi.org/10.1007/s00158-016-1410-9 101. Muñoz, M.A., Kirley, M., Smith-Miles, K.: Analyzing randomness effects on the reliability of exploratory landscape analysis. Nat. Comput. 1–24 (2021). https://doi.org/10.1007/s11 047-021-09847-1 102. Sala, R., Baldanzini, N., Pierini, M.: Global optimization test problems based on random field composition. Optim. Lett. 11(4), 699–713 (2016). https://doi.org/10.1007/s11590-0161037-1 103. Bartz-Beielstein, T., et al.: Benchmarking in optimization: best practice and open issues. arXiv preprint arXiv:200703488 (2020) 104. Sala, R., Müller, R.: Benchmarking for metaheuristic black-box optimization: perspectives and open challenges. In: 2020 IEEE Congress on Evolutionary Computation (CEC). pp. 1–8. IEEE (2020) 105. Rice, J.R.: The algorithm selection problem. Adv. Comput. 15, 5 (1976) 106. Roughgarden, T.: Beyond worst-case analysis. Commun. ACM 62, 88–96 (2019) 107. Muñoz, M.A., Sun, Y., Kirley, M., Halgamuge, S.K.: Algorithm selection for black-box continuous optimization problems: a survey on methods and challenges. Inf. Sci. 317, 224– 245 (2015) 108. 
Golovin, D., et al.: Google vizier: a service for black-box optimization. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495 (2017) 109. Vanaret, C., Gallard, F., Martins, J.: On the consequences of the “No Free Lunch” theorem for optimization on the choice of an appropriate MDO architecture. In: 18th AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference, p. 3148 (2017) 110. Steward, D.V.: The design structure system: a method for managing the design of complex systems. IEEE Trans. Eng. Manag. EM-28(3), 71–74 (1981) 111. Yassine, A., Braha, D.: Complex concurrent engineering and the design structure matrix method. Concurr. Eng. 11, 165–176 (2003) 112. Yassine, A.A.: Managing the development of complex product systems: an integrative literature review. IEEE Trans. Eng. Manag. (2019) 113. Wymore, A.W.: Model-Based Systems Engineering. CRC Press, Boca Raton (1993) 114. Madni, A.M., Sievers, M.: Model-based systems engineering: motivation, current status, and research opportunities. Syst. Eng. 21, 172–190 (2018) 115. Schluse, M., Priggemeyer, M., Atorf, L., Rossmann, J.: Experimentable digital twins— streamlining simulation-based systems engineering for industry 4.0. IEEE Trans. Ind. Inf. 14, 1722–1731 (2018) 116. Lee, J.H., Shin, J., Realff, M.J.: Machine learning: overview of the recent progresses and implications for the process systems engineering field. Comput. Chem. Eng. 114, 111–121 (2018) 117. Bezdek, J.C.: (Computational) intelligence: what’s in a name? IEEE Syst. Manag. Cybern. Mag. 2, 4–14 (2016)
762
R. Sala
118. Branke, J., Nguyen, S., Pickardt, C.W., Zhang, M.: Automated design of production scheduling heuristics: a review. IEEE Trans. Evol. Comput. 20, 110–124 (2015) 119. Stützle, T., López-Ibáñez, M.: Automated design of metaheuristic algorithms. In: Gendreau, M., Potvin, J.Y. (eds.) Handbook of Metaheuristics. International Series in Operations Research & Management Science, vol. 272, pp. 541–579. Springer, Cham (2019). https:// doi.org/10.1007/978-3-319-91086-4_17 120. Geng, Z., Wang, Y.: Automated design of a convolutional neural network with multi-scale filters for cost-efficient seismic data classification. Nat. Commun. 11, 1–11 (2020) 121. Böhland, M., et al.: Automated design process for hybrid regression modeling with a oneclass SVM. Automatisierungstechnik 67, 843–852 (2019) 122. Dorri, A., Kanhere, S.S., Jurdak, R.: Multi-agent systems: a survey. IEEE Access 6, 28573– 28593 (2018) 123. Herrera, M., Pérez-Hernández, M., Kumar Parlikad, A., Izquierdo, J.: Multi-agent systems and complex networks: review and applications in systems engineering. Processes 8, 312 (2020) 124. Mascardi, V., et al.: Engineering multi-agent systems: state of affairs and the road ahead. ACM SIGSOFT Softw. Eng. Notes 44, 18–28 (2019) 125. Minsky, M.: Society of Mind. Simon and Schuster, New York (1988) 126. DeLoach, S.A., Wood, M.F., Sparkman, C.H.: Multiagent systems engineering. Int. J. Softw. Eng. Knowl. Eng. 11, 231–258 (2001) 127. Zawadzki, P.: Methodology of KBE system development for automated design of multivariant products. In: Hamrol, A., Ciszak, O., Legutko, S., Jurczyk, M. (eds.) Advances in Manufacturing. LNME, pp. 239–248. Springer, Cham (2018). https://doi.org/10.1007/9783-319-68619-6_23 128. Wu, X., et al.: Knowledge engineering with big data. IEEE Intell. Syst. 30, 46–55 (2015) 129. Cheney, N., Clune, J., Lipson, H.: Evolved electrophysiological soft robots. In: Artificial Life Conference Proceedings, vol. 14, pp. 222–229. MIT Press (2014) 130. 
Zhao, A., et al.: RoboGrammar: graph grammar for terrain-optimized robot design. ACM Trans. Graph. 39, 1–16 (2020) 131. Schulz, A., et al.: Interactive robogami: an end-to-end system for design of robots with ground locomotion. Int. J. Robot. Res. 36, 1131–1147 (2017) 132. Balu, A., et al.: A deep learning framework for design and analysis of surgical bioprosthetic heart valves. Sci. Rep. 9, 1–12 (2019) 133. So, S., Mun, J., Rho, J.: Simultaneous inverse design of materials and structures via deep learning: demonstration of dipole resonance engineering using core–shell nanoparticles. ACS Appl. Mater. Interfaces. 11, 24264–24268 (2019) 134. Shao, K., et al.: A survey of deep reinforcement learning in video games. arXiv preprint arXiv:191210944 (2019) 135. Kiumarsi, B., Vamvoudakis, K.G., Modares, H., Lewis, F.L.: Optimal and autonomous control using reinforcement learning: a survey. IEEE Trans. Neural Netw. Learn. Syst. 29, 2042–2062 (2017) 136. Popova, M., Isayev, O., Tropsha, A.: Deep reinforcement learning for de novo drug design. Sci. Adv. 4, eaap7885 (2018) 137. Settaluri, K., et al.: Autockt: deep reinforcement learning of analog circuit designs. In: 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 490–495. IEEE (2020) 138. Arulkumaran, K., Deisenroth, M.P., Brundage, M., Bharath, A.A.: Deep reinforcement learning: a brief survey. IEEE Signal Process. Mag. 34, 26–38 (2017) 139. Al-Emran, M.: Hierarchical reinforcement learning: a survey. Int. J. Comput. Digit. Syst. 4(2), 137–143 (2015)
Computational Rational Engineering and Development
763
140. Sala, R., Baldanzini, N., Pierini, M.: SQG-Differential Evolution for difficult optimization problems under a tight function evaluation budget. In: Nicosia, G., Pardalos, P., Giuffrida, G., Umeton, R. (eds.) MOD 2017. LNCS, vol. 10710, pp. 322–336. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-72926-8_27 141. Ojha, V., Abraham, A., Snášel, V.: Heuristic design of fuzzy inference systems: a review of three decades of research. Eng. Appl. Artif. Intell. 85, 845–864 (2019) 142. Dafoe, A., et al.: Open problems in cooperative AI. arXiv preprint arXiv:201208630 (2020) 143. Schölkopf, B., et al.: Toward causal representation learning. In: Proceedings of the IEEE (2021) 144. Amodei, D., et al.: Concrete problems in AI safety. arXiv preprint arXiv:160606565 (2016) 145. Russell, S.: Human Compatible: Artificial Intelligence and the Problem of Control. Penguin, New York (2019)
QPSetter: An Artificial Intelligence-Based Web Enabled, Personalized Service Application for Educators
Mohammad Ali Kadampur1(B) and Sulaiman Al Riyaee2
1 Department of Electrical Engineering, College of Engineering, Imam Mohammad Ibn Saud Islamic University, Riyadh, Kingdom of Saudi Arabia [email protected]
2 Department of Information Management, College of Computer and Information Sciences, Imam Mohammad Ibn Saud Islamic University, Riyadh, Kingdom of Saudi Arabia [email protected]
Abstract. Setting a question paper is an integral activity of the teaching-learning process, especially in formal education systems such as schools, colleges and universities. At times it becomes a tricky and annoying task for the tutor to turn away from mainstream teaching and spend time thinking about setting a question paper on the intended topic. This paper presents an Artificial Intelligence powered, web enabled, personalized service application that lets educators automatically set up a question paper on a selected topic or syllabus. The application allows the setter to choose the type of questions, the type of headers, the number of allotted marks, etc., and at the click of a button the question paper is displayed in the desired format. The application silently scrapes relevant web content in the background and creates a database of questions with topics, weights, and levels of difficulty. It also allows the user to enter questions manually and integrates them into the database. An answer database module couples the correct answers with the questions. The AI component of the application works on text mining and content classification, and it also assists by suggesting questions on supplied text content. The application offers a simple login through which the user can explore QPSetter for personal use. It is useful for tutors at all levels who want to set question papers automatically, and it helps preserve the confidentiality and integrity of question papers while speeding up their preparation.
Keywords: Question paper · Artificial intelligence · Web intelligence · Web scraping · Text mining · Deep learning · NoSQL databases · Education · Teacher · Student
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 764–777, 2022. https://doi.org/10.1007/978-3-030-82193-7_51
1 Introduction
A “question” is basically a request for information. The art of requesting information is central to evaluation in formal education systems. Improvements in educational psychology, standardization, accreditation guidelines and similar factors make the act of asking a question trickier than ever before. Modern question-paper-based evaluation includes different categories of questions on a given topic: multiple choice questions, short answer questions, descriptive answer questions, and questions that test knowledge (memorization), analytical ability, problem solving ability, application ability, etc. A typical NQF (National Qualifications Framework) domain observes learning outcomes in categories such as knowledge domain skills, cognitive skills, interpersonal and responsibility skills, communication, information technology and numerical skills, and psychomotor skills. In order to assess a learner in each of these categories, a carefully planned question paper needs to be set up. At this juncture, the tutor is perplexed as to whether to focus on planning content delivery or on planning a question paper. This kind of deliberation consumes a lot of the tutor's quality time. On top of that, keeping the confidentiality, integrity and authenticity of the question paper in examinations adds another level of difficulty for the educator.
In this paper, an Information and Communication Technology (ICT) based solution is presented that addresses the issue of setting up a question paper in formal education systems. The paper discusses system design issues, architectural issues and issues related to the features offered by the application. The research is unique in its application of state-of-the-art technologies and its use of a document-oriented database. The implementation explores AI features in text mining. The simple-looking random data extraction and presentation is designed to work as an intelligent system.
The paper is organized as follows: Sect. 2 presents the motivation for this work. Section 3 covers related work and a brief literature review. Section 4 provides a detailed account of the system architecture, with sub-sections on the scraper module, the educational artificial intelligence (EAI) module, the Q-Adder, the UI, and the database module. Section 5 presents the conclusions.
2 Motivation
Traditionally, educators (teachers) spend substantial time on the tedious proposition process, i.e., the setting of questions [8]. If the time spent on this process is reduced, the teacher can devote more time to teaching and research, thereby improving teaching efficiency. Therefore, separating “teaching” from “testing” is important. On the technology front, the Internet has brought rich resources for learning. A wide variety of competitive examination questions and patterns for every course are readily available online. It is important to set questions with these globally standard resources in mind. Storing and managing these resources for testing would provide a competitive edge in the learning process. Requirements such as setting the question paper quickly, maintaining an up-to-date question bank, and storing and managing questions efficiently are the driving factors for attempting such an application. Finally, satisfying the CIA (Confidentiality, Integrity, and Authenticity) requirements of a test paper is a further motivating factor for this research.
3 Related Work
There are quite a number of applications for automatic question paper generation [1]. Some were carried out as graduate projects [11], some are offered as online web services [2, 9], and a few are research works [3]. Each of these works leaves room for improvement in terms of the data structures used for implementation, the services offered, the intelligence in the system, the degree of automation, and alignment with educational psychology (Bloom's levels, etc.). None of the earlier implementations has explored a document database or utilized its scalability and schema-free features.
4 System Architecture
The high-level structure of the software system is as shown in Fig. 1. The important components are:
• Scraper module: Scrapes the URLs and creates data dumps.
• EAI module: Educational Artificial Intelligence module. It currently uses deep learning algorithms for education-related text mining.
• Q-Adder module: This is a legacy interface for inserting questions into the database.
• AQG module: Alternative Question Generator. It rephrases the selected questions and adjusts the numeric values to provide an alternative question.
• QP-Set module: This module is available for setting up the question paper. It has built-in standard templates and provisions for new template design.
• Database: It basically stores the questions in the form of documents (MongoDB style) [4].
Fig. 1. QP setter system architecture
4.1 Scraper Module
The objective of the scraper module is to enable automatic crawling of the target websites. When set active, the scraper component crawls the DOM (Document Object Model) tree (Fig. 2) of the target web pages and extracts the text available in the HTML tags. The extracted data is saved as a CSV file, which is cleaned and loaded into the MongoDB database in subsequent processing stages. In this application, the scraper focuses on websites containing question papers, item banks, and other pedagogical material. The data then exists in the MongoDB database in the form of the document object model [4]. The scraper is built with Beautiful Soup, a Python parsing library [9]. It is designed to address mainly two important dimensions of the Web, XPath and URI: semantic scraping defines a model that maps HTML fragments to semantic web resources. By using this model to define the mapping of a set of web resources, the data from the web is made available as a knowledge base to scraping services.
Fig. 2. The web scraper schematics
4.2 Educational Artificial Intelligence (EAI) Module
The EAI module has deep learning based text-mining implementations that assist in intelligent search/selection and classification tasks. The EAI component is scalable with respect to the addition of new algorithms in different categories. It provides options for algorithm selection such as GA-based intelligent QP, random-selection-based QP, learner-centric incrementally growing QP, etc. The EAI component has a text wizard to read and comprehend the content; based on this comprehension, it auto-generates intelligent questions. The process of AI-assisted question paper generation is shown in Fig. 3.
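The questions that the EAI module works on come from the scraper of Sect. 4.1. As a rough illustration of that extraction step (walk the HTML, collect the text inside selected tags, dump it as CSV), the sketch below uses Python's standard-library `html.parser` instead of Beautiful Soup so that it runs without extra packages. The class name `QuestionScraper`, the choice of `<li>` as the tag holding a question, the sample page and the CSV column name are all assumptions made for this example, not details taken from the application.

```python
import csv
import io
from html.parser import HTMLParser


class QuestionScraper(HTMLParser):
    """Collect the text found inside selected HTML tags (hypothetical helper)."""

    def __init__(self, wanted_tags=("li",)):
        super().__init__()
        self.wanted = set(wanted_tags)
        self.depth = 0      # > 0 while we are inside a wanted tag
        self.buffer = []    # text pieces of the element being read
        self.rows = []      # one cleaned text string per element

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.wanted and self.depth:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace before storing the text.
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.rows.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def to_csv(rows):
    """Serialise the scraped strings as a one-column CSV document."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["question"])  # assumed column name
    for row in rows:
        writer.writerow([row])
    return out.getvalue()


# Made-up page standing in for an online item bank.
SAMPLE = """<html><body><h1>Item bank</h1><ol>
<li>What is a   DOM tree?</li>
<li>Define <b>XPath</b>.</li>
</ol></body></html>"""

scraper = QuestionScraper()
scraper.feed(SAMPLE)
CSV_TEXT = to_csv(scraper.rows)
print(CSV_TEXT)
```

In a real deployment the CSV would still need the cleaning pass mentioned in Sect. 4.1 before being loaded into the database.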
Fig. 3. The question paper setting process of EAI module
Intelligent QP Selection. The EAI component has several options for selecting questions from the database. The principle of intelligent selection is the evolutionary process of natural selection, in which the fittest survive. The user requirements, such as the type of questions (MCQs, short answer types, essay types, etc.), the NQF domain, whether GATE (Graduate Aptitude Test in Engineering) related questions are wanted, etc., are collected and quantified to form a chromosome (a gene strip).
Example: A chromosome for a multiple choice question (MCQ) that is a GATE exam question and a difficult question (DL = 1) would look like Fig. 4. Note that the corresponding CR (criterion) bits are set to 1.
Fig. 4. An example chromosome
The simplest mathematical model in this case would look like:
$$w_1 CR_1 + w_4 CR_4 + w_{10} CR_{10} = K \tag{1}$$
where $K$ is a constant and $w_i$ is the weight associated with each question under criterion $CR_i$. The question extraction system allows several choices for extraction, most importantly:
• Draw, Verify & Drop (DVD) method.
• Multi-Objective Evolutionary Algorithm (MOEA) method [7].
In the DVD method, questions are drawn from the database and matched against the mathematical model; if a question matches, it is selected, otherwise it is dropped, and extraction continues. For example, the questions with IDs QID001, QID786 and QID108 are selected (see Fig. 5). The final question paper is produced from the selected pool of questions according to the user's criteria and limits.
Fig. 5. Extraction of questions from the question population
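The Draw, Verify & Drop loop can be sketched in a few lines. This is only an illustration under stated assumptions: the criterion indices 1, 4 and 10 are taken from Eq. (1), while the unity weights, the in-memory `POOL` dictionary standing in for the database, the non-matching question IDs, and the function names are all hypothetical.

```python
import math
import random

# Required criteria and their weights, following Eq. (1): criteria 1, 4 and 10.
# Unity weights are an assumption made for this illustration.
REQUIRED_WEIGHTS = {1: 1.0, 4: 1.0, 10: 1.0}
K = sum(REQUIRED_WEIGHTS.values())  # target value of the weighted sum


def matches_model(criterion_bits):
    """True if the question's CR bits give w1*CR1 + w4*CR4 + w10*CR10 = K."""
    total = sum(w * criterion_bits.get(i, 0) for i, w in REQUIRED_WEIGHTS.items())
    return math.isclose(total, K)


def draw_verify_drop(pool, wanted, rng):
    """Draw questions in random order, keep those that match, drop the rest."""
    remaining = list(pool)
    rng.shuffle(remaining)
    selected = []
    while remaining and len(selected) < wanted:
        qid = remaining.pop()          # draw
        if matches_model(pool[qid]):   # verify
            selected.append(qid)       # select
        # otherwise the question is simply dropped
    return selected


# Hypothetical pool: question ID -> {criterion index: bit}.
POOL = {
    "QID001": {1: 1, 4: 1, 10: 1},
    "QID042": {1: 1, 4: 0, 10: 1},
    "QID108": {1: 1, 4: 1, 10: 1},
    "QID555": {2: 1, 4: 1, 10: 0},
    "QID786": {1: 1, 4: 1, 10: 1},
}

chosen = draw_verify_drop(POOL, wanted=3, rng=random.Random(7))
print(sorted(chosen))  # ['QID001', 'QID108', 'QID786']
```

Because only three questions in this toy pool satisfy the model, the same three IDs are returned whatever the random draw order is; with a larger pool the `wanted` limit would cut the draw short.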
In the MOEA (Multi-Objective Evolutionary Algorithm) based method of extraction [7], a Pareto front chart [7] is drawn. For each chromosome (question), the chart marks the r objectives that the question meets out of the s required objectives, which gives the objective score of the question. In the selection process, only those questions that are in accordance with the Pareto principle are selected. For simplicity of illustration, consider a pool of only four questions with unity weights ($w_i = 1$), meeting the required criteria as below:
The objective scores are calculated using equations such as the following:
$$\frac{\sum_{i \in Q} w_i \, CR_i}{\sum_{i=1}^{k} w_i \, CR_i} \tag{2}$$
The computed values for the illustrative example are shown in Table 1.
Table 1. The objective scores for each question entered
Question ID    Criterion ratio    Objective score
QID45          3/4                0.75
QID99          2/4                0.50
QID27          1/4                0.25
QID75          3/4                0.75
The Pareto chart for the above data is shown in Fig. 6.
Fig. 6. Sample Pareto graph for the data of Table 1
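The scores of Table 1 and a Pareto-style filter can be reproduced with a small sketch. Two things here are assumptions: each required criterion is treated as a separate objective, and the per-question bit patterns are invented solely so that the criterion ratios agree with Table 1 (the patterns themselves are not given in the text). The DEAP framework is not used in this sketch; plain Python is enough to show the idea.

```python
def objective_score(bits, weights):
    """Weighted share of the required criteria that a question meets."""
    met = sum(w * b for w, b in zip(weights, bits))
    return met / sum(weights)


def dominates(a, b):
    """True if a is at least as good as b on every criterion and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(questions):
    """IDs of the questions that no other question dominates."""
    return sorted(
        qid
        for qid, bits in questions.items()
        if not any(dominates(other, bits) for other in questions.values())
    )


WEIGHTS = [1.0, 1.0, 1.0, 1.0]  # unity weights, k = 4 required criteria

# Hypothetical CR bit patterns, chosen only so that the ratios match Table 1.
QUESTIONS = {
    "QID45": [1, 1, 1, 0],
    "QID99": [1, 0, 1, 0],
    "QID27": [0, 0, 0, 1],
    "QID75": [0, 1, 1, 1],
}

SCORES = {qid: objective_score(bits, WEIGHTS) for qid, bits in QUESTIONS.items()}
SELECTED = pareto_front(QUESTIONS)
print(SCORES)    # {'QID45': 0.75, 'QID99': 0.5, 'QID27': 0.25, 'QID75': 0.75}
print(SELECTED)  # ['QID45', 'QID75']
```

With these patterns the non-dominated questions are exactly the two with the highest objective score; for other bit patterns the two notions can differ, which is why the dominance test is written out explicitly.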
The data that falls on the Pareto area is selected. That means, the system will select questions QID45 and QID75.A DEAP (Distributed Evolutionary Algorithm in Python) framework is used for the implementation. The model question paper is also used as a training data set for generating new question papers with the updated data into the database. It is a continuously evolving process of learning to produce an automatic question paper as per the user requirements. The system-generated question paper is then edited by human intervention if required. The simplest method of question extraction from the database is by random selection. Alternative Question Generator (AQG). The AQG module of EAI component provides the option for generating an alternative question mimicking the original selected question. For Example, if the selected question is: Two cars start at the same time from the same location and go in the same direction. The speed of the first car is 80 km/h and the speed of the second car is 90 km/h. What is the number of hours it takes for the distance between the two cars to be 25 km?
M. A. Kadampur and S. Al Riyaee
Note that in the alternative question the numerical values have been changed while retaining the original semantics of the question. In the application, if the AQG option is selected, the system automatically generates alternative questions for all numerical questions in the final question paper. The AQG module uses a Python text parser, and a text re-phrasing logic is implemented to add AQG intelligence to the system. The command for installing the text parser on Windows is shown below:
    > pip install textparser
The prototype program for text parsing is shown below:
    import textparser
    from textparser import Sequence

    class Parser(textparser.Parser):

        def token_specs(self):
            # (token kind, optional literal value, regular expression)
            return [
                ('SKIP', r'[ \r\n\t]+'),
                ('WORD', r'\w+'),
                ('EMARK', '!', r'!'),
                ('COMMA', ',', r','),
                ('MISMATCH', r'.')
            ]

        def grammar(self):
            # A sentence of the form: WORD , WORD !
            return Sequence('WORD', ',', 'WORD', '!')

    tree = Parser().parse('20, cars!')
    print('Tree:', tree)
It tokenizes the given text (question) and returns the words in the sentence as the parse tree for the sentence. Example:
    Tree: ['20', ',', 'cars', '!']
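The numeric substitution step of the AQG (collect the values in order of occurrence, record how they are ordered, then draw replacements from the range (Value_i, Value_i + Avg)) can be illustrated with the standard library alone. This is a hedged sketch rather than the authors' full prototype. It uses a regular expression instead of textparser and plain rejection sampling to keep the ordering, and it draws integers only. The function names are hypothetical.

```python
import math
import random
import re
from itertools import combinations

NUMBER = re.compile(r"\d+(?:\.\d+)?")  # integers or decimals in the question text


def same_ordering(old, new):
    """True if every pair of values keeps its <, =, > relation."""
    return all(
        (a < b) == (x < y) and (a > b) == (x > y)
        for (a, x), (b, y) in combinations(list(zip(old, new)), 2)
    )


def alternative_question(text, rng=None, max_tries=1000):
    """Replace each number by a random integer in (value, value + avg)."""
    rng = rng or random.Random()
    values = [float(token) for token in NUMBER.findall(text)]
    if not values:
        return text
    avg = sum(values) / len(values)
    for _ in range(max_tries):
        candidate = []
        for value in values:
            low = math.floor(value) + 1                    # smallest integer above value
            high = max(low, math.ceil(value + avg) - 1)    # largest integer below value + avg
            candidate.append(rng.randint(low, high))
        if same_ordering(values, candidate):               # keep the recorded association
            replacements = iter(candidate)
            return NUMBER.sub(lambda match: str(next(replacements)), text)
    return text  # no consistent draw found; keep the original wording


if __name__ == "__main__":
    question = ("The speed of the first car is 80 km/h and the speed of the "
                "second car is 90 km/h. What is the number of hours it takes "
                "for the distance between the two cars to be 25 km?")
    print(alternative_question(question))
```

Rejection sampling is the simplest way to honour the association rules; a production version would more likely narrow each range directly so that no draws are wasted.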
The full version of this prototype code collects the numeric values in the order of their occurrence in the question and records simple associations among them. For example, in the original question above, if we take Value1 = 80 km/h, Value2 = 90 km/h, and Value3 = 25 km, then the simple association is Value3 < Value1 < Value2. Random numbers are then generated in a suitable range, consistent with this association. For range fixation, a simple logic based on numeric averages is used: if Avg is the average of all numeric values, then the range of Valuei is (Valuei, Valuei + Avg), constrained by the association rules among the values.
4.3 The Database
In the database design, a “question” is considered as a document. A question has a text body and implicit attributes associated with it. For example, is it a Multiple Choice
Question (MCQ)? or is it a difficult question? or did this question appear in GATE examinations? etc. The number of attributes for each question also keeps varying. For these reasons, it is not a good idea to store a text description in structured, legacy databases. A database that uses a document model for data processing is better suited to the application. MongoDB, with its distributed document processing model, horizontal scalability, high processing speed, schema-free database structure, and ease of interfacing with Python, is a natural choice of database for this research. Creating a questions table (collection), inserting a record (a question) as “key: value” pairs, and displaying the JSON-like document object are shown below:
    > use questions
    switched to db questions
    > db
    questions
    > db.questions.insert({QID45: 'What is Blackhole?', difficulty: "High", GATEQ: 'No', Type: 'Descriptive'})
    WriteResult({ "nInserted" : 1 })
The output of the find() function for the record document appears as below:
    > db.questions.find()
    { "_id" : ObjectId("5c72f29487a78117b5d873bd"), "QID45" : "What is Blackhole?", "difficulty" : "High", "GATEQ" : "No", "Type" : "Descriptive" }
    >
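The same insert-and-find round trip can be expressed from Python. The sketch below is illustrative only: the field names qid and text, and both helper functions, are assumptions rather than the paper's schema. The helpers rely only on the insert_one and find methods that a PyMongo collection provides, so they can also be exercised with any object offering those two methods.

```python
def make_question_doc(qid, text, **attributes):
    """Build a schema-free question document; every extra keyword becomes a key."""
    doc = {"qid": qid, "text": text}
    doc.update(attributes)
    return doc


def save_and_find(collection, doc, query):
    """Insert one document, then return every document matching the query."""
    collection.insert_one(doc)
    return list(collection.find(query))


# Usage against a running MongoDB server through PyMongo (not executed here;
# the connection string and the database/collection names are assumptions):
#
#   from pymongo import MongoClient
#   questions = MongoClient("mongodb://localhost:27017")["questions"]["questions"]
#   doc = make_question_doc("QID45", "What is a black hole?",
#                           difficulty="High", GATEQ="No", Type="Descriptive")
#   print(save_and_find(questions, doc, {"difficulty": "High"}))
```

Storing the question ID under a fixed key, rather than using the ID itself as the key as in the shell session above, is a design choice of this sketch: it lets a single query such as {"qid": "QID45"} locate any question.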
The QPSetter application, with its EAI components, and the MongoDB database, along with the corresponding interfaces, fit into the architecture as shown in Fig. 7. “Bottle” is a Python web framework [6], inside which the application resides. PyMongo is the driver that connects the application to the MongoDB database. The Mongo shell offers a command-line interface to MongoDB for database management activities. Images, audio files, etc. are added to the MongoDB database with ease and flexibility through its GridFS specification. Another important feature of MongoDB is that an index can be applied to anything inside the document (for example, keys, tags, or values), which makes query processing faster. The application also provides an option to add an “Answer” to a question with proper tagging and correspondence to the question. The authenticity and validity of the question and answer are linked to the author of the question, and the “key: value” pair is recorded in the question document. The literature on MongoDB-based project implementation is drawn from [4, 6, 10].
Fig. 7. The QPSetter and MongoDB interface
4.4 The Q-Adder
This is a personalized, user-specific component that facilitates the offline addition of questions and their corresponding answers. Using this module, the user can log in and create, read, update, and delete (CRUD) questions. The author can enter a question and its attributes, such as: Is it a multiple-choice question? Is it a difficult question? Is it a question taken from a specific previous examination? The general format of a question document is:
As there is no fixed schema for the database, the user is free to add any number of attributes to a question. Each question is treated as a MongoDB document. Each data item inside the document resides as a “key: value” pair. Any new attribute can be added by specifying a key for that attribute. During query processing, the entire database is searched using these key-value pairs. Images and figures are added through the GridFS framework. The module also accepts answers to a specific question by properly tagging them with the corresponding question.
4.5 The User Interface
The user interface has a login page (Fig. 8) followed by several interactive pages, one for each high-level structure of the software (Fig. 9). The UI is kept simple by using Bottle, a minimal-dependency web framework for Python.
The UI Dataflow. The UI provides an easy and interactive interface to the system. The data flow is shown in Fig. 10.
Fig. 8. The QPSetter UI for registration
Fig. 9. A QPSetter screenshot of inserting a question
Initially, the user login is authenticated. After a valid login, the system directs the user to task selection. The user can set the system to any of three task modes:
• Scrape
• Add Question
• Generate Question
Fig. 10. The UI data flow
The scrape task enables the scraping of URLs for relevant data collections, particularly question paper websites. The scraper crawls the websites and creates a CSV dump (Soup), which is then cleaned and converted into proper documents that are stored in the MongoDB database as questions.
The add question task enables the Q-Adder component, and the user can add new questions in a schema-free format directly into the database. The Q-Adder interface runs in a continuous loop, accepting questions and their attributes (keys) until a terminate command is issued. This module also enables the user to add answers to the corresponding questions. Q-Adder is an offline facility that allows the user to add questions as and when desired.
The question paper generator component provides many options for template selection, selection of question types, and any new constraint on question selection. This module also provides options for selecting and searching for questions in the database.
Templates and Annotations. The system provides certain standard templates of question papers based on Bloom's taxonomy [5], University/Board patterns, purpose of testing, etc. The system also allows user-specific customized templates to be generated. Any new template can be saved as an appended asset in the template library.
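Returning to the scrape task, the step that turns the CSV dump into cleaned question documents can be sketched with the standard csv module. This is an illustration under stated assumptions, not the paper's scraper: the column name question and the cleaning rules (trim whitespace, drop empty fields, skip rows with no question text) are hypothetical.

```python
import csv
import io


def rows_to_documents(csv_text):
    """Convert a scraped CSV dump into cleaned, schema-free question documents."""
    documents = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Trim whitespace and drop empty fields, so that each document
        # carries only the keys it actually has a value for.
        doc = {key.strip(): value.strip()
               for key, value in row.items()
               if key and isinstance(value, str) and value.strip()}
        if doc.get("question"):  # skip rows without a question body
            documents.append(doc)
    return documents


if __name__ == "__main__":
    dump = ("question,difficulty,source\n"
            " What is a black hole? ,High,\n"
            ",Low,GATE\n"
            "Define entropy.,,GATE 2019\n")
    for document in rows_to_documents(dump):
        print(document)
```

Because empty fields are dropped rather than stored as blanks, the resulting dictionaries vary in their keys from row to row, which matches the schema-free storage described for the database.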
5 Conclusions
In this work, the system design features of an artificial intelligence [10] based, web-enabled question paper setter are presented. The paper utilizes the advantages of current state-of-the-art technologies in text mining and machine learning to improve the implementation strategies of a useful application in the education system. The system design approach uses a schema-free database management system and thereby offers the
flexibility of defining any type of question in the database. The system becomes horizontally scalable through the use of a database like MongoDB. The automatic question generator (rephrase) feature modifies existing questions without affecting their semantics. The question extraction system offers several options for selecting questions, including the multi-objective optimized approach. The web scraper module in the architecture allows automatic extraction of content relevant to question generation and is unique to such an application. Future work will involve incrementally adding more intelligent implementations to the scalable architecture of the system.
References
1. Generating questions and distractors automatically from multimedia. https://github.com/nar ain280493/. Automatic-Question-Generation. Accessed 28 Apr 2021
2. Kipsqpg. www.kipsqpg.com. Accessed 28 Apr 2021
3. Scigen-an automatic cs paper generator. https://pdos.csail.mit.edu/archive/scigen/. Accessed 28 Apr 2021
4. Chodorow, K.: MongoDB: The Definitive Guide, 2nd edn. O'Reilly, Sebastopol (2013)
5. Krathwohl, D., Paul Pintrich, R.E.M.: A Taxonomy for Learning, Teaching, and Assessing, 2nd edn. Longman Pennsylvania State University, USA (2001)
6. de la Guardia, C.: Python Web Frameworks, 1st edn. O'Reilly, Sebastopol (2016)
7. Deb, K.: Multi-Objective Optimization Using Evolutionary Algorithms: An Introduction, 1st edn. KANGAL Labs IITK, India (2011)
8. Ischinger, B.: Creating effective teaching and learning environments: first results from TALIS. Technical report, Organization for Economic Co-operation & Development (2009)
9. Mitchell, R.: Web Scraping with Python: Collecting Data from the Modern Web, 1st edn. O'Reilly, Sebastopol (2015)
10. Pattanayak, S.: Intelligent Projects Using Python: 9 Real-World AI Projects Leveraging Machine Learning and Deep Learning with TensorFlow and Keras, 1st edn. Packt, Amazon (2019)
11. Rohan Bhirangi, S.B.: Automated question paper generation system. Int. J. Emerg. Res. Manage. Technol. IJERMT, India (2016)
Is It Possible to Recognize a Philosophical Zombie and How to Do It
R. V. Dushkin
Artificial Intelligence Agency, Moscow, Russia
Abstract. This analytical article considers the problem of recognizing and differentiating so-called “philosophical zombies” in order to obtain a set of operational criteria for determining the subjectivity of artificial intelligent systems. This problem can be seen as one of the ways to approach the “hard problem of consciousness”. The proposed approach does not solve the “hard problem” by itself but reveals some aspects of neurophysiology, cybernetics, and information theory on the way to solving it. The research methodology uses an interdisciplinary approach that allows one to study the research subject and combine the results of a review of four theories into one conclusion. The relevance of this theme comes from the ever wider use of artificial cognitive agents in life, which raises the question of where exactly the boundary lies that separates the intellectual part of an artificial cognitive agent, even if it has a different mind. According to the author, it is the presence of phenomenological consciousness that gives an object subjectivity; therefore, the development of more and more complex artificial cognitive agents (artificial intelligent systems) leads to a heated discussion on this issue. The article attempts to substantiate recognized philosophical concepts and their limitations, as well as assumptions about the possibility of qualifying artificial cognitive agents. This article will be of interest to anyone interested in artificial intelligence in all its aspects, as well as in the philosophy of mind.
Keywords: Philosophy of mind · Philosophy of artificial intelligence · Philosophical zombie · Qualia · Perception · Phenomenology of consciousness · Artificial intelligence · Machine intelligence · Mind · Nonhuman mind
1 Introduction
Artificial intelligence is an interdisciplinary field of scientific research, the main task of which is to understand the nature of human intelligence, reason, and, ultimately, consciousness (especially phenomenological) [1]. Researchers in the field of artificial intelligence initially set themselves this goal [2], but in the course of the development of this science, a large number of applied methods and technologies have appeared for solving various problems using a cognitive approach: the methods that humans use to solve such problems seem to lie outside the framework of a strict computational approach, as understood by A. Turing and J. von Neumann [3]. This has led to the emergence and widespread use of artificial cognitive agents (artificial intelligent systems) of various types and classes for a variety of applications in all areas of life, from security to the creation of general-level personal assistants.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 778–790, 2022. https://doi.org/10.1007/978-3-030-82193-7_52
The development of these artificial intelligence agents will lead to them displaying more complex cognitive behavior. Already, artificial intelligence systems are more efficient than humans in some tasks, and as their complexity increases, there will be even more such tasks and applications. In turn, this will certainly lead to the issue of the duties, rights, and responsibilities of artificial intelligent systems being put on the agenda. This question is not as simple as it might seem at first glance. The point is that if an object shows intellectual behavior of a rather complex nature, then the problem arises— how much can this object represent the subject, including the subject of law? After all, if something behaves intellectually, looks intellectual, and reports that it is intellectual, then it is most likely intellectual. And here comes the delicate point—how exactly to determine that the object has acquired subjectivity? When does a thing become a being? The novelty of this work is based on the consideration of a specially developed set of criteria for understanding the possibility of a particular object of research to have something that can be described as a “phenomenal internal state”, while the consideration is carried out in the aspect of artificial intelligence systems in their modern understanding by mankind. The relevance of the work is due to the need to obtain a set of operational criteria for determining the possible subjectivity of the considered cognitive agent, both artificial and natural, which will allow using this possibility in the development of modern artificial intelligent systems. It seems that this will allow a more conscious approach to the development of artificial cognitive agents and avoid certain errors in their construction, which, without any doubt, can occur when building artificial intelligent systems of varying degrees of autonomy. 
Thus, the relevance of the work is based on the desire to offer an operational hypothesis for the recognition and differentiation of “philosophical zombies” and, as a result, vice versa—to determine the presence of a phenomenal internal experience. The article goes on to explain why, in principle, it is necessary to be able to recognize and differentiate the “philosophical zombie”, and also provides a set of preliminary criteria for the possibility of such recognition and differentiation for special conditions. There are also assumptions about whether artificial cognitive agents will ever be able to obtain internal qualitative states (“qualia”) that define phenomenological subjective experience. All this is considered in the aspect of an application to artificial intelligence systems.
2 Why is It Necessary to Think About Philosophical Zombies
The philosophical zombie is a thought experiment from the field of philosophy of consciousness, first introduced into argumentation by the philosopher Robert Kirk, who used this term in his work [4]. David Chalmers then actively used the philosophical zombie argument in his cornerstone work [5]. The philosophical zombie is an argument against physicalism in the philosophy of consciousness. However, although the concept of the philosophical zombie is clearly defined, it cannot be falsified from the standpoint of the natural-scientific worldview, so it often falls outside the scope of scientific interest. D. Chalmers defines a philosophical zombie as a being whose appearance and behavior cannot be distinguished from those of a human, but which lacks the qualitative internal states that are available to humans and can be observed only in the first person. This is a creature that has “everything dark inside” [5] (and calm, the author adds).
The absence of qualitative states in a hypothetical philosophical zombie cannot be detected from the third person, since by definition on the question “do you have internal qualitative states?”—the philosophical zombie responds positively. This knocks the possibility of verification out of the hands of researchers, and this is why the concept itself does not lend itself to falsification. A “pure” philosophical zombie cannot be recognized from the outside by definition, and this makes the concept an exclusively speculative argument in philosophical thought experiments. At the same time, the concept of a philosophical zombie, according to the author, has a very important applied meaning that can be used in the field of law. The fact is that the question of legal personality in cognitive agents can just be reduced to the presence of internal qualitative states in the agent. And in connection with the development of artificial cognitive agents that show more and more effective intellectual behavior, this question of legal personality already goes beyond purely theoretical speculation and becomes an acute question of the current state of artificial intelligence (as a science). The point is that the argument includes the so-called “duck test”, which was formulated by the poet J. Riley [6]: “When I see a bird that walks like a duck, swims like a duck and quacks like a duck, I call this bird a duck”. This is the apotheosis of the behavioral approach to object identification, which can lead to unexpected consequences if you try to apply it to artificial intelligent agents. After all, if something behaves as a rational being, acts as a rational being, makes decisions as a rational being, and quite possibly calls itself a rational being, then there is no reason not to accept such a cognitive agent as a rational being. Although no artificial cognitive agent has yet passed the full Turing test, human civilization is already on the verge of passing it. 
The author’s position is that subjectivity can be determined through the presence of internal qualitative states. If something has internal qualitative states, then it is sufficiently developed to receive subjectivity from the law—to receive rights and, if necessary, duties and responsibilities. This approach may resolve the dilemma of whether and when subjectivity can be assigned to artificial intelligence agents. That is why it is necessary to think about philosophical zombies. The philosophical zombie is a thing. Something with qualia, internal qualitative states—is an intelligent being with a basic right to subjectivity.
3 How to Recognize a Philosophical Zombie
To recognize a philosophical zombie, it is necessary to narrow the scope of consideration and translate this concept from the category of thought experiments to the category of verifiable phenomena. It is impossible to do this without taking a certain point of view. Since a person already has a vividly realized point of view from which they can study their qualitative states (in the first person), it is from this point of view that the philosophical zombie and possible methods of recognizing it should be considered. Of course, this yields an anthropocentric view, but it is difficult to obtain another one in modern conditions. To study the concept of the philosophical zombie, several thought experiments will be proposed, which will reveal step by step the possibility of recognizing and differentiating the philosophical zombie. From these thought experiments, a set of criteria will be
compiled that can be used in the next section to answer the question “Will artificial cognitive agents ever be able to obtain internal qualitative states (qualia)?”. For the first thought experiment, imagine a hypothetical ideal device that is designed to deprive the person immersed in it of all sensory input. So-called sensory deprivation chambers have existed since the middle of the twentieth century and allow almost all external stimuli and part of the internal sensory stimuli to be disabled for a person placed inside [7]; however, like any non-ideal device, such chambers do not disable external and internal stimuli by 100%. What happens if an ordinary adult is put in the ideal device? The person will be plunged into darkness and silence, and they will not feel tastes, smells, or touches. But their inner feelings will remain: they will be able to inflict pain on themselves, they will continue to have a model of their body with the help of the entire complex of proprioceptors, and, most of all, their memory and imagination will draw various images before their mind's eye. At first vague, then brighter and brighter, the entire complex of sensations, as captured in the interaction of all parts of the sensory cortex of the brain, will become available to the person in the chamber of complete sensory deprivation. It will be like a lucid dream. Is a person a philosophical zombie in this situation? The answer to this question in the first person is obvious: “No”. However, after a while, the brain gradually sinks into darkness due to the lack of sensory stimulation, the person increasingly falls into dreamless sleep, and their personality is in danger of dissociation. Now imagine a baby that has just been born and was immediately placed in a full sensory deprivation chamber. The mother's womb itself is such a chamber, so this baby will not “notice” the change of environment, so to speak.
In these conditions, all necessary means of life support must be applied to such a baby in such a way that none of its sensory systems registers the intake of nutrients or the removal of waste products. Let this baby grow up in the deprivation chamber to the same age at which the full-fledged adult was immersed in the chamber earlier. Under these conditions, will this poor being have any internal qualitative states? It is immersed in darkness and silence, and therefore the sensory zones of its cerebral cortex do not work (if they have not already atrophied). Because of the lack of learning through perception, it will not even have any sign system for describing its experiences, and therefore it will not have any experiences either. It will be a being that has everything dark and quiet inside. However, we cannot say that this will be a philosophical zombie in the full definition of this concept, since it will not be able to fully interact with the surrounding world after being rescued from the sensory deprivation chamber. Moreover, it is still unclear whether it will gain a new subjective experience after leaving the chamber: discussions about “Mary's room” are still underway [8, 9]. The above thought experiment shows that the possession of qualia requires sensory systems of the corresponding modality that are turned on and working. Also, certain types of qualia, such as images of memory and imagination and all the feelings they generate in a person, require sensory experience that is imprinted in memory and can be activated when sensory information is lacking. But memory and experience are again based on the mass processing of sensory signals in the process of learning, so it seems that the presence of sensory systems is the primary condition for the existence of qualia in humans.
At the same time, an important question is the quantitative and qualitative composition of the sensory systems and, as a result, of the sensory signals that come to the human brain for processing and generate qualitative experiences in it. The German philosopher Rudolf Hermann Lotze in his book [10] described a hypothetical animal that has only one sensory system of tactile modality, located as a sensitive point on a mobile tendril. This hypothetical animal of Lotze can feel the world with its tendril, performing the function of knowledge. According to R. H. Lotze, such an animal could well know the world around it, and it would certainly be able to distinguish moving objects from stationary ones. However, it should be noted here that the presence of just one sensitive point in the tactile modality does not allow the animal to separate itself from the surrounding world, and this hypothetical animal of Lotze would have a serious problem: its entire capacity for sensory experience would be filled with a single tactile sense (namely, only tactile, without a sense of temperature), and this sense would be undifferentiated with respect to whether the animal is feeling itself or something else. Such an animal would not be able to separate itself from the surrounding world, merging with it in its internal sensations. It would have internal qualitative states of tactile modality that would absorb all of its attention throughout its life. This means that the phenomenal consciousness of such a hypothetical animal would be in a state of constant exposure to the “white noise” of a single modality. However, there are examples among humans that are close to the hypothetical animal of Lotze described above. These are people who are deafblind from birth. Although doomed to spend their entire lives in darkness and silence, they nevertheless have the opportunity to become full-fledged persons, as evidenced by the “Zagorsk pedagogical experiment” [11].
Although the experiment itself cannot be considered pure in the context of this discussion, since it involved children who had lost their sight and hearing by some age, it shows another important aspect: to differentiate one's personality from the surrounding world, an “extended sensory system” is necessary. If we consider the skin, the tactile organ of a person, it must be sensitive to touch everywhere, so that simultaneous touching of two places can be realized as “studying oneself”. The question of which sensory systems can be used for differentiated knowledge of the world is beyond the scope of this article, but it can be noted here that, of the external human sensory systems, only the sense of smell is not suitable for this purpose. Next, consider the ant. An ant is a fairly simple creature that has two developed sensory systems that allow it to explore the world around it and successfully perform its functions. These systems are the tactile system and chemoreception (similar to the human sense of smell); the latter, in contrast to the human sense of smell, can be used as an “extended” system for the previously described differentiation (in ants, vision is very poorly developed, and in some species there is no vision at all). However, these two sensory systems are not integrated, and each of them works separately. The processing of sensory information and the decision-making based on it occur in these two sensory systems independently of each other. This has a very important consequence. Interestingly, John Locke in his work “An Essay Concerning Human Understanding” [12] described the so-called “Molyneux’s problem”, which was described to him in a
letter by the Irish natural philosopher William Molyneux: “Suppose a man born blind, and now adult, and taught by his touch to distinguish between a cube and a sphere of the same metal, and nighly of the same bigness, so as to tell, when he felt one and the other, which is the cube, which the sphere. Suppose then the cube and sphere placed on a table, and the blind man be made to see: quaere, whether by his sight, before he touched them, he could now distinguish and tell which is the globe, which the cube?” The answer to this question seems to be “No”. Speculatively, it was answered in the negative by most philosophers who considered this thought experiment, but in the end, testing in practice with blind children who managed to regain their vision shows that the sensory spaces are initially separated—for some time after successful restoration of vision, children had to take an object in their hands and touch it to identify and name it. It was only after some time of training that the sensory integration of the two modalities—visual and tactile—finally took place [13]. According to many researchers, children with sensory integration disorders suffer from a variety of psychiatric and neuropsychiatric problems of varying degrees (often combined with the term “autism spectrum disorder”), depending on which sensory systems are susceptible to pathology [14]. Indeed, in the brain of animals, starting with chordates, there is such an important organ as the thalamus. This organ consists of a large number of nuclei, many of which receive information from several sensory systems of the body, and after processing transmit information to the associative parts of the cortex. In higher animals, including primates and humans, the thalamus is responsible for multisensory integration and association, forming a large number of reciprocal connections with the neocortex [15]. 
The primate thalamus filters and processes all incoming information from external sensors, proprioceptors, and interoceptors, excluding the olfactory system. Thus, a complete picture of the surrounding reality is formed, given in several modes of sensation, together with a model of the body in relation to the environment. All this suggests the following ideas, which can be formulated as a hypothesis about the set of properties an object must have in order not to be considered a “philosophical zombie”:
1. The formation of internal qualitative states requires functioning sensory systems that provide the perception of signals from the environment.
2. Sensory systems should be able to differentiate the observer and the observed, separating the object from the environment.
3. Multi-sensory integration of the processed sensory information should be implemented for a complete representation of the environment.
These conditions are necessary. If one of these conditions is missing from the agent, then it is presumably a philosophical zombie. However, are they sufficient? To find out, one needs to try to answer the question: “Could artificial intelligent systems get qualia?”.
4 Could Artificial Intelligent Systems Get Qualia
According to [1, p. 170], qualia in this work will be understood as “the perceived quality of sensory signals”. Sensory signals are input streams of information perceived by the
body’s sensors from the external or internal environment. For example, these can be photons, acoustic waves, gravity, the thermal motion of molecules, and the molecules themselves. The perceived quality of these signals refers to the modality of the signal (visual, acoustic, tactile, etc.) together with the complex of internal “impressions” that this modality excites in the nervous system of the perceiving being. In fact, according to modern views, hypernet complexes of neurons excited in response to sensory stimuli are responsible for the perceived quality of sensory signals in the mammalian cortex [16]. As a result, this allows one to determine the neurophysiological correlates of qualitative states (however, it still does not explain where these qualitative states come from). According to modern views in the field of analytical philosophy of consciousness, qualia are a phenomenon supervenient on the physical substrate of the nervous system [5]. At the same time, the qualitative states of phenomenological consciousness are completely determined by the activity of the brain and its biochemical and electrophysiological processes [17]. The phenomenology of the conscious activity of a living being is determined by the excitability of groups of neural networks per se and therefore does not require additional entities for its explanation. However, none of the modern theories of consciousness provides an answer to the so-called “hard problem of consciousness” [18]. This work also does not claim to provide an answer to this hard problem, but it attempts to determine the origin of qualia from an information-theoretic point of view, as well as to answer the question of whether artificial systems will be able to experience internal qualitative states.
Figure 1 shows the general scheme of receiving and processing the external sensor signal from the sensors to the associative mechanism responsible for exciting a complex of associations to the received signal.
Fig. 1. General scheme of information transfer from the sensor to the associative neural network
Figure 1 shows, in a very simplified way and using general blocks of information processing, how information is transferred from the sensory system to the associative neural network of living creatures (as currently understood). Take the visual modality as an example. Photons (the sensory signal), reflected from some object in the environment, fall on the retina (the sensory system). In the sensory system, the signal is converted into a sequence of action potentials. These first excite interneurons that carry the converted signal to the nuclei of the optic chiasm and the thalamus, where the visual modality is integrated with signals from other sensory systems that perceive information at the same time. Then the converted, filtered, and integrated signal enters the visual cortex, where image recognition is performed, from which “symbolic” information (represented by the same action potentials) is finally
Is It Possible to Recognize a Philosophical Zombie
transmitted to the associative neural network in the neocortex, where “recognition” is performed by activating a huge number of associations with concepts related to what was recognized. To try to answer the question of where in this chain of information transmission qualitative states can originate, one can use a thought experiment that cuts and re-commutes the links in this chain. For this purpose, it is necessary to consider the simultaneous functioning of two sensory systems, for example, vision and hearing. Figure 2 shows a simplified scheme of such collaboration.
Fig. 2. General scheme of information transfer from the sensor to the associative neural network simultaneously for two sensory systems: auditory and visual
In the diagram in Fig. 2, the area of sensory integration and its connections are shown by dotted lines, since this aspect will not be considered in further discussion in order to simplify the conclusions. In the previous section, it was shown that the subsystem of sensory integration is important for the formation of internal qualitative states, and it certainly contributes to this process, but this higher-level issue is beyond the scope of this work, and its study requires additional thinking and research. One can also see that there is one significant simplification in the diagram: the connections from the auditory and visual cortex lead to the same area of associative neurons. This is a fairly high level of abstraction, since in reality the associative cortex of the brain also has many regions responsible for various functions. Nevertheless, the presented abstract switching scheme allows us to consider some questions of the formation of qualia in the neural networks of the nervous system of higher animals. Given these simplifications and limitations, there are only two possibilities for cross-switching the signals from the sensory systems illustrated in Fig. 2 so that the processed audio signal reaches the visual cortex and, vice versa, the processed visual signal reaches the auditory cortex. These two possibilities are shown in Fig. 3, diagrams a) and b), respectively. For the next thought experiment, it is also necessary to divide the perception procedure into two variants: perception during the initial training of neural networks
Fig. 3. Possible options for switching signals from sensor systems to recognition systems: a) reconnection occurs directly from sensory systems to intermediate neurons; b) reconnection occurs from intermediate neurons to pattern recognition zones
and perception by already trained neural networks throughout the information transmission channel from the sensory signal to the associative cortex of the brain. Indeed, these can be two completely different processes, which is also hinted at by the difference between the learning process and the regular functioning of artificial neural networks [19]. Therefore, if the channels for transmitting sensory information are re-switched for already trained networks, the modalities are effectively replaced for the sensory areas of the cortex. Table 1 presents the results of a thought experiment on the re-commutation of connections in paired channels for transmitting sensory information in two operating modes of neural networks.

Table 1. A thought experiment on re-commutation of connections

Operating mode: Training
Option A: The simplest option. When switching neural pathways from auditory and visual sensors to intermediate neurons, it seems that nothing should change when learning sensory neural networks, except that these sensory networks will change places. The sensory integration will also be fine: the sensory modalities will integrate the same way as in the case without recommutation.
Option B: The re-commutation of sensory pathways from auditory and visual sensors after sensory integration may result in the cortical sensory zones not being integrated with each other after learning. The visual and auditory areas of the cortex will change places, but sensory signals from opposite sensors will come to the same area after integration: from the ears to the auditory cortex, from the eyes to the visual cortex. As a result, after training, a person with such a recommutation will not be able to match the name and appearance of objects.

Operating mode: Perception
Option A: In this case, the signal from the ears will be sent to the trained visual cortex, and the signal from the eyes to the auditory cortex, while the sensory integration will work in “normal mode”. It can be assumed that if this recommutation is performed, the person will see sounds and hear visual images, but due to the lack of isomorphism between the auditory and visual cortex, as well as between the individual components of the channels, such perception will be generally chaotic: the sounds seen will cause a flurry of dots and vague images, and the heard visual images are likely to cause a cacophony in the head. However, since the sensory integration was performed correctly, some correspondence will be established between the visual chaos and the cacophony, which may eventually lead to the possibility of orientation in the surrounding space.
Option B: Finally, in this variant, perception from the recommuted sensory modalities will also be chaotic, but due to impaired sensory integration, the perception disorders will not be related to each other in any way. In fact, both sensory modalities will fail: the person will not be able to see or hear, and there is no question of establishing any correspondence between the perceived signals. This is the most severe violation.
The descriptions presented in Table 1 of the consequences of re-commutation of neural pathways in the human sensory information processing system are based on the implicit assumption that the structure of the cortical areas responsible for such processing is uniform and does not depend on the sensory modality of the signals that come there for processing. This is a fairly strong assumption that can be tested in practice in field experiments. However, the low-level structure of all such sensory zones consists of layers filled with so-called “cortical columns” [20], while the composition and structure of columns designed to recognize a single type of signal looks the same for the entire cortex. Therefore, the described assumption for the presented variants of the thought experiment looks correct.
The presented variants of the thought experiment suggest that, in the process of training natural neural networks, information about where a particular sensory signal comes from is stored somewhere in them. Neurons of the brain throughout the channel of information transmission from the sensory system to the associative cortex not only adjust the “weight coefficients” in synapses, but also somehow record information about the modality of the incoming processed sensory signals. That is why there is no chaotic perception during recommutation in “pure”, untrained neurons, whereas such chaotic behavior does occur in already trained neurons. At the same time, information about the nature of the sensory signal should be recorded by the neurons of the corresponding sensory cortex, but it has to be transmitted through intermediate interneurons, so it can also be recorded there. What is the purpose of such fixation of information about the modality of the sensory signal? When the neural networks of the visual cortex are excited, they acquire the phenomenological qualitative state called “qualia”. As mentioned, according to modern views, qualia occur in the neural networks themselves during the excitation of complexes of neurons called qualons [16]. And if one imagines a trained neuron of the visual cortex that receives a sensory signal from the ear, a visual quale still arises, because this neuron stores information about the sensory modality it was trained on. All this is also confirmed by a simple experiment. When the eyes are open, a bright qualitative state is formed in the visual sensory network, because all trained neurons in the visual cortex are strongly activated by signals coming from the eyes. But if one closes one's eyes, the inner vision becomes dark, broken by vague amoeboid images that appear due to stochastic minor activation of certain areas of the sensory cortex.
However, during a dream, neurons in the visual cortex are also active and receive signals from other areas of the cortex, although the interneurons from the eyes do not work. In this case, qualia are formed during sleep, which hints that information about sensory modality is stored in the neurons of the visual cortex. This hypothesis may well be based on the genetic mechanisms of information storage by neurons [21]. When training neurons in sensory neural networks, biomolecular complexes can be formed in them, based both on the activation of specialized transcription factors and on the expression of certain genes that mark the perceived modality. Complexes of neurons trained in this way form qualons, and after sensory integration with other perceived modalities—operons and holons, which form a complete picture of the perceived reality in all sensory modalities when they are activated. Finally, how does the author answer the question in the title of this section? Indeed, will artificial intelligent systems be able to get qualia? If we approach the question from the position of the described hypothesis, the answer to it is positive. In fact, the nature of qualia lies in transmitting and remembering the modality of sensory information from sensory systems to neurons in the sensory cortex, and when these neurons are activated, the corresponding sensory modality forms the qualia. If the information transmission pathways of an artificial cognitive agent, which are properly switched and integrated, transmit and record information about the sensory modality, it is likely that internal qualitative states can occur in these networks. However, they may be of a different nature than that of humans.
5 Conclusion
This article attempts to outline a set of possible criteria for practical recognition and differentiation of philosophical concepts in order to determine the subjectivity of artificial intellectual agents. A number of thought experiments are presented that reveal the author’s understanding of how internal qualitative states appear in an information system (including the nervous system of higher animals). However, the hard problem of consciousness still remains unsolved, and the presented set of criteria can be used exclusively for operational purposes, in attempts to identify philosophical zombies from the position of human consciousness. At the same time, the thought experiments described in the article can be partly tested both on model animals and in silico, which makes it possible to outline the direction of further research on the issues presented in the article. At a minimum, the recommutation of sensory pathways and the search for genetic mechanisms of neuronal memory can be performed on biological models, and then transferred to simulation models in artificial cognitive agents. In addition, people with certain perception features can also be important objects of observation. These features include, first of all, deafness (in terms of severe limitations of sensory systems) and synesthesia (in terms of violations and confusion of sensory integration).
References
1. Dushkin, R.V.: Artificial Intelligence, 280 p. DMK-Press, Moscow (2019). ISBN: 978-597060-787-9
2. Crevier, D.: AI: The Tumultuous Search for Artificial Intelligence. BasicBooks, New York (1993). ISBN: 0-465-02997-3
3. Lavington, S. (ed.): Alan Turing and His Contemporaries: Building the World’s First Computers. British Computer Society, London (2012). ISBN: 978-1-9061-2490-8
4. Kirk, R.: Sentience and behaviour. Mind 83, 43–60 (1974)
5. Chalmers, D.J.: The Conscious Mind: In Search of a Fundamental Theory. Oxford University Press, New York and Oxford (1996). ISBN: 0-19-511789-1
6. Heim, M.: Exploring Indiana Highways. Exploring America’s Highway (2007). ISBN: 9780-9744358-3-1
7. Lilly, J.C., Gold, E.J.: Tanks for the Memories: Flotation Tank Talks. Gateways Books & Tapes (2000). ISBN: 0895560712
8. Jackson, F.: What Mary didn’t know. J. Philos. 83(5), 291–295 (1986)
9. Dennett, D.C.: What RoboMary knows. In: Alter, T. (ed.) Phenomenal Concepts and Phenomenal Knowledge. Oxford University Press, Oxford, Oxfordshire (2006). ISBN: 978-0-19-517165-5
10. Lotze, R.G.: Medizinische Psychologie oder Philosophie der Seele. Weidmann’sche Buchhandlung, Leipzig (1852). http://bit.ly/2uLhzwT
11. Mitasova, M.: Coming out of the dark: the story of an experiment/“So-unity”. Foundation for the support of the deafblind, 252 p. Eksmo, Moscow (2016)
12. Locke, J.: An Essay Concerning Human Understanding, 1st edn, vol. 1. Thomas Basset, London (1690)
13. Held, R., et al.: The newly sighted fail to match seen with felt. Nat. Neurosci. 14, 551–553 (2011). https://doi.org/10.1038/nn.2795
14. Kranowitz, C.S.: An Unbalanced Child, 396 p. Redaktor, Moscow (2012). ISBN: 978-5-99017512-9
15. Jones, E.G.: The Thalamus [in English], vol. 2, reprint of the 1985 edition, 915 p. Springer, New York (2012). ISBN: 978-1-4615-1749-8. https://doi.org/10.1007/978-1-4615-1749-8
16. Anokhin, K.V.: Cognite. Hypernet model of the brain. In: Trofimov, A.G. (ed.) The Collection: Neuroinformatics-2015. XVII All-Russian Scientific and Technical Conference with International Participation: Collection of Scientific Papers (2015)
17. Dennett, D.C., Allen, L. (ed.): Consciousness Explained, 551 p. The Penguin Press (1991). ISBN: 978-0-7139-9037-9
18. Chalmers, D.J.: Facing up to the problem of consciousness. J. Conscious. Stud. 2(3), 200–219 (1995)
19. Nikolenko, S., Arkhangelskaya, E., Kadurin, A.: Deep Learning. Dive into the World of Neural Networks, 480 p. Piter, SPb (2018). ISBN: 978-5-496-02536-2
20. Hawkins, J.: On Intelligence, 1st edn, p. 272. Times Books (2004). ISBN: 978-0805074567
21. Ivashkina, O.I., Vorobyova, N.S., Gruzdeva, A.M., Roshchina, M.A., Toropova, K.A., Anokhin, K.V.: Cognitive indexing of neurons: CRE-controlled genetic labeling and study of cells involved in learning and memory. Acta Naturae (Russian version), vol. 10, no. 2, pp. 40–51. Publisher Park-Media, Moscow (2018)
Dynamic Analysis of Bitcoin Fluctuations by Means of a Fractal Predictor
Jesús Jaime Moreno Escobar1(B), Oswaldo Morales Matamoros1, Ana Lilia Coria Páez2, and Ricardo Tejeida Padilla3
1 ESIME, Zacatenco, Instituto Politécnico Nacional, Mexico City, Mexico [email protected]
2 ESCA Tepepan, Instituto Politécnico Nacional, Mexico City, Mexico
3 EST, Instituto Politécnico Nacional, Mexico City, Mexico
Abstract. The use of cryptocurrencies such as Bitcoin, Ethereum or Ripple has boomed in recent years. It is interesting to have a Bitcoin forecasting tool to try to understand trends at the global economic level. Bitcoin is a virtual currency that can be used as a means of payment just like physical money. Any cryptocurrency uses peer-to-peer technology and is not controlled by any economic or political entity, such as a bank or government. In 2009, Bitcoin was conceived and was priced at 0.39 USD, reaching its all-time high in 2017 with a price of 17,549.67 USD, i.e. 45 thousand times more in less than 10 years. This work focuses on predicting the trend of the Bitcoin price in 2020 by using a Self-Affine Fractal Analysis as an artificial intelligence tool. The results provided by the present work for the first 6 months show 98% agreement with those actually observed, despite training only with data from the first days of the time series.
Keywords: Fractal predictor · Bitcoin · Cryptocurrencies · Self-affine analysis · Fluctuations

1 Introduction
Cryptocurrencies or virtual currencies are offered globally through the internet and are sometimes presented as an alternative to legal tender, although they have very different characteristics:
– It is not mandatory to accept them as a means of paying debts or other obligations.
– Their circulation is very limited.
– Their value fluctuates strongly, so they cannot be considered a good store of value or a stable unit of account.
The strong fluctuations experienced by these cryptocurrencies, which seem typical of classic speculative bubbles, are well known. As an example, the average value of bitcoin on the main platforms on which it is traded increased in 2017 from approximately 979.53 USD per unit at the beginning of the year to more than 17,147.04 euros in mid-December. Figure 1 shows the Bitcoin logo. Since then, the trend has been downward. As of February 5, 2018, its price was below 6,902.35 USD, which represents a drop of more than 65% from the December highs. A person who had bought bitcoins in late 2017 and sold them today would suffer very noticeable losses [18].
Fig. 1. Bitcoin Logo.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 791–804, 2022. https://doi.org/10.1007/978-3-030-82193-7_53
Additionally, numerous fundraising actions are taking place in which investors finance projects through so-called initial coin offerings or ICOs. The expression ICO can refer both to the actual issuance of cryptocurrencies and to the issue of rights of various kinds, generally called tokens. These assets are put up for sale in exchange for cryptocurrencies such as bitcoins or ethers, or for official currency (for example, US dollars or USD) [10]. There are five main aspects to consider before investing in cryptocurrencies or participating in an ICO:
1. To date, no ICO has been registered, authorized or verified by any supervisory body in Spain. Therefore, there are no cryptocurrencies or tokens issued in ICOs whose acquisition or possession worldwide can benefit from any of the guarantees or protections provided in the regulations regarding banking or investment products. Investments in cryptocurrencies or in ICOs outside the regulation are not protected by any mechanism similar to that which protects cash or securities deposited in credit institutions and investment services companies.
2. Before buying this type of digital asset or investing in products related to it, consider all the associated risks and assess whether you have enough information to understand what is being offered. In this type of investment there is a high risk of loss or fraud.
3. Cryptocurrencies lack intrinsic value, which makes them highly speculative investments. Furthermore, their strong dependence on poorly consolidated technologies does not exclude the possibility of operational failures and cyber threats that could mean temporary unavailability or, in extreme cases, the total loss of the amounts invested.
4. The absence of markets comparable to organized securities markets subject to regulation can hinder the sale of cryptocurrencies or tokens issued in ICOs to obtain conventional cash.
5. In the case of ICOs, the information made available to investors is usually not audited and is often incomplete. Generally, it emphasizes potential benefits, minimizing references to risks.
2 Related Work
In recent years a new type of tradable asset has appeared, generically known as cryptocurrencies. Among them, the most widespread is Bitcoin. Bariviera et al. in [2] compared the dynamics of Bitcoin and standard currencies, analyzing their returns at different time scales. They investigated the long memory in return time series from 2011 to 2017, using transaction data from one Bitcoin platform. In addition, they computed the Hurst exponent by means of the Detrended Fluctuation Analysis method, measuring long-range dependence. They found changes in the Hurst exponent values over the first years of the studied period, tending to stabilize in the last part of the period. In conclusion, they claim that the Bitcoin market can be described as a self-similar process, displaying persistent behavior until 2014.
Caporale et al. in [4] studied the persistence behavior in the cryptocurrency market, applying R/S analysis and fractional integration long-memory methods and taking as inputs the four main cryptocurrencies (BitCoin, LiteCoin, Ripple, Dash) over the sample period 2013–2017. They found that the cryptocurrency market displays persistence, that is, positive correlations between its past and future values. Hence, they argue that the market is inefficient, because they did not find the random-walk behavior associated with market efficiency.
Owing to the great development and valorization that cryptocurrencies have acquired, Costa et al. in [8] analyzed the following four cryptocurrencies, chosen for their market capitalization and data availability: Bitcoin, Ethereum, Ripple, and Litecoin. They used detrended fluctuation analysis and detrended cross-correlation analysis with the respective correlation coefficient. Bitcoin and Ripple seemed to behave as efficient financial assets, while Ethereum and Litecoin displayed some
persistence. When the authors correlated Bitcoin with the other three cryptocurrencies, at short time scales all the cryptocurrencies were correlated with Bitcoin, although Ripple had the highest correlations. On the other hand, at longer time scales, Ripple was the only cryptocurrency with significant correlation.
Quintino et al. in [17] determined the persistence exhibited by Bitcoin, measured by the Hurst exponent, from Brazilian market daily prices from 9 April 2017 to 30 June 2018, and compared it with Bitcoin in USD. They used Detrended Fluctuation Analysis to analyze, for that period, the Bitcoin prices resulting from trades made at two Brazilian financial institutions: Foxbit and Mercado. The authors found that Mercado and Foxbit returns followed Bitcoin dynamics and showed persistent behavior, although the persistence was higher for the Brazilian Bitcoin.
The operation of all types of cryptocurrency depends on the blockchain. Ribeiro in [19] presented a monograph on the blockchain from the point of view of economic theory, describing the challenge that cryptocurrencies pose to central authorities focused on monetary control and analyzing the effects that the emergence of Bitcoin has had on economic theory, especially on monetary theory, both orthodox and heterodox. As for the blockchain, the author compares Bitcoin with monetary theory, providing complete coverage of the central theme of the monograph. Nonetheless, our research focuses on the cryptocurrency and not on the blockchain; therefore, although that work is relevant in terms of theory, it is necessary to relate it to Bitcoin.
Cajiao and Fonseca in [3] related the Bitcoin currency to the economic and social environment in Colombia, using the investigative interview as the information collection instrument.
After obtaining different points of view and opinions on Bitcoin, they concluded that the low level of financial education, not only in Colombia but throughout Latin America, is delaying a possible implementation of virtual currency in the Latin American financial market.
While the above works dealt with Bitcoin and, more specifically, with its price, the work by Ciaian et al. in [7] was one of the first articles to deal with the economics of Bitcoin price formation. The authors consider the traditional determinants of a currency, such as macroeconomic aspects, together with aspects specific to digital currencies, and they conclude that macro-financial aspects do affect the price of the cryptocurrency, although Bitcoin experienced its most important growth not in 2016 but in 2017.
We also reviewed the Monday effect, a concept handled in finance due to its effects mainly on the price of shares and Treasury bills. Decourt et al. in [9] examined the Monday effect in Bitcoin returns via Student's t-test, finding significantly higher returns on Monday than on any other day. Therefore, they concluded that the Bitcoin price obeys a behavior similar to that of several stocks.
Charles and Darné in [6] used six GARCH-type autoregressive time series models, commonly used to analyze stock price volatility. They mainly focused on analyzing the volatility of the price of Bitcoin, finding that the ARGARCH model is the best specification compared to the other GARCH
models that were used. However, our research is focused on the valuation of the price of the cryptocurrency and not on its volatility.
According to the above, the dynamics of Bitcoin price returns have been characterized by applying fractal analysis tools such as Detrended Fluctuation Analysis and R/S analysis. Moreover, the Bitcoin price has also been forecast by applying econometric (GARCH) and artificial intelligence (neural network autoregression) tools. Hence, in this work we characterize the Bitcoin price fluctuations by calculating the Hurst exponent (H) from the structure function of the time series of Bitcoin price fluctuations at different time intervals of the sample. Additionally, we forecast the Bitcoin prices by applying an artificial intelligence tool.
This work is divided into the following sections. Section 3, the Theoretical Framework, presents the main features of Bitcoin and Fractal Theory. Section 4 defines the main Proposal along with the Methodology. Finally, Sect. 5 presents the experimental methodology as well as the most important results that prove the value of this work.
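Several of the studies reviewed above estimate the Hurst exponent with Detrended Fluctuation Analysis (DFA). As a rough illustration only, and not the procedure of any of the cited papers, DFA with linear detrending can be sketched as follows. The function name `dfa_exponent`, the choice of window sizes and the synthetic input are assumptions made for this example.

```python
import numpy as np

def dfa_exponent(x, scales):
    """Return the DFA scaling exponent of the 1-D series x (linear detrending)."""
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())            # integrated, mean-centred series
    fluctuations = []
    for n in scales:
        n_seg = len(profile) // n                # number of full windows of length n
        # One window per column: shape (n, n_seg)
        segments = profile[: n_seg * n].reshape(n_seg, n).T
        t = np.arange(n)
        # Least-squares line for every window at once (one fit per column)
        slope, intercept = np.polyfit(t, segments, 1)
        trend = np.outer(t, slope) + intercept
        # Root-mean-square deviation from the local trends
        fluctuations.append(np.sqrt(np.mean((segments - trend) ** 2)))
    # F(n) ~ n^alpha, so alpha is the slope in log-log coordinates
    alpha, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return alpha

# Hypothetical usage on synthetic, uncorrelated "returns" (not real Bitcoin data)
rng = np.random.default_rng(0)
returns = rng.standard_normal(16384)
print(f"DFA exponent: {dfa_exponent(returns, [16, 32, 64, 128, 256]):.2f}")
```

For an uncorrelated series the exponent should come out close to 0.5. Values above 0.5 are usually read as persistence, which is the sense in which the reviewed studies describe Bitcoin returns as persistent.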
3 Theoretical Framework
3.1 Bitcoin
Bitcoin is a virtual, independent and decentralized currency, since it is not controlled by any State, financial institution, bank or company. It is an intangible currency, although it can be used as a means of payment just like physical money. Virtual currencies constitute a heterogeneous set of innovative payment instruments that, by definition, lack a physical support to back them up. The term Bitcoin has its origin in 2009, when it was created by Satoshi Nakamoto (pseudonym of its author or authors) [11], with the aim of being used to make purchases only through the Internet. Bitcoin was born with high ambitions: to provide citizens with a means of payment that enables the execution of fast, low-cost transfers of value without being controlled or manipulated by governments, central banks or financial entities. The virtual currency uses cryptography to control its creation. The system is programmed to generate a fixed number of bitcoins per unit of time through computers called miners. Currently, that number is fixed at 25 bitcoins every ten minutes, although it is programmed so that it is halved every 4 years. Thus, starting in 2017, 12.5 bitcoins will be issued every ten minutes. Production will continue until 2140, when the limit of 21 million units in circulation is reached [10]. To make use of this virtual currency, it is necessary to download software to a computer or mobile device, which serves as a virtual wallet and generates a Bitcoin address for sending and receiving money from other users. In addition, the sending of bitcoins is instantaneous and all operations can be monitored in real time. Transactions with this currency involve a transfer of value between two bitcoin addresses, which are public although anonymous. To guarantee security, transactions are secured using cryptographic keys, since each account has a public and
a private key. As with other virtual currencies, Bitcoin also implies risks that must be highlighted in order to understand the exact magnitude of this currency. To identify these risks, we again resorted to the report of the General Directorate of Operations, Markets and Payment Systems of the Payment Systems Department of the World Bank:
– Financing of illegal activities and/or money laundering. Due to the decentralized nature of the scheme, transfers take place directly between the payer and the beneficiary, without the need for an intermediary or administrator. This makes it difficult to identify and give early warning of behavior that may indicate illegal activities.
– The fact that organized crime networks are making widespread use of emerging electronic payment systems can create a negative reputation for digital payment methods.
– Despite the fact that, in principle, any computer can actively participate in the process of creating new Bitcoin units, the high computational capacity required implies that, in practice, this activity is dominated by a small group of actors.
– Possible fraudulent transactions. To the extent that the protocols on which bitcoin is based are open software developments, the implementation of its different versions does not have to occur uniformly among all users.
– From the point of view of fraud, Bitcoin presents a significant weakness compared to other payment methods in the online world, such as cards.
– Impact on price stability and financial stability, since private trading platforms where bitcoins can be exchanged for legal tender currencies are marked by high price volatility due to speculative movements.
In the present work we propose to use tools from fractal geometry applied to the time series of the Bitcoin price, in order to generate forecasts of the Bitcoin price.
3.2 Fractal Theory
Since Benoît Mandelbrot introduced the term fractal into the literature, it has acquired increasing familiarity among scientists, becoming more and more popular and, to some extent, a fad, mainly for two reasons [12]:
1. A large number of fractal objects have been and are being discovered (electrical discharges, coastlines, river fluids, turbulence).
2. The mathematics involved in fractals is so simple that the corresponding literature can be read by anyone who has taken a first course in calculus.
Fractals appeared in mathematics towards the end of the 19th century, initially under the name of non-differentiable or non-rectifiable curves, as examples of curious objects. These were curves or surfaces that were endlessly
folded, infinite lines that were compacted in a regular way on a finite surface, surfaces that were not differentiable at any point, and sets of isolated points isomorphic to the final straight, to name only some non-rectifiable objects [12]. An approximate and functional definition of what will be understood by fractal in this study has been reached, highlighting its fundamental characteristics, such as the irregularity of the forms, the invariance at different scales of analysis and the self-affinity of the parts with the whole. It is possible to extract some fundamental characteristics such that, if one or all of them are fulfilled, a set can be regarded as fractal. Thus, a fractal set would be:
– A set that is sufficiently irregular that it cannot be described with the usual geometric language, either locally or globally;
– A set that has a fine structure, that is, it has details on whatever scale is observed;
– A set that presents some form of self-similarity, which can be approximate or statistical;
– And, usually, a set whose fractal dimension (defined in some way) is larger than its topological dimension, and does not have to be an integer.
The authors then further mention that other researchers frequently indicate some additional characteristics of fractal structures. In this way, a fractal structure satisfies one of the following properties:
– It has detail at all observation scales;
– It is not possible to describe a fractal structure with Euclidean Geometry, either locally or globally;
– A fractal structure has some kind of self-similarity, possibly statistical;
– The fractal dimension of a fractal structure is greater than its topological dimension;
– The algorithm used to describe a fractal structure is very simple, and possibly recursive in nature.
To account for fractals and shapes with non-integer dimensionality, issues related to dimensionality in general and to fractal dimension in particular have been reviewed. This made it necessary to review the Hausdorff exponent and the Box Counting Dimension as one of the best known techniques for determining the fractal dimension of an object (but not the only one, since there are others, such as the Hurst exponent used in this work). The main utility of fractal geometry is that it is the appropriate geometry for all those objects that were left out of a description by Euclid's (traditional) geometry. Nature has an inherent irregularity that, since fractal geometry appeared, can be described and studied in a much more satisfactory way. The premise of this study is that fractal geometry is an appropriate descriptive model for various behaviors of the time series of the cryptocurrency Bitcoin, which until now have only been approached in a purely intuitive way through the training of neural networks.
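As an illustration of the box-counting idea mentioned above, the following minimal Python sketch counts the grid cells of side ε that contain at least one point and reads the dimension off the slope of log N(ε) against log ε. The function name `box_counting_dimension`, the choice of box sizes and the test set (a straight line segment) are assumptions made for this example only; they are not taken from the paper, which uses the Hurst exponent instead.

```python
import numpy as np

def box_counting_dimension(points, box_sizes):
    """Estimate the box-counting dimension of a 2-D point set (illustrative)."""
    points = np.asarray(points, dtype=float)
    counts = []
    for eps in box_sizes:
        # Assign every point to the grid cell of side eps that contains it.
        cells = np.floor(points / eps).astype(np.int64)
        # Count the distinct occupied cells.
        counts.append(len(np.unique(cells, axis=0)))
    # N(eps) ~ eps^(-D), so D is minus the slope of log N against log eps.
    slope, _ = np.polyfit(np.log(box_sizes), np.log(counts), 1)
    return -slope

# Sanity check on a non-fractal object: a line segment has dimension 1.
x = np.linspace(0.0, 1.0, 20000, endpoint=False)
line = np.column_stack([x, x])
d_line = box_counting_dimension(line, [1 / 4, 1 / 8, 1 / 16, 1 / 32, 1 / 64])
```

For a genuinely fractal set the same slope would typically be non-integer; the quality of the estimate depends strongly on the range of box sizes chosen.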
4 Proposal
4.1 Theoretical Definition
Statistically, the fluctuations or volatility of a financial time series p(t) are characterized by the standard deviations ν(t, τ) for a given sampling time interval τ [21]. They exhibit power-law correlations, so these complex systems may not respond immediately to the information that flows towards them, but react gradually over a certain period of time [5,15,21,22]. The analysis of the scaling or fractal properties of fluctuations has offered relevant information about the underlying processes responsible for the observed macroscopic behavior of complex systems [5,15,21,24,25]. In this paper, the long-term correlations displayed by the time series fluctuations are studied by applying their structure function, defined by Eq. 1 as follows:

$$\sigma(\tau, \delta_t) = \left\langle \overline{\left[\nu(t + \delta_t, \tau) - \nu(t, \tau)\right]^{2}} \right\rangle^{1/2} \qquad (1)$$
where the upper bar denotes the average over all times t in the time series of length T − τ, with T the length of the original time series p(t), and the angle brackets denote the average over different realizations of the time window of size δt [22]. The structure function of the fluctuations exhibits the power-law behavior σ ∝ (δt)^H, with H the local or roughness exponent, even though the time series fluctuations ν(t, τ) appear to be random [5,15,21,22]. The scaling behavior σ ∝ (δt)^H characterizes the correlations in the time series fluctuations treated as a growing interface in (1 + 1) dimensions [14,16,20]. Accordingly, in this paper the structure function was applied to study the dynamics of the time series fluctuations associated with the Bitcoin price, by analyzing the behavior of the standard deviations for different sampling time intervals. Hence, the time series of standard deviations were treated as interfaces in motion, where the sampling time interval τ plays the role of the time variable and the physical time t plays the role of the spatial variable [1].
4.2 Methodology
Within artificial intelligence there are systems that think rationally; these try to emulate the logical thinking of humans, that is, they investigate how to make machines perceive, reason and act accordingly. The proposed system tries to reason about the average fluctuation in order to propose future fluctuations and, from them, an estimate of the price trend [23]. To study the dynamics of time series fluctuations, this paper constructs the time series of standard deviations ν(t, τ) of the original series p(t) for the open price (op) and the close price (cp) of Bitcoin. For this study the length of each original financial series was Top = 2410 and Tcp = 2410 (USD versus time). In addition, the sampling rate was one business day, and the standard deviations were computed over sampling intervals τ and time windows δt ranging from 3 to 200.
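The construction of the volatility series described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not say whether the standard deviation is taken over raw prices or over returns, nor whether the population or sample estimator is used, so the sketch assumes raw prices and the population estimator. The names `volatility_series` and `volatility_family` are hypothetical.

```python
import numpy as np

def volatility_series(prices, tau):
    """Return v(t, tau): the standard deviation of each window of tau prices.

    Assumption: population standard deviation of raw prices. The result has
    length T - tau + 1 (the paper quotes T - tau).
    """
    prices = np.asarray(prices, dtype=float)
    if not 2 <= tau <= len(prices):
        raise ValueError("tau must be between 2 and the series length")
    # One row per window position, tau columns per row.
    windows = np.lib.stride_tricks.sliding_window_view(prices, tau)
    return windows.std(axis=1)

def volatility_family(prices, tau_min=3, tau_max=200):
    """Build one volatility series per sampling interval tau_min..tau_max."""
    return {tau: volatility_series(prices, tau)
            for tau in range(tau_min, tau_max + 1)}

# Toy input only: a linear ramp, not Bitcoin data.
toy_prices = np.arange(300.0)
family = volatility_family(toy_prices)
```

With the default range 3 ≤ τ ≤ 200 the dictionary holds 198 series, one for each sampling interval.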
Figure 2(a) shows the original time series p(t) of the Bitcoin open-price, and Fig. 2(b) shows the structure function of the time series of standard deviations ν(t, τ), with a Hurst exponent (slope) of Hop = 0.5675. Likewise, Fig. 3(a) shows the original time series p(t) of the Bitcoin close-price, and Fig. 3(b) shows the structure function of the time series of standard deviations ν(t, τ), with a Hurst exponent (slope) of Hcp = 0.5689.
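The way an exponent such as H is read off as a slope can be illustrated with a short Python sketch of the structure function and a log-log least-squares fit. It is a simplified version under stated assumptions: only the time average is computed, so the average over realizations in Eq. 1 is omitted, and the fit uses all supplied lags rather than a selected scaling range. The function names are hypothetical, and the input is synthetic Brownian motion rather than Bitcoin data.

```python
import numpy as np

def structure_function(v, lags):
    """sigma(lag) = sqrt(mean over t of [v(t + lag) - v(t)]^2), single series."""
    v = np.asarray(v, dtype=float)
    return np.array([np.sqrt(np.mean((v[lag:] - v[:-lag]) ** 2))
                     for lag in lags])

def fit_power_law(lags, sigma):
    """Fit sigma = a * lag^H by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(lags), np.log(sigma), 1)
    return slope, np.exp(intercept)  # (H, prefactor a)

# Self-check on an uncorrelated random walk, for which H is close to 0.5.
rng = np.random.default_rng(0)
walk = np.cumsum(rng.standard_normal(100_000))
lags = np.array([1, 2, 4, 8, 16, 32, 64])
H, prefactor = fit_power_law(lags, structure_function(walk, lags))
```

A fitted value above 0.5, as reported for both price series, would indicate persistence; the fitted prefactor plays the role of the coefficient in the power-function fits quoted in the figures.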
[Figure 2 panels: (a) Open, price against Time (seconds); (b) Open Fluctuation, Fluctuation (V) on logarithmic axes, with the power function fitting y = 43.9730 x^0.5675.]
Fig. 2. (a) Original time series of bitcoin Open-Price, p (t) (USD), with Top = 2410. (b) Dynamic scaling of the structure function for the volatility time series with Hop = 0.5675.
[Figure 3 panels: (a) Close, price against Time (seconds); (b) Close Fluctuation, Fluctuation (V) on logarithmic axes, with the power function fitting y = 43.7038 x^0.5689.]
Fig. 3. (a) Original time series of bitcoin close-price, p (t) (USD), with Tcp = 2410. (b) Dynamic scaling of the structure function for the volatility time series with Hcp = 0.5689.
The algorithm to determine the fractal behavior of the volatility time series consists of the following steps:
1. Collect time series with at least 2000 data points of the past tendency of the Bitcoin open-price and the Bitcoin close-price, and train the system with the H values found.
2. Construct one hundred ninety-eight time series of standard deviations (fluctuations or volatility) for every original time series, applying the equation of the standard deviation for different sample intervals: 3 ≤ τ ≤ 200. This yields 198 time series ν(t, τ) for both samples.
3. Determine the type of correlations displayed by the time series of fluctuations ν(t, τ), applying Eq. 1 of the structure function in order to predict the future tendencies.
4. If the fluctuation structure function obtained in Step 3 exhibits a power-law behavior σ ∝ (δt)^H, then, from the same equation, obtain the dynamic roughness (or local) exponent H, in order to determine whether the system displays positive correlations (persistence) in the long term and to establish at what time the fluctuations move from a power-law behavior to one of saturation (horizontal curve or zero slope). The purpose is to establish whether the behavior of the Bitcoin price fluctuations over time can be described by the Family-Vicsek fractal model $w(L, t) \sim L^{H} f\left(t / L^{H/\beta}\right)$, which represents the dynamic scaling behavior of self-affine surfaces in motion.
5. Obtain the values of the dynamic scaling exponent H for both samples. If, based on the values of the exponents H, the fluctuation behavior does not fit the behavior described by the Family-Vicsek model, look for another model that explains and predicts the behavior of the fluctuations for both the Bitcoin open-price and the Bitcoin close-price.
6. Finally, project the trend findings and propose the future trend of Bitcoin prices through a power law in logarithmic space.
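Step 4 of the list above asks two questions of a measured structure function: whether H indicates persistence, and where the power law gives way to saturation. The Python sketch below shows one possible way to answer both, using local log-log slopes. The threshold `flat_slope`, the tolerance `tol` and all function names are hypothetical choices for illustration; the paper does not state how the crossover was located.

```python
import numpy as np

def local_slopes(lags, sigma):
    """Slope of log(sigma) against log(lag) between consecutive lags."""
    return np.diff(np.log(sigma)) / np.diff(np.log(lags))

def saturation_onset(lags, sigma, flat_slope=0.05):
    """First lag after which the local slope drops below flat_slope.

    flat_slope is an assumed threshold. Returns None if no plateau is found.
    """
    flat = np.flatnonzero(local_slopes(lags, sigma) < flat_slope)
    return None if flat.size == 0 else int(lags[flat[0]])

def correlation_type(H, tol=0.02):
    """Classify long-term correlations from H (tol is an assumed margin)."""
    if H > 0.5 + tol:
        return "persistent"
    if H < 0.5 - tol:
        return "anti-persistent"
    return "uncorrelated"

# Synthetic curve: power law with exponent 0.57 that saturates at lag 50.
lags = np.arange(3, 201)
sigma = np.minimum(lags, 50.0) ** 0.57
onset = saturation_onset(lags, sigma)
kind = correlation_type(0.5675)
```

On real, noisy data the local slopes would usually need smoothing before a threshold of this kind gives a stable crossover estimate.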
5 Experimental Results
5.1 Results
The database of opening and closing Bitcoin-USD prices was downloaded from the Markets Insider site, covering January 27, 2013 to June 20, 2020 [13]. The original series was then split into two parts, from May 27, 2013 to December 31, 2019, with the aim of predicting the first half of 2020. Quantitatively, the self-affinity of the time series of Bitcoin price fluctuations was characterized by the scaling behavior σ ∝ (δt)^H, as shown in Figs. 2 and 3. The structure function displays a power law σ ∝ (δt)^H with H(τ) = const within a range of intervals δt. In Figs. 2(b) and 3(b), the graphs of the dynamic scaling of the structure function for the Bitcoin open-price fluctuations and the Bitcoin close-price fluctuations, respectively, we observe the following values: Hop = 0.5675 and Hcp = 0.5689. This means that the fluctuations of both the Bitcoin open-price and the Bitcoin close-price display persistent behavior. Therefore, the dynamics of both Bitcoin prices fit a power-law behavior ranging over many more time scales (scale invariance), which supports a better approximation to the F-V model. Figures 4(a) and 4(b) show scatter plots of the actual Bitcoin price versus the predicted one. The correlation of these data shows that the prediction of the Bitcoin opening price has an effectiveness of 98.48%, while that of the closing price of the trading day has an effectiveness of 98.74%.
Fig. 4. Actual bitcoin price vs predicted bitcoin price, p (t) (USD), with Top = 172. (b) Open and (a) Close.
5.2 Discussion
According to the values of H for the Bitcoin open-price and the Bitcoin close-price, there appears to be a dynamic scaling behavior similar to the Family-Vicsek (F-V) dynamic scaling for the roughness kinetics of a moving interface [14], see Figs. 2(b) and 3(b). The scaling relation σ ∝ (δt)^H implies that the structure function of the time series of fluctuations displays the dynamic scaling behavior

$$\sigma(\tau, \delta_t) \propto \tau^{\beta} f\left(\frac{\delta_t}{\tau^{\beta/H}}\right)$$

where the scaling function behaves like f(u) ∝ u^H if u ≪ 1 and like f(u) ∝ const if u ≫ 1. These are the dynamic scaling relations of F-V, expressed as power laws, with critical scaling exponents that reflect the scale invariance of the time series fluctuations of both Bitcoin prices. Finally, it should be noted that, because the values of the scaling exponents Hop and Hcp are greater than 0.5, in the critical area the fluctuations of Bitcoin prices display positive long-term correlations, i.e. persistent behavior. As Fig. 5 shows, the power-law growth of the width gives way to a saturation regime (horizontal region) during which the width reaches a saturation value, wsat. As L grows, wsat increases as well, and the dependency likewise follows a power law, wsat(L) ∼ L^H for [t ≫ tx]. The exponent H, the roughness exponent, characterizes the degree of roughness of the saturated interface. For small u, the scaling function increases as a power law. In this regime we have f(u) ∼ u^H for [u ≪ ux].
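As a consistency check, and assuming the scaling form σ(τ, δt) ∝ τ^β f(δt / τ^{β/H}) with u = δt / τ^{β/H}, the two limits of the scaling function can be worked out explicitly. The symbol σ_sat is introduced here only for the plateau value and does not appear in the paper.

```latex
\begin{aligned}
u \ll 1:&\quad \sigma(\tau,\delta_t) \propto \tau^{\beta}\left(\frac{\delta_t}{\tau^{\beta/H}}\right)^{H}
        = \tau^{\beta}\,\delta_t^{H}\,\tau^{-\beta} = \delta_t^{H},\\[4pt]
u \gg 1:&\quad \sigma(\tau,\delta_t) \propto \tau^{\beta}\cdot\mathrm{const}
        \;\Longrightarrow\; \sigma_{\mathrm{sat}} \propto \tau^{\beta}.
\end{aligned}
```

The first limit recovers the observed power law σ ∝ (δt)^H independently of τ, and the second gives a plateau whose height grows as τ^β. The crossover between the two regimes sits at u ≈ 1, that is, at δt ≈ τ^{β/H}.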
Fig. 5. Growth of the interface width with time for the BD model for a horizontal-sized system [24].
As t → ∞, the width saturates. Saturation is reached for t ≫ tx, that is, u ≫ 1. In this limit f(u) = constant for [u ≫ 1]. The saturation time tx, together with the saturation width wsat, increases with the size of the system, which suggests that the saturation phenomenon constitutes a finite-size effect. This leads us to affirm that the fluctuations can be predicted, and therefore so can the opening and closing price of Bitcoin with respect to the USD; our results indicate a correlation greater than 98%. At the beginning of the growth, the sites are not correlated. During deposition, correlations between sites grow over time. During the transition from one regime to the other, i.e. in the critical region, the correlations reach the size of the system, so that the entire interface becomes correlated, resulting in saturation of the interface width.
6 Conclusions
The dynamic fluctuations of i) the Bitcoin open-price and ii) the Bitcoin close-price exhibited persistent behavior. Hence they can be modeled by the Family-Vicsek model. This work presents a novel algorithm that can predict the fluctuations, so that the opening and closing price of Bitcoin with respect to the USD can be estimated. Our results indicate a correlation greater than 98%, which means that the algorithm has high reliability. First, a review of the main machine learning techniques was carried out, with emphasis on the models applicable to forecasting problems on time series. Based on the analysis of results reported in the state of the art, a set of models to be used in solving the problem was defined. A series of experiments was successfully developed to evaluate the effectiveness of Self-Affine Analysis in making short-term forecasts of the price of Bitcoin. The experimentation reports a significantly higher precision than the base models. The effectiveness of the present proposal stands out from the existing deep learning models used (recurrent neural networks LSTM and GRU, and convolutional neural networks), in particular through the comparatively superior performance achieved by the present proposal. For all types of model, it was confirmed that applying a prior interpretation of the base data, through the incorporation of technical analysis indicators among the input characteristics, considerably improved the results. Finally, in the case of models based on neural networks, the effectiveness of increasing the volume of training data through the incorporation of historical information from other cryptocurrencies was verified. It is observed that, in order to exploit additional features such as the volume of data, the complexity of the present model had to be increased. It is concluded that the application of Self-Affine Analysis to the forecast of the price of Bitcoin is, to say the least, promising. This is especially true considering the wide range of possibilities that remain to be explored following this line of research.
Acknowledgment. This article is supported by the National Polytechnic Institute (Instituto Politécnico Nacional) of Mexico by means of projects No. 20210456, 20210458 and 20210707, granted by the Secretariat of Research and Postgraduate (Secretaría de Investigación y Posgrado), and by the National Council of Science and Technology of Mexico (CONACyT). The research described in this work was carried out at the Superior School of Mechanical and Electrical Engineering (Escuela Superior de Ingeniería Mecánica y Eléctrica) of the Instituto Politécnico Nacional, Campus Zacatenco.
The authors declare no conflict of interest. The authors of this article thank M. en C. Hugo Quintana Espinosa (director of ESIME-Zacatenco) and Lic. Pedro Arrechea Alfaro (Head of the Human Resources Department) for their support in carrying it out.
References
1. Balankin, A.S.: Dynamic scaling approach to study time series fluctuations. Phys. Rev. E 76, 056120 (2007)
2. Bariviera, A.F., Basgall, M., Hasperué, W., Naiouf, M.: Some stylized facts of the bitcoin market. Phys. A Stat. Mech. Appl. 484, 05 (2017)
3. Hoyos, M.C., Medellan, D.F.: Análisis de la implementación de las Bitcoins como método de pago en Colombia. Finanzas y Comercio Internacional, January 2016
4. Caporale, G.M., Gil-Alana, L., Plastun, A.: Persistence in the cryptocurrency market. Res. Int. Bus. Finan. 46, 141–148 (2018)
5. Balankin, A.S., García Paredes, R., Susarrey, O., Morales, D., Castrejon, F.: Kinetic roughening and pinning of two coupled interfaces in disordered media. Phys. Rev. Lett. 96(5–10), 101–104 (2006)
6. Charles, A., Darné, O.: Volatility estimation for bitcoin: Replication and extension (2018)
7. Ciaian, P., Rajcaniova, M., Kancs, d'A.: The economics of BitCoin price formation. Appl. Econ. 48(19), 1799–1815 (2016)
8. Costa, N., Silva, C., Ferreira, P.: Long-range behaviour and correlation in DFA and DCCA analysis of cryptocurrencies 7, 09 (2019)
9. Decourt, R.F., Chohan, U.W., Perugini, M.L.: La rentabilidad de Bitcoin y el efecto Lunes. Horizontes Empresariales 16(2), 4–14 (2017)
10. Fraser, J.G., Bouridane, A.: Have the security flaws surrounding bitcoin effected the currency's value? In: 2017 Seventh International Conference on Emerging Security Technologies (EST), pp. 50–55 (2017)
11. Ghimire, S., Selvaraj, H.: A survey on bitcoin cryptocurrency and its mining. In: 2018 26th International Conference on Systems Engineering (ICSEng), pp. 1–6 (2018)
12. Mandelbrot, B.B.: The Fractal Geometry of Nature. W.H. Freeman & Co Ltd., New York (1982)
13. Markets Insider. https://markets.businessinsider.com/currencies/btc-usd
14. Meakin, P.: Fractals, Scaling and Growth Far from Equilibrium, 1st edn. Cambridge University Press, Cambridge (1998)
15. Balankin, A., Matamoros, O., Galvez, E., Pérez, A.: Crossover from antipersistent to persistent behavior in time series possessing the generalyzed dynamic scaling law. Phys. Rev. E 69, 036121 (2004)
16. De Queiroz, S.L.A.: Roughness of time series in a critical interface model. Phys. Rev. E 72(6), 104–110 (2005)
17. Quintino, D., Campoli, J., Burnquist, H., Ferreira, P.: Efficiency of the Brazilian bitcoin: a DFA approach. Int. J. Finan. Stud. 8(2), 356 (2020)
18. Rahouti, M., Xiong, K., Ghani, N.: Bitcoin concepts, threats, and machine-learning security solutions. IEEE Access 6, 67189–67205 (2018)
19. Ribeiro, P.V.: Blockchain à luz da teoria econômica (2016)
20. Ramasco, J., López, J.M., Rodríguez, M.A.: Generic dynamic scaling in kinetic roughening. Phys. Rev. Lett. 84(10), 2199–2202 (2000)
21. Constantin, M., Das Sarma, S.: Volatility, persistence, and survival in financial markets. Phys. Rev. E 72(5), 106–116 (2005)
22. Kantz, H., Schreiber, T.: Nonlinear Time Series Analysis, 2nd edn. Cambridge University Press, Cambridge (2003)
23. Schwartz, T.: Expert focus-estimating manpower requirements for expert system projects. IEEE Expert 3(2), 12–15 (1988)
24. Barabasi, A.L., Stanley, H.E.: Fractal Concepts in Surface Growth. Cambridge University Press, Cambridge (1995)
25. Ashkenazy, Y., Ivanov, P.C., Havlin, S., Peng, C.-K., Goldberger, A., Stanley, H.E.: Magnitude and sign correlations in heartbeat fluctuations. Phys. Rev. Lett. 86(9), 1900–1903 (2001)
Are Human Drivers a Liability or an Asset?
David Sanders, Malik Haddad, Giles Tewkesbury, Alex Gegov, and Mo Adda
University of Portsmouth, Portsmouth PO1 3DJ, UK
Abstract. This paper investigates the relationship between near misses and human error that leads to a collision. A smart wheelchair that can detect obstacles is used as an example. Many collisions are the fault of the driver and could be avoided if a sensor system were allowed more control of the vehicle. Some claim that eliminating the human driver from the control loop could eliminate collisions caused by human error. Others caution that human error may not disappear completely with the elimination of the driver and that new incidents may occur because of it. Analysis suggests that many errors are attributable to the human driver but that, at the same time, a human is vital for avoiding many accidents. The volume of near misses in this analysis is sufficient to draw some generalized conclusions about the nature of the detection and mitigation of collisions. It would strengthen the analysis if near misses from other types of driving were available. Near misses are a source of information about potential collisions: a person who is having more near misses may be becoming tired or distracted. Near misses were analyzed in order to examine what role on-board human drivers play in the occurrence and detection of the initial stages of a collision.
Keywords: Obstacle detection · Collisions · Driver · Control · Disabled · Wheelchair
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 805–816, 2022. https://doi.org/10.1007/978-3-030-82193-7_54
1 Introduction
Unmanned vehicles, mobile robots and assistive systems such as semi-autonomous powered wheelchairs are expected to revolutionize society and industry in the coming decades [1]. With improved sensor and computing capabilities, the vehicles and powered wheelchairs of the future are expected to be safer and more efficient [2–4]. However, autonomous does not necessarily mean unmanned, although some benefits of autonomous operation may come from the absence of human drivers. One of the most frequently claimed benefits is the elimination, or at least reduction, of human error. An estimated 75–96% of accidents can be attributed to human error [5], and an obvious solution might be to remove human drivers, and therefore human error. However, autonomous vehicles also get into accidents because things sometimes go awry, such as an AI software bug, a hardware fault, or a person unexpectedly leaping in front of a vehicle. Ahvenjärvi [6] explained that there is human interaction in many aspects of operations, and that all aspects are subject to human error. He explained that removing humans may simply shift the human error from the driver to the designers, programmers and maintenance personnel. Even if the majority of accidents can be partly or fully attributed to human error, onboard human drivers can be vital to the mitigation of consequences if an accident does happen. Wróbel examined accident reports and assessed whether each accident would have happened on an autonomous vehicle and, once it had, whether its consequences would have been different if there had been no human to counteract them [7]. They concluded that the number of accidents (such as collisions) may be reduced but that the consequences of any accidents will be greater because there is less ability for a human driver to mitigate them. Wróbel explained that their study was limited in that they were only able to investigate incidents where accidents happened and not those where humans were able to prevent an accident from happening. Fortunately, it is the norm for vehicles to operate without incidents or accidents most of the time. From time to time incidents happen that require a human driver to react. Most of the time these incidents are resolved without consequences and without any need to involve other people. The literature is a little one-sided in that it mainly contains analysis of instances when things go wrong, not when they go right [1]. The work described in this paper attempts to partly bridge that gap by considering collisions, and near misses that came close to resulting in a collision but where either the human driver or a collision avoidance algorithm may have prevented it. This paper investigates the role played by the sensor system and the human driver in detecting incidents and starts evaluating whether sensors can detect near misses and adjust controllers to intervene to stop them from developing into collisions.
2 Do Near Misses Suggest that Collisions May Occur?
The theoretical background for the analysis is presented here. Section 2.1 is a brief explanation of a near miss and the reporting process. Section 2.2 introduces the conceptual accident model used in this paper.
2.1 Near Misses
Detecting and reporting near misses can serve to improve safety by learning from past experiences. In this work, sensors mounted on a wheelchair recorded near misses and drivers reported near misses. The objective was to investigate whether near misses could predict later collisions so that preventative measures could be put in place. This stems from the idea that even though a near miss may not have consequences, there may be proportionality between near misses and collisions. That concept is called the "iceberg model" or "accident pyramid" developed by Heinrich [8]. An example of Heinrich's Pyramid is in Fig. 1. Accident prevention has moved on since the pioneering work begun by Heinrich in the 1930s. It is now widely accepted that the linear causal understanding of accidents that Heinrich proposed, and that the accident pyramid implied, is simplistic [9]. The ratio of major to minor accidents and near misses varies greatly and the causes of minor incidents and accidents are often not the same [10]. Some, but not all, accidents can be predicted by minor incidents, and not all minor incidents have the potential to develop into accidents. Accidents are often a consequence of a combination of circumstances and frequently come as a surprise, according to "Normal Accident Theory" [11]. It follows that the number of accidents that were prevented from happening for one reason or another may never be known. It may never even be noticed that an accident was close to occurring.
Fig. 1. Heinrich’s accident pyramid [12].
Evaluating safety performance based on the rate of near misses and/or minor incidents has been used [1, 10]. One issue, though, is the sometimes tenuous relationship between minor incidents and accidents. Another equally important issue is the discrepancy between actual and reported incidents. There are many reasons why near misses are not reported. Sharing the experience of an accident that nearly happened with other people could ideally prevent it from happening again. There is, however, no obvious short-term incentive for an individual to report an incident. An individual may also fear being blamed, disciplined or embarrassed. If individuals feel that carers are unsupportive, complacent or insincere, they may be less inclined to report. In this work sensors were used to automatically report near misses. The definition of a near miss can be ambiguous. Some people distinguish between unsafe conditions, unsafe acts, near misses, minor incidents, near accidents, etc. but others do not. Determining which category, if any, an event belongs to can be challenging. In this work a near miss was defined as detecting an object within a pre-set distance. Near misses are not a perfect representation of how exposed a system or driver is to accidents, but there is evidence for the importance of near misses as a tool for improving safety [13].
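The working definition above (an object detected within a pre-set distance) can be illustrated with a short Python sketch that turns a stream of range readings into near-miss reports. It is a sketch under stated assumptions: the 0.5 m threshold, the rule that consecutive readings below the threshold form a single event, and the names `NearMiss` and `detect_near_misses` are all hypothetical and are not taken from the system described in the paper.

```python
from dataclasses import dataclass

@dataclass
class NearMiss:
    time_s: float      # time of closest approach within the event
    distance_m: float  # smallest distance measured during the event

def detect_near_misses(samples, threshold_m=0.5):
    """Group consecutive readings below threshold_m into single events.

    samples: iterable of (time_s, distance_m) pairs in time order.
    threshold_m: assumed pre-set distance, not a value from the paper.
    """
    events = []
    current = None
    for time_s, distance_m in samples:
        if distance_m < threshold_m:
            # Keep only the closest approach of the ongoing event.
            if current is None or distance_m < current.distance_m:
                current = NearMiss(time_s, distance_m)
        elif current is not None:
            # The object has moved out of range, so the event is closed.
            events.append(current)
            current = None
    if current is not None:
        events.append(current)
    return events

readings = [(0, 1.0), (1, 0.4), (2, 0.3), (3, 0.8), (4, 0.45)]
events = detect_near_misses(readings)
```

Recording the closest approach rather than every reading keeps one report per encounter, which makes counts of near misses comparable between drivers.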
2.2 Bowties
A commonly used risk evaluation method to evaluate and understand causal relationships is the bowtie method, an example of which can be seen in Fig. 2. The method is named after the shape, which resembles a bowtie. Causes are on the left and Consequences are on the right. In the center the bowtie converges in a Critical Event, which is also called the top event. The Critical Event can be defined as an accident such as a collision [14]. Between the Causes and the Critical Event and between the Critical Event and the Consequences there are Barriers. A Barrier is in place to prevent, control or mitigate a hazard. Barriers can be physical barriers or other physical equipment, or they can be non-physical, such as procedures or policies. In this work the barrier is a sensor system, sometimes feeding an obstacle avoidance routine.
Fig. 2. Generic example of a Bowtie [15].
There can be several Causes of a Critical Event and a Critical Event can have several Consequences. In this work the Critical Event was a collision. There can be many Causes of a collision and there can be many Consequences depending on the Barriers in place. Driving too fast for the surface, situation or conditions is still one of the most common contributory factors in collisions. Distraction can be a major contributor (by either driver or carer), and even a quick glance away takes the driver's eyes off the route for a second or two. Wheelchair users (and other drivers) can drive recklessly, at least sometimes. Sometimes drivers are not aware of the risks around them (other vehicles and people). Finally, drivers can get tired or may not have the full ability to drive safely without assistive systems or carers. By focusing on the Critical Event, a collision, it becomes evident whether there are sufficient Barriers in place. Although it is possible to define Causes, Critical Event and Consequences, it can be difficult to distinguish clearly between them [16]. The bowtie can be used both quantitatively and qualitatively to include them all. In this paper, the bowtie model was used to understand the concept of developing collisions.
3 Method and Testing
Safety-related information such as near misses for a wheelchair user or driver can contain sensitive information about an individual or can be time consuming to assemble, and is not normally collected. A series of tests took place at the University of Portsmouth. Testing was undertaken to compare the number and type of near misses and collisions recorded when the system was being jointly controlled using a mixture of computer and human control, and when it was being controlled by a human user alone. Several tests were conducted through a variety of courses: half with the sensors just recording near misses and collisions, and half with the sensors connected to a collision avoidance routine to provide automatic assistance. Obstacle courses were created for each test. The courses were within various corridors with some sloping and flat surfaces, walls and doorways, and extra obstacles were added in staggered formations. Staff and students at the University of Portsmouth volunteered to drive during the testing. They were mainly students. A clear explanation of the study was provided (including benefits and risks) and the University Ethics Committee approved the testing procedure. There were 11 males and 4 females. Figure 3 is a sketch of Corridor Three. Arrows show a general route for a wheelchair. Shaded blocks are the obstacles in the wheelchair path. Corridor Three also included two double-doorways where one door was kept shut and the other open. That meant the chair had to be zig-zagged to pass through them.
Fig. 3. Corridor three.
A total of 103 near misses were recorded by the sensor system and examined to evaluate their relationship to collisions. The near misses originated from 13 different volunteer drivers. The near miss reports consisted of a time and a distance. This section describes the method used in the analysis of near misses. The following paragraphs describe the scenario in which the analysis was undertaken, the initial selection of near misses for analysis, and the further categorization of the near misses.
Some assumptions about the on-board equipment and operation of the wheelchairs were made for the purpose of the analysis. The scenario was based on a powered wheelchair driving through various cluttered environments [17–21]. A microcontroller had access to all the inputs from the monitoring system on the wheelchair. The microcontroller could start, stop and operate the electric motors driving the wheelchair. If the joystick controlling the wheelchair was held in the forward position then control was effectively given to the sensors and the wheelchair operated autonomously; the wheelchair perceived and reacted to objects without any outside control. The added complexity and possible additional sources of failure resulting from the remote monitoring were not considered. The control system described in [22] was implemented, in which a Lyapunov-based technique was employed to guarantee convergence and robustness of the system. The basic principle of the technique was to account for disturbances such as veer in order to steer the powered wheelchair along a desired path. The categorization of the near misses was done in four steps. Near misses were first categorized into types of comparable incidents. There were large variations between near misses and exact evaluations of each incident were difficult. In the first categorization step, the outcomes from the initial selection were categorized into three near miss types: Very close, Close, and Marginal. The ranges were actually analogue but were simplified into the three categories.
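The simplification of analogue distances into the three near miss types can be sketched as a simple binning step. The boundaries in `BOUNDS_M` (0.15 m and 0.30 m) and the function names are hypothetical, chosen only to make the example runnable; the paper does not state where the category boundaries were placed.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical category boundaries in metres (not given in the paper).
BOUNDS_M = (0.15, 0.30)
LABELS = ("Very close", "Close", "Marginal")

def categorize(distance_m):
    """Map an analogue distance onto one of the three near miss types."""
    return LABELS[bisect_right(BOUNDS_M, distance_m)]

def shares(distances_m):
    """Percentage of near misses per type; expects a non-empty list."""
    counts = Counter(categorize(d) for d in distances_m)
    total = sum(counts.values())
    return {label: 100.0 * counts[label] / total for label in LABELS}

example = shares([0.10, 0.20, 0.20, 0.40])
```

A distance exactly on a boundary falls into the less severe category with `bisect_right`; using `bisect_left` instead would assign it to the more severe one.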
4 Results
The initial selection described in Sect. 3 resulted in 103 near misses from the testing. Categorizing the near misses resulted in the distribution shown in Fig. 4. The largest segment was Marginal, which represented 44% of the near misses. Close represented 37%, and the remaining 19% were Very close.
Fig. 4. Near misses by type.
Fig. 5. The number of near misses and the number of collisions.
Figure 5 shows the number of near misses and the number of collisions. Sensor alarms can warn of close, or projected close, proximity to objects, and obstacle avoidance routines can take evasive action. The evaluation of whether a situation was dangerous or safe relied on the interpretation of different inputs about the actual situation and about intended future actions. Proximity alarms were often of little value and were often ignored [1]. Considering the causes of near misses recorded during testing:
• Human Error was responsible for 48% of the near misses.
• Equipment Failure was responsible for 3% of near misses.
• 49% were caused by External Influence.
Considering the causes of collisions recorded during testing:
• Human Error was responsible for 33% of collisions.
• Equipment Failure was responsible for 2% of collisions.
• 65% were caused by External Influence.
The number of near misses caused by Human Error was lower than the corresponding value for incidents that resulted in actual collisions. This discrepancy was probably due to a combination of different factors. Reporting bias must be considered as one of these. Only 15 collisions were reported but 21 collisions were detected. Figure 6 shows the average number of collisions and near misses reported by drivers compared to the average number of collisions recorded by a researcher observing the test. Drivers tended to under-report their near misses and collisions. Individuals were also more inclined to report equipment failures or failures resulting from external influences
D. Sanders et al.
Fig. 6. Average number of collisions reported by drivers compared to the average number of collisions recorded by a researcher observing the test.
rather than their own unsafe behavior, since this could reflect badly on them [1]. It may also be easier to blame faulty equipment or other people rather than unsafe behavior. Figure 7 shows the average number of collisions or near misses reported as their own fault compared to the average number of collisions recorded as the driver’s fault by a researcher observing the test.
Fig. 7. Number of collisions reported as their own fault compared to number recorded.
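The reporting bias described above (15 collisions reported against 21 detected) can be expressed as an under-reporting rate. The short sketch below uses only those two figures from the text; the function name is an illustrative assumption.

```python
def under_reporting_rate(reported, detected):
    """Fraction of detected events that drivers did not report."""
    if detected <= 0:
        raise ValueError("detected must be positive")
    if reported > detected:
        raise ValueError("reported cannot exceed detected")
    return (detected - reported) / detected


if __name__ == "__main__":
    # 15 collisions reported by drivers, 21 detected during testing.
    rate = under_reporting_rate(reported=15, detected=21)
    print(f"Unreported collisions: {21 - 15} of 21 ({rate:.1%})")
```

On these figures, six of the 21 detected collisions, or roughly 29%, went unreported by the drivers.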
Near misses and collisions are shown in Heinrich’s accident pyramid in Fig. 8. Another factor may be that the experiments and the resulting data used in this paper did not support deeper analysis of underlying or contributing factors. If the experiments had allowed for a more thorough investigation, then more detailed explanations for near
Fig. 8. Data recorded in Heinrich’s accident pyramid.
misses and collisions might have been found. Another contributing factor could be that the causes of collisions tend to be different from near misses. Experienced and skilled drivers often recorded several near misses as they skillfully cut corners to drive more efficiently. Near misses can also be the result of the failure of a single barrier. Other barriers, for example the response of the human driver, might then prevent a near miss from developing into a collision.
5 Discussion and Conclusions The volume of near misses in this analysis was of a sufficient quantity to make some generalized conclusions about the nature of the detection and mitigation of collisions. It would strengthen the analysis if near misses from other types of driving were available. Near misses were not a perfect proxy for the distribution and frequency of collisions. Some incidents might have been milliseconds away from developing into a collision had they not been detected and stopped. Other incidents might have been stopped by several other layers of barriers and would therefore not have had any consequences, even if they had escaped initial detection and mitigation. Any estimation of criticality and severity would just be speculative, because near misses were collisions that did not happen. Near misses were, despite all these reservations, a source of information about potential accidents. For example, if a person was having more near misses then they might be becoming tired or distracted. Human Error might be under-represented in this analysis. The term human error covers a wide range
and variety of faults, one of which was “omission”, that is, a failure to act. The discrepancies between near misses and collisions point to the role of the human driver in acting to stop collisions. Had a driver failed to act, near misses would often have resulted in a collision that would then have been attributed to Human Error due to omission. Very close near misses tended to disappear when the collision avoidance system was activated, although not all of them did. If the collision avoidance system was not connected then potential collisions were detected later, which left less time for the driver to stop the development of an incident into a collision. Near misses were analyzed in order to examine what role on-board human drivers played in the occurrence and detection of the initial stages of a collision. The effect of autonomous operation on the ability to stop the development of incidents into collisions was also examined. Near misses were not a perfect representation of the distribution of accidents. Many near misses did not have the potential to develop into a collision and not all collisions could be predicted by near misses. The near misses were divided into three types: Very close, Close, and Marginal. Compared to collisions, near misses might be under-represented. The explanation for this discrepancy might be a combination of factors relating to the way incidents were reported by the drivers and detected by the sensors. The discrepancy might also be seen as evidence that on-board human drivers could sometimes prevent a near miss from developing into a collision. In many cases when the sensors detected objects, the human driver would probably have detected them but possibly only after the incident had become more severe. This was an indication of the need for monitoring and detection equipment on the powered wheelchairs.
In almost all near misses, the possibility of stopping the incident developing into a collision might be the same whether the sensor system was assisting or not. This evaluation relates closely to the finding that the large majority (92%) of the near misses occurred when the wheelchair was maneuvering close to an obstacle, rather than when something (such as another person or wheelchair) moved into the path of the chair. Not all incidents related to human error could be expected to disappear with autonomous operation. For example, at least some human errors must occur because the driver was not seated correctly, or because of mistakes during maintenance work or because of errors in setting up assistive systems. Some near misses and accidents were directly caused by on-board human driver error and could be prevented if sensor systems assisted, while others would be caused by the introduction of semi-autonomous driving. Whether the introduction of semi-autonomous driving aided by sensors will result in a net decrease or increase in the number of accidents remains uncertain, but it will depend on the technical capabilities of the powered wheelchair systems in the future. Future work is investigating decision making to assist with powered wheelchair control [23, 24], AI systems [25–27], data collection [28–31], image processing [28–31] and systems to reduce veer [3, 22]. On today’s wheelchairs, with existing technical systems, humans are still important for detecting near misses earlier and for stopping them developing into collisions.
Acknowledgment. This research was supported by the Engineering and Physical Sciences Research Council (EPSRC).
References
1. Eriksen, S.: On-board human operators: liabilities or assets? In: 19th Conference on Computer and IT Applications in the Maritime Industries, pp. 98–110 (2020)
2. Rødseth, Ø.J., Burmeister, H.-C.: Developments toward the unmanned ship. In: DGON ISIS 2012, Berlin (2012)
3. Sanders, D.A., Langner, M., Tewkesbury, G.E.: Improving wheelchair-driving using a sensor system to control wheelchair-veer and variable-switches as an alternative to digital-switches or joysticks. Ind. Robot Int. J. 32(2), 157–167 (2010)
4. Sanders, D.A.: Using self-reliance factors to decide how to share control between human powered wheelchair drivers and ultrasonic sensors. IEEE Trans. Neural Syst. Rehabil. Eng. 25(8), 1221–1229 (2016)
5. Allianz: Safety and Shipping Review 2017, Allianz COLINS, D. (2018), What is the Heinrich Pyramid in Safety Management? (2017). https://www.prosapien.com/blog/heinrich-pyramid
6. Ahvenjärvi, S.: The human element and autonomous ships, TransNav. Int. J. Marine Navig. Safety Sea Transp. 10, 517–521 (2016)
7. Wróbel, K., Montewka, J., Kujala, P.: Towards the assessment of potential impact of unmanned vessels on maritime transportation safety. Reliab. Eng. Syst. Safety 165, 155–169 (2017)
8. Heinrich, H.W., Petersen, D., Roos, N.: Industrial Accident Prevention: A Safety Management Approach, New York (1980)
9. Salminen, S., Saari, J., Saarela, K.L., Räsänen, T.: Fatal and non-fatal occupational accidents: identical versus differential causation. Safety Sci. 15, 109–118 (1992)
10. Hale, A.: Conditions of occurrence of major and minor accidents: urban myths, deviations and accident scenarios, Tijdschrift Voor Toegepaste Arbowetenschap 15 (2002)
11. Perrow, C.: Normal Accidents: Living with High-Risk Technologies. Princeton University Press, Princeton (1999)
12. Heinrich’s accident pyramid. https://www.pro-sapien.com/blog/heinrich-pyramid
13. Jones, S., Kirchsteiger, C., Bjerke, W.: The importance of near miss reporting to further improve safety performance. J. Loss Prev. Process Ind. 12, 59–67 (1999)
14. Markowski, A.S., Mannan, M.S., Bigoszewska, A.: Fuzzy logic for process safety analysis. J. Loss Prev. Process Ind. 22, 695–702 (2009)
15. https://www.cgerisk.com/knowledgebase/File:Bowtie_Diagram.png. Downloaded Jan 2021
16. De Ruijter, A., Guldenmund, F.: The Bowtie method: a review. Safety Sci. 88, 211–218 EMSA (2014). Annual Overview of Marine Casualties and Incidents 2014, European Maritime Safety Agency, Lisboa (2016)
17. Sanders, D.A., Bausch, N.: Improving steering of a powered wheelchair using an expert system to interpret hand tremor. In: Liu, H., Kubota, N., Zhu, X., Dillmann, R., Zhou, D. (eds.) ICIRA 2015. LNCS (LNAI), vol. 9245, pp. 460–471. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-22876-1_39
18. Sanders, D.A., Gegov, A., Haddad, M., Ikwan, F., Wiltshire, D., Tan, Y.C.: x. Rule-based Expert System to decide on direction and speed of a powered wheelchair; Sanders, D.A.: Using self-reliance factors to decide how to share control between human powered wheelchair drivers and ultrasonic sensors. IEEE Trans. Neural Syst. Rehabil. Eng. 25(8), 1221–1229 (2016)
19. Sanders, D.A., Gegov, A., Haddad, M., Ikwan, F., Wiltshire, D., Tan, Y.C.: A rule-based expert system to decide on direction and speed of a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys 2018. AISC, vol. 868, pp. 822–832. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01054-6_57
20. Sanders, D.A., Haddad, M., Tewkesbury, G.E., Thabet, M., Omoarebun, P., Barker, T.: Simple expert system for intelligent control and HCI for a wheelchair fitted with ultrasonic sensors. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 211–216. IEEE, August 2020. pp. 822–832
21. Sanders, D.A., Haddad, M., Tewkesbury, G.E., Thabet, M., Omoarebun, P., Barker, T.: Simple expert system for intelligent control and HCI for a wheelchair fitted with ultrasonic sensors. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 211–216. IEEE (2020)
22. New Control Paper
23. Haddad, M., et al.: Use of the analytical hierarchy process to determine the steering direction for a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys. AISC, vol. 1252, pp. 617–630. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_46
24. Haddad, M.J., Sanders, D.A.: Selecting a best compromise direction for a powered wheelchair using PROMETHEE. IEEE Trans. Neural Syst. Rehabil. Eng. 27(2), 228–235 (2019)
25. Sanders, D.A., Gegov, A., Haddad, M., Ikwan, F., Wiltshire, D., Tan, Y.C.: A rule-based expert system to decide on direction and speed of a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys. AISC, vol. 868, pp. 822–838. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-01054-6_57
26. Haddad, M., et al.: Intelligent control of the steering for a powered wheelchair using a microcomputer. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys. AISC, vol. 1252, pp. 594–603. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_44
27. Haddad, M., Sanders, D., Ikwan, F., Thabet, M., Langner, M., Gegov, A.: Intelligent HMI and control for steering a powered wheelchair using a Raspberry Pi microcomputer. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 223–228. IEEE, Bulgaria (2020)
28. Haddad, M., et al.: Intelligent system to analyze data about powered wheelchair drivers. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys. AISC, vol. 1252, pp. 584–593. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_43
29. Haddad, M., Sanders, D., Langner, M., Omoarebun, P., Thabet, M., Gegov, A.: Initial results from using an intelligent system to analyse powered wheelchair users’ data. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 241–245. IEEE, Bulgaria (2020)
30. Sanders, D., Haddad, M., Tewkesbury, G., Bausch, N., Rogers, I. and Huang, Y.: Analysis of reaction times and time-delays introduced into an intelligent HCI for a smart wheelchair. In: 2020 IEEE 10th International Conference on Intelligent Systems (IS), pp. 217–222. IEEE, Bulgaria (2020)
31. Sanders, D., et al.: Introducing time-delays to analyze driver reaction times when using a powered wheelchair. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys. AISC, vol. 1252, pp. 559–570. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_41
Negative Emotions Induced by Non-verbal Video Clips
Flavia De Simone, Simona Collina, and Manuela Nuzzo
Università degli Studi Suor Orsola Benincasa, Napoli, Italy
Abstract. The present study pursued a dual purpose: the primary aim was to test the role of non-verbal cues in eliciting emotions; the second aim was to verify whether the participants’ ability in communication decoding was related to an unconscious coherent emotional activation. Participants were invited to observe a mute video of a job interview. Three different conditions were considered: one clip where the interview was characterized by racial prejudice, one with sexual prejudice and a control condition. An ad hoc questionnaire was administered to test participants’ observations. Participants’ facial expressions were also recorded. Results showed that participants were highly able to detect the emotional content of a communication by concentrating only on non-verbal signals, and that a corresponding unconscious emotional activation occurred. Both findings are framed in the most recent theoretical debates. Keywords: Emotion elicitation · Non-verbal communication · Video clips
1 Introduction Is it possible to induce negative emotions in humans by means of non-verbal video clips? This research question is highly relevant for psychologists interested in non-verbal communication. Non-verbal communication is a field of study that includes the majority of the channels through which a human being can communicate. Knapp [8] described seven categories of non-verbal behavior research as related to communication. The first category is kinesics, commonly named ‘body language’. A second category is paralanguage, which involves content-free vocalizations and patterns associated with speech. A third category includes physical contact. A fourth category is proxemics, which involves the organization of space and distances. A fifth category concerns the physical characteristics of people. Related to physical characteristics is the category of artefacts or adornments. Environmental factors are included in the last category and deal with the influences of the physical context in which the behavior occurs. This study is focused on Knapp’s first category, kinesics, which includes movements of the hand, arm, head, foot, and leg, postural shifts, gestures, eye movements, and facial expressions. Non-verbal behavior can be imagined as a continuum from unconscious automatic reactions (the heart rate acceleration in case of fear) to controlled and planned behaviors (the use of a conventional gesture during a conversation). Non-verbal channels are the
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 817–822, 2022. https://doi.org/10.1007/978-3-030-82193-7_55
F. De Simone et al.
privileged route for emotional communication. Emotional communication can be a conscious process, but most of the time emotion is induced and expressed in an unconscious manner [4]. The topic of emotion induction is also prominent in affective computing, the branch of artificial intelligence dedicated to emotions. This field can be divided into three main application areas:
1. Emotion expression: the purpose of researchers in this area is to equip computers with interfaces able to reproduce emotional expressions;
2. Emotion recognition: the scope of these studies is to make machines able to recognize the emotional states of users and adapt, facilitating human performance, as the emotional state plays a fundamental role in human actions;
3. Emotion manipulation: this line of research aims at studying the ways in which the human emotional state can be influenced in the interaction with the machine.
In the affective computing domain, understanding how emotions can be elicited in human beings is fundamental to developing emotionally interactive systems. As suggested by Hazer [7], emotion recognition also requires a valid emotion elicitation method to train the classifier. The use of video clips to induce emotional states and its advantages have been extensively described in Gilman et al. [5]. In this study participants passively watched three mute video clips. Emotion elicitation was evaluated by means of a facial expression analysis whereas emotion recognition was assessed by means of a questionnaire.
2 Experiment The aim of the present study was to test the role of non-verbal cues in eliciting emotions. In particular, the study was aimed at evaluating the ability of viewers to detect the emotional content of a conversation between actors staying out of context and concentrating only on non-verbal signals. In addition, the emotional state of viewers was monitored by means of a face analysis software to verify if the ability in communication decoding was related to an unconscious coherent emotional activation.
3 Method 3.1 Participants Thirty students, twenty-four females and six males, volunteered for the experiment. Participants were students at the University of Naples Suor Orsola Benincasa. The age range was between 18 and 36. All participants had normal or corrected-to-normal vision. None of the participants received money or course credits for participation.
3.2 Materials A silent video was used as experimental material. The video, centered on the theme of the job interview, is composed of three different scenes: one showing racial prejudice, with the candidate discriminated against by the recruiter because of his skin colour; one focused on sexual prejudice, where a male recruiter causes difficulties for a female candidate; and a control clip in which a serene job interview is represented. It is important to underline that in the original scenes there was verbal communication between the recruiters and the candidates; the verbal component of the communication was eliminated precisely in order to verify the effectiveness of the non-verbal communication channel even in a context that is not exclusively non-verbal. The three scenes had comparable lengths. The order of the scenes was randomized to generate different video clips, as shown in Table 1. If a participant saw Video 1, the following participant was presented with Video 2, and so on; in that way the collected data were not affected by learning or fatigue effects resulting from a fixed order of scene presentation.
Table 1. Scene order in the three different video clips
Video | Sequence of scenes of prejudice in job interview
1 | Racial - Sexual - Baseline
2 | Sexual - Baseline - Racial
3 | Baseline - Racial - Sexual
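The three orders in Table 1 are cyclic rotations of one another, and the assignment rule described above (Video 1, then Video 2, and so on) is a round-robin. A minimal sketch of that scheme follows; the function names are illustrative assumptions, and zero-based participant indices are used for convenience.

```python
SCENES = ["Racial", "Sexual", "Baseline"]


def rotated_orders(scenes):
    """Return every cyclic rotation of the scene list (one per video clip)."""
    return [scenes[i:] + scenes[:i] for i in range(len(scenes))]


def video_for_participant(participant_index, orders):
    """Assign clips in turn: participant 0 gets Video 1, participant 1 gets Video 2, ..."""
    return orders[participant_index % len(orders)]


if __name__ == "__main__":
    orders = rotated_orders(SCENES)
    for number, order in enumerate(orders, start=1):
        print(f"Video {number}: {' - '.join(order)}")
    # With 30 participants, each of the three clips is shown to 10 people.
    shown = [tuple(video_for_participant(p, orders)) for p in range(30)]
    print({order: shown.count(order) for order in set(shown)})
```

With this rotation each scene appears once in each serial position across the three clips, which is what protects the pooled data from order effects.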
3.3 Procedure Participants were tested individually in a quiet room. They were instructed by the experimenter and were also invited to carefully read the instructions on the computer screen. They were asked to look closely at the silent video. Subjects were informed that data would be collected anonymously and that sensitive, identifiable data would not be analyzed or disclosed for the purposes of the study. Once oral and informed consent had been obtained from participants, the experimental procedure could start. During the task, participants’ facial expressions were recorded so that emotional states could be recognized with FaceReader, a software package for facial analysis that detects facial expressions. FaceReader has been trained to classify expressions in one of the following categories: happy, sad, angry, surprised, scared, disgusted, and neutral. Apart from neutral, these emotional categories have been described by Ekman [1–3] as the basic or universal emotions. Before the experiment started, we used the automatic calibration procedure for each participant with live camera input to tailor the analysis of facial expressions to a specific person. The individual calibration function enables the correction of person-specific biases towards a certain emotional expression. To evaluate the decoding ability of participants, at the end of the video they were asked to complete an ad hoc questionnaire indicating the type of behavior they observed and the non-verbal cues they mostly focused on. The questionnaire consisted
of two open-ended questions. Specifically, participants were asked to indicate what was the behavior of the participants in the different scenes that made up the video clip and what elements of the scene they focused on to formulate their response.
4 Results Responses to the questionnaire were analyzed in terms of frequencies. 21 out of 30 participants correctly responded to the first question, indicating that they witnessed scenes where non-verbal behaviors revealed an attitude of prejudice. This result was analyzed by means of a Chi-square test: Chi squared equals 4.800 with 1 degree of freedom. The two-tailed P value equals 0.0285. By conventional criteria, this difference is considered to be statistically significant. Of the 21 participants who had correctly answered the first question, all but one responded to the second question by asserting that they focused their attention primarily on the actors’ facial expressions; 17 out of 21 participants indicated that they also observed the actors’ posture. FaceReader data were analyzed in terms of the most elicited emotions during the observation of the sequences that made up the video. To compute this measure, we summed the number of times each of the six basic emotions identified by Paul Ekman (anger, disgust, fear, happiness, sadness, and surprise) was recognized by the software in any participant for each scene in the video. The obtained data are shown in Fig. 1: the graph shows the activation of the six basic emotions per scene. On the basis of FaceReader data, racial prejudice elicited mainly disgust (upper lip is raised, upper teeth may be exposed, nose is wrinkled, cheeks are raised), sexual prejudice aroused mostly anger (the eyebrows are lowered and drawn together, vertical lines appear between the eyebrows, lower lid is
[Figure 1: bar chart titled "Emotions per Scenes". Categories: Anger, Disgust, Fear, Happiness, Sadness, Surprise. Series: Racial Prejudice Scene, Sexual Prejudice Scene, Baseline Scene.]
Fig. 1. The graph shows the frequency of recognition of the six basic emotions by the FaceReader software for the scenes of the video.
tensed, eyes are in hard stare or bulging, lips can be pressed firmly together, with corners down, or in a square shape as if shouting, nostrils may be dilated, the lower jaw juts out). The frequencies of Anger and Disgust in the three Scenes were analyzed by means of a statistical Chi square test: the chi-square statistic is 7.0364. The p-value is 0.029653. The result is significant at p < .05.
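The two Chi-square results reported in this section can be checked with a few lines of standard-library Python. The sketch rests on two assumptions that the text does not state explicitly: that the questionnaire test is a goodness-of-fit test of 21 correct versus 9 incorrect answers against an even 15/15 split, and that the Anger/Disgust test across the three scenes has 2 degrees of freedom. Both assumptions reproduce the reported values. The function names are illustrative.

```python
import math


def chi_square_gof(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_p_value(statistic, dof):
    """Upper-tail p-value, using the exact closed forms for 1 and 2 d.o.f."""
    if dof == 1:
        return math.erfc(math.sqrt(statistic / 2))
    if dof == 2:
        return math.exp(-statistic / 2)
    raise ValueError("closed form implemented only for 1 or 2 degrees of freedom")


if __name__ == "__main__":
    # Questionnaire: 21 correct and 9 incorrect, against an assumed 15/15 split.
    stat = chi_square_gof([21, 9], [15, 15])
    print(f"questionnaire: chi2 = {stat:.3f}, p = {chi_square_p_value(stat, 1):.4f}")
    # Anger and Disgust across three scenes: reported statistic, assumed 2 d.o.f.
    print(f"emotions: chi2 = 7.0364, p = {chi_square_p_value(7.0364, 2):.6f}")
```

For other degrees of freedom a statistics library would be the usual choice; the closed forms are used here only to keep the check free of dependencies.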
5 Conclusion The present study had both practical and theoretical purposes: the main objective was to test the role of non-verbal communication in eliciting emotions; the second objective was to test the interplay between the participants’ ability in communication decoding and the unconscious emotional activation consistent with the context. To these aims, we ran a laboratory experiment in which 30 participants were engaged in passively watching a silent video. While watching the video, participants were videotaped and their facial expressions were analyzed using FaceReader emotion recognition software. At the end of the viewing, in addition to the emotional state of the participants, their understanding of the dynamics presented in the video was also assessed using an ad hoc questionnaire. The purpose of this dual assessment procedure was to test the parallelism between automatic processes of emotional induction and deliberate processes of information processing and comprehension. Consistent with the previous literature [5], the results showed that it is possible to induce negative emotions in human beings by means of mute video clips. In addition, the data suggest an interplay between encoding and decoding processes even when subjects are observers and not actors in communication. These findings are particularly relevant for the psychological debate, opening new theoretical scenarios in the field of non-verbal communication [6], and for Affective Computing, a field of study in continuous search of valid methods of research and development [7]. Further experiments are needed to test the consistency of the results and to investigate positive emotion induction.
References
1. Ekman, P.: An argument for basic emotions. Cogn. Emot. 6(3–4), 169–200 (1992)
2. Ekman, P.: Are there basic emotions? (1992)
3. Ekman, P.: Basic emotions. In: Handbook of Cognition and Emotion, vol. 98, no. 45–60, p. 16 (1999)
4. Geltner, P.: Emotional Communication: Countertransference Analysis and the Use of Feeling in Psychoanalytic Technique. Routledge, London (2012)
5. Gilman, T.L., et al.: A film set for the elicitation of emotion in research: a comprehensive catalog derived from four decades of investigation. Behav. Res. Methods 49(6), 2061–2082 (2017). https://doi.org/10.3758/s13428-016-0842-x
6. Gordon, R.A., Druckman, D.: Nonverbal behaviour as communication: approaches, issues, and research. In: The Handbook of Communication Skills, pp. 81–134. Routledge (2018)
7. Hazer, D., Ma, X., Rukavina, S., Gruss, S., Walter, S., Traue, H.C.: Emotion elicitation using film clips: effect of age groups on movie choice and emotion rating. In: Stephanidis, C. (ed.) HCI International 2015 - Posters’ Extended Abstracts: International Conference, HCI International 2015, Los Angeles, CA, USA, August 2–7, 2015. Proceedings, Part I, pp. 110–116. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-21380-4_20
8. Knapp, M.L., Harrison, R.P.: Observing and recording nonverbal data in human transactions (1972)
Automatic Recognition of Key Modulations in Symbolic Musical Pieces Using Information Theory
Michele Della Ventura
Department of Music Technology, Music Academy “Studio Musica”, Treviso, Italy
Abstract. Extracting information about musical attributes such as harmony and key modulation from the symbolic music text is a critical process in Music Information Retrieval (MIR) systems. This research paper presents a new approach to automatically detect key modulations in a musical piece considered at its symbolic level. The method does not entail any preset rules with respect to the musical grammar. The Markov process is used in order to identify the relations between the sounds and their movement within the harmonic structure. Information theory is used to define the tonality (or key) when sounds are changed by means of accidentals (sharp, flat or natural). The method has been tested on tonal polyphonic compositions and gave encouraging results, with an accuracy ranging between 82% and 91% depending on the historical period of the analyzed composition. Future improvements of the method are discussed briefly at the end of the paper. Keywords: Chord segmentation · Information theory · Markov process · Modulation · Symbolic music
1 Introduction A musical composition is characterized by two elements: melody and harmony. Harmonic analysis represents the most complex phase of the study of a composition and consists in determining the real structure of the piece: harmony is made up of the succession of chords on which the melody is built. Harmonic analysis implies the identification of the chords, their functions within a musical phrase, the bonds that determine their concatenation and the passage from one key to another [1, 2]. Composers often change keys during a piece of music to delineate its structure: this change of key, which helps create interest and tension, is called modulation. These elements are often associated with certain emotions, stylistic belonging, historic period, and sometimes even particular authors [3]. Modulation is one of the most important technical-expressive tools that can be used in the context of tonal music in order to vary its substances and articulate its forms and contents [4]. Modulation represents the dynamism of the tonality [5].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 823–836, 2022. https://doi.org/10.1007/978-3-030-82193-7_56
M. D. Ventura
The change of key means that the hierarchical framework of the sounds and of the chordal entities highlights different directions and dependencies, in more or less clear contrast with the previous ones, depending on the tangent tones [5] (see Fig. 1): connections that were first primary become secondary, previously transitory chords now function as central pivots of the musical discourse, the sounds that were accessories are placed in particular prominence - and so on. This "divergence of similar orders" (the contrast between roles, functions, hierarchies and colours of the present in relation to those just abandoned) constitutes the essence of the modulation. According to Charles-Henri Blainville (1711–1769) “Modulation is the essential part of the art. Without it there is little music, for a piece derives its true beauty not from the large number of fixed modes which it embraces but rather from the subtle fabric of its modulation” [6].
Fig. 1. Modulations among keys.
Modulations usually occur between closely related keys (see the example in Fig. 1, with the keys highlighted with the red line), which include the relative key and the keys with one accidental more or less in their signatures [1]: “C” is the tonic, “a” is the relative key, “G” and “e” are the keys with one more accidental, and “F” and “d” are the keys with one less accidental. To achieve harmonic variety, composers often use the “change of mode”, which implies an extended passage in the parallel key [1]: a passage from a major key to its parallel minor key (for example, from “C major” to “c minor”), changing only the third degree of the first key (see the example in Fig. 1, with the keys highlighted with the green line). Composers often use a change of mode to prepare for a modulation to a key that is distantly related to the original key, but is closely related to its parallel major or minor (see the example in Fig. 1, with the keys highlighted with the blue line) [1]. Musical grammar provides the composer with other tools to change key, but explaining all the rules of musical grammar is beyond the scope of this paper. Instead, the paper illustrates a method for identifying modulation in a musical piece (considered at its symbolic level) that is independent of any rule.
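The notion of closely related keys can be made concrete with a small lookup on the circle of fifths. The sketch below is only an illustration of the definition given above (the relative key plus the keys whose signatures differ by one accidental) and is not part of the proposed method; the table layout and function names are assumptions. Following the convention used in the text, upper-case names are major keys and lower-case names are minor keys.

```python
# Keys ordered by signature, from 7 flats (index 0) to 7 sharps (index 14).
MAJOR_KEYS = ["Cb", "Gb", "Db", "Ab", "Eb", "Bb", "F", "C",
              "G", "D", "A", "E", "B", "F#", "C#"]
MINOR_KEYS = ["ab", "eb", "bb", "f", "c", "g", "d", "a",
              "e", "b", "f#", "c#", "g#", "d#", "a#"]


def signature(key):
    """Number of sharps (positive) or flats (negative) in the key signature."""
    table = MAJOR_KEYS if key[0].isupper() else MINOR_KEYS
    return table.index(key) - 7


def closely_related(key):
    """Relative key, then the keys with one accidental more and one less."""
    sig = signature(key)
    related = []
    for s in (sig, sig + 1, sig - 1):
        if -7 <= s <= 7:  # stay within the conventional key signatures
            for name in (MAJOR_KEYS[s + 7], MINOR_KEYS[s + 7]):
                if name != key:
                    related.append(name)
    return related


if __name__ == "__main__":
    print(closely_related("C"))
```

For "C" this yields the relative key "a", then "G" and "e", then "F" and "d", which are the keys listed in the text.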
Automatic Recognition of Key Modulations in Symbolic Musical Pieces
This paper presents an original approach to the automatic recognition of key modulations, based on the information theory. A key is defined by first identifying all the possible keys in which a single chord can be found, and then by calculating the information content of each individual key based on the sounds present in the musical score. This paper is organized as follows. In the next section there is a brief overview about the state of the art of automatic recognition of key modulation. Section 3 describes a theoretical background of the most important concepts needed to understand the concept of modulation. Section 4 describes the information theory and the approach used in this research. Sections 5 and 6 focus respectively on the experimental results, and on conclusions and future works.
2 Related Work
The possibility of detecting key modulations by means of computer algorithms has attracted a lot of interest in the last few decades [7–14]. The first notable approach is the tone-profile technique introduced by Krumhansl and Kessler [7]: a histogram that highlights the different pitches in a key. Lerdahl introduced the tonal pitch space [8], which can be used as a pitch profile for key-finding algorithms and to compare chords and keys. The approach based on different pitch profiles to identify the key of a musical composition was also used by Albrecht and Shanahan [9] and by Nápoles López et al. [10]. Other research on key detection investigates the detection of the global key, or home key [11–13]. Korzeniowski and Widmer [12] proposed a global key-finding algorithm based on a convolutional neural network (CNN); Chen and Su [13] proposed an algorithm based on a recurrent neural network (RNN). Feisthauer et al. [14] introduced a new way to model modulations with the help of features based on musicological knowledge, as well as an algorithm estimating the tonal plan of a piece. This paper aims to present a different approach to the automatic recognition of key modulations, based on information theory. Starting from the definition of meaning provided by Cohen [15] (“…anything acquires meaning if it is connected with, or indicates, or refers to, something beyond itself, so that its full nature points to and is revealed in that connection”), the proposed method carries out a segmentation of a musical composition in order to identify the key modulations [16]. Based on the sounds of each segment, the possible keys are identified. The sounds make up the alphabet (set of symbols) [17] from which the information value (entropy) of each possible key is calculated [18]: the key with the highest information value is considered the key of the segment (see Sect. 4).
The next paragraph presents an overview of the concepts of musical scale and chord, which are fundamental for the recognition of a key.
M. D. Ventura
3 Key and Modulation
Modulation represents the dynamism of the tonality. The tonality is the set of relationships that link a series of notes and chords to the central note, called the tonic [1, 19]. The basis of these relationships is the (musical) scale [20, 21]. A scale is a set of notes with a specific pattern of tones, or whole steps (W), and semitones, or half steps (H). The formula for forming a major scale is
W-W-H-W-W-W-H (Fig. 2a)
whole step - whole step - half step - whole step - whole step - whole step - half step
and the formula for forming a (harmonic) minor scale is
W-H-W-W-H-W½-H (Fig. 2b)
whole step - half step - whole step - whole step - half step - whole step and a half - half step
Fig. 2. Example of a major and (harmonic) minor scale.
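The two step formulas can be turned into scales mechanically. The following is a minimal sketch, assuming pitch classes numbered 0 to 11 (C = 0), which ignores the enharmonic spelling that the paper's alphabet keeps distinct; the names `build_scale`, `MAJOR_STEPS` and `HARMONIC_MINOR_STEPS` are illustrative and not taken from the paper.

```python
# Step patterns in semitones: W = 2, H = 1, W½ = 3.
MAJOR_STEPS = [2, 2, 1, 2, 2, 2, 1]           # W-W-H-W-W-W-H
HARMONIC_MINOR_STEPS = [2, 1, 2, 2, 1, 3, 1]  # W-H-W-W-H-W½-H


def build_scale(tonic, steps):
    """Return the seven pitch classes (0-11) of the scale built on `tonic`."""
    scale = [tonic % 12]
    for step in steps[:-1]:  # the last step only leads back to the tonic
        scale.append((scale[-1] + step) % 12)
    return scale


c_major = build_scale(0, MAJOR_STEPS)                    # C D E F G A B
a_harmonic_minor = build_scale(9, HARMONIC_MINOR_STEPS)  # A B C D E F G#
print(c_major, a_harmonic_minor)
```

Working with pitch classes keeps the sketch short; a version faithful to the paper would have to carry note spellings (C# versus Db) through the construction.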
Appendix A shows all tonal scales (major and minor). The tonalities of tonal music compositions are determined on the basis of the relationship of the sounds with the major and/or minor scales, that is to say, on the basis of the belonging of the sounds to a major and/or minor scale [20, 21]. In the example of Fig. 2, the sounds C-E-G belong to the C-major scale, but the sounds A-C-E belong to both scales, C-major and a-minor. In a polyphonic composition, to identify the tonality it is necessary to consider the chords that the sounds of the various voices compose in each movement. A chord is a combination of two or more notes, played at the same time, which are related by thirds: the three members of a chord are the root, third and fifth (see Fig. 3). The sounds of a chord must belong to a specific tonality or scale.
Fig. 3. Example of chord.
Figure 4 shows an example of segmentation of a musical composition that consists in identifying the chords for each movement.
Fig. 4. Harmonic segmentation.
3.1 Harmonic Analysis
One of the fundamental aims of musical analysis is the identification of the tonality and modulations of a composition. This is an important step because it makes it possible to define the chords and their functional roles: the function describes the harmonic relationship between a chord and the other chords, and it influences a musician's interpretation. Based on the above concepts, this section illustrates the method used for the harmonic analysis in order to identify the tonalities in a musical composition.
Fig. 5. Harmonic segmentation of an excerpt of Bach's chorale No. 1.
The method consists of three steps. The first step consists in segmenting the musical piece by scanning the sounds, starting from the beginning of the piece, in order to identify the sounds that are altered compared to the previous ones (see Fig. 5). The altered sound represents the beginning of a new musical segment that will extend to the chord preceding a new altered sound (see Fig. 5b). This procedure repeats itself up to the end of the musical piece.
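The first step can be illustrated with a small sketch. It rests on one possible reading of “altered compared to the previous ones”, namely that a note letter appears with a different accidental than the last time it was seen in the current segment. This reading, the function name `segment_on_alterations` and the input format (a list of chords, each a list of note names such as "F#") are assumptions, not details given in the paper.

```python
def segment_on_alterations(chords):
    """Split a chord sequence into segments that each start at an altered sound."""
    segments, current, seen = [], [], {}
    for chord in chords:
        # A sound counts as altered if its letter was already seen in this
        # segment with a different accidental (e.g. F followed by F#).
        altered = any(n[0] in seen and seen[n[0]] != n[1:] for n in chord)
        if altered and current:
            segments.append(current)
            current, seen = [], {}
        current.append(chord)
        for n in chord:
            seen[n[0]] = n[1:]
    if current:
        segments.append(current)
    return segments


chords = [["C", "E", "G"], ["F", "A", "C"], ["D", "F#", "A"], ["G", "B", "D"]]
for seg in segment_on_alterations(chords):
    print(seg)
```

In this toy input the F# in the third chord opens a new segment, which then extends to the end because no further alteration occurs.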
After that, for each segment obtained in the previous step, the chord of the first movement is identified (see Fig. 6). The sounds of the chord are ordered so that they are related by thirds, and repeated sounds are eliminated. The harmonic operator [22] can be indicated with V : → P( ), and is an operator defined as follows: $V_T(h)\colon A_T = A_T \cup \{h\}$ if $T(h) = T$; otherwise $V_T(h)\colon I(A_T)$. In the example of Fig. 6, the sounds are (from the bottom up) A-E-A-C#; after eliminating the repeated sound (A), the remaining sounds A-E-C# are ordered by thirds, giving A-C#-E.
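The ordering by thirds in the Fig. 6 example can be sketched as follows. This is an illustrative helper, not the paper's harmonic operator: it works on note letters only, assumes no two notes of the chord share a letter, and picks as root the note that gives the most compact stack of thirds. The name `order_by_thirds` is hypothetical.

```python
LETTERS = "CDEFGAB"


def order_by_thirds(notes):
    """Drop repeated notes and order the rest as a stack of thirds."""
    unique = list(dict.fromkeys(notes))  # remove repeats, keep first occurrences

    def position(root, note):
        # Letter distance above the root (0-6), converted to the number of
        # stacked thirds: 2 letters -> 1 third, 4 letters -> 2 thirds, ...
        steps = (LETTERS.index(note[0]) - LETTERS.index(root[0])) % 7
        return (steps * 4) % 7

    # The root is the note for which the chord spans the fewest thirds.
    root = min(unique, key=lambda r: max(position(r, n) for n in unique))
    return sorted(unique, key=lambda n: position(root, n))


print(order_by_thirds(["A", "E", "A", "C#"]))
```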
Fig. 6. Harmonic segmentation of a single musical segment.
In the last step, the scales in which the three sounds are present are identified. If the sounds belong to a single scale, this immediately defines the tonality. If the sounds belong to more than one scale (in the example of Fig. 6, the sounds A-C#-E are present in the following scales: A major, D major, d minor, E major, c# minor), the tonality is defined by calculating the information value of each scale on the basis of the sounds present in the segment considered in the analysis. The scale with the highest information value represents the tonality of the segment. The next paragraph describes the information theory and the approach used in this research in order to define the tonality of a musical segment in the case there are more than one possibility.
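The search for the scales that contain all the sounds of a chord can be sketched as below. The sketch assumes pitch classes (so enharmonic spellings are merged) and uses only major and harmonic minor scales; under these assumptions the chord A-C#-E yields the five tonalities listed above. The names `candidate_keys`, `NOTE_TO_PC` and `MODES` are illustrative.

```python
NOTE_TO_PC = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
              "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}
PC_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
# Scale degrees as semitone offsets from the tonic.
MODES = {"major": [0, 2, 4, 5, 7, 9, 11],
         "minor": [0, 2, 3, 5, 7, 8, 11]}  # harmonic minor


def candidate_keys(notes):
    """Return every major / harmonic minor key whose scale contains all notes."""
    wanted = {NOTE_TO_PC[n] for n in notes}
    keys = []
    for tonic in range(12):
        for mode, offsets in MODES.items():
            scale = {(tonic + o) % 12 for o in offsets}
            if wanted <= scale:
                keys.append(f"{PC_NAMES[tonic]} {mode}")
    return keys


print(candidate_keys(["A", "C#", "E"]))
```

When the list has a single entry the tonality is settled; otherwise the information value has to decide between the candidates.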
4 Information Theory
The analysis based on information theory sees music as a linear process supported by a syntax of its own [17]. However, it is not a syntax formulated on the basis of musical grammar rules, but rather one based on the occurrence probability of every single element of the musical message with respect to the preceding element [23]. The first problem of information theory is how to define, and therefore measure, the amount of information emitted by a source [18, 24], also known as the entropy of the source. Two simple and intuitively desirable conditions are imposed: the less a message is expected (i.e. likely), the higher the amount of information it carries, and the information of a pair of independent messages is the sum of the respective amounts of information. Under these conditions, Shannon defines the information of a specific message $x_i$ having the probability $P(x_i)$ [17] as the nonnegative amount

$$I(x_i) = \log_2 \frac{1}{P(x_i)}$$

Given that we want to define the information or entropy of the source, we must take into account the average value, over the whole alphabet, of $I(x_i)$:

$$H(X) = E[I(x_i)] = \sum_{i=1}^{n} I(x_i) \cdot P(x_i) = \sum_{i=1}^{n} P(x_i) \cdot \log_2 \frac{1}{P(x_i)}$$
Information is measured by the randomness of the choices possible in a given situation. If a situation is highly organized and the possible consequents in the pattern process have a high degree of probability, the information (or entropy) is low. If, however, the situation is characterized by a high degree of shuffledness so that the consequents are more or less equi-probable, then information (or entropy) is said to be high [17, 23]. To compare the recognized tonalities with one another, the entropy of each scale is calculated. This calculation necessarily implies the reference to a specific alphabet. For the definition of the tonality the various notes were classified as symbols of the alphabet (Table 1).
Table 1. Example of an alphabet related to the musical segment of Fig. 5a.
Note   Number of notes
C
C#     3
Db
D      2
Cx
D#
Eb
E      6
E#
F
F#     1
Gb
G
Fx
G#     1
Ab
A      8
Gx
A#
Bb
B      1
Cb
B#
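As a worked illustration of the entropy formula, the sketch below computes H(X) from a table of note counts. The counts are those read from Table 1, with each number attributed to the note name immediately preceding it in the extracted table (an assumption about the original layout); the function name `entropy` is illustrative.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of the distribution given by symbol counts."""
    total = sum(counts.values())
    # P(x_i) = c / total, so log2(1 / P(x_i)) = log2(total / c).
    return sum((c / total) * math.log2(total / c)
               for c in counts.values() if c > 0)


# Note counts as read from Table 1 (assumed attribution, see above).
segment_counts = {"C#": 3, "D": 2, "E": 6, "F#": 1, "G#": 1, "A": 8, "B": 1}
print(round(entropy(segment_counts), 3))
```

Symbols with a zero count contribute nothing to the sum, which is why the empty rows of the alphabet can simply be omitted.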
After defining the alphabet, a transition matrix is created (Table 2) in order to calculate the entropy of each scale: it is necessary to take into account the manner in which the notes follow one another within a single musical segment. To do this, a Markov stochastic process is used: a continuous sequence of states of a process in which the probability of passing from one state to another in a unit of time depends only on the immediately preceding state and not on the overall “history” of the system.
Table 2. Example of transition matrix related to the musical segment of Fig. 5a.
[Transition matrix: rows and columns are indexed by the 23 note names of Table 1 (C, C#, Db, D, Cx, D#, Eb, E, E#, F, F#, Gb, G, Fx, G#, Ab, A, Gx, A#, Bb, B, Cb, B#); each cell holds the number of transitions between the corresponding pair of notes. The cell positions are not recoverable.]
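A first-order transition table of the kind described above can be built directly from a note sequence. The sketch below also computes the conditional entropy of the next note given the previous one, which is one plausible way to derive an information value from such a matrix; the paper does not spell out its exact formula, so this choice and the names `transition_counts` and `conditional_entropy` are assumptions.

```python
import math
from collections import Counter, defaultdict


def transition_counts(sequence):
    """Count how often each note is immediately followed by each other note."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[prev][nxt] += 1
    return counts


def conditional_entropy(counts):
    """H(next | previous) in bits, estimated from the transition counts."""
    total = sum(sum(row.values()) for row in counts.values())
    h = 0.0
    for row in counts.values():
        row_total = sum(row.values())
        for c in row.values():
            # P(prev, next) * log2(1 / P(next | prev))
            h += (c / total) * math.log2(row_total / c)
    return h


melody = ["A", "A", "B", "B", "A"]  # toy sequence, not taken from the paper
table = transition_counts(melody)
print(dict(table), conditional_entropy(table))
```

Using nested counters rather than a fixed 23 × 23 array means the table grows only with the notes that actually occur in the segment.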
By means of the alphabet table and the transition matrix it is possible to calculate the information value of each scale and to define the tonality of the musical segment as the one with the highest information value. Table 3 shows the results of the analysis for the segment of Fig. 5a, from which it can be deduced that the tonality of the segment is A major.
Table 3. Example of analysis results.
Possible tonalities   Information value
A major               1,93402906
D major               1,71799789
d minor               1,34299789
E major               0,88633641
c# minor              0,03914517
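The final selection is a simple maximum over the candidate tonalities. A minimal sketch, using the values of Table 3 as printed (with decimal commas); the helper name `parse_decimal_comma` is illustrative.

```python
def parse_decimal_comma(text):
    """Convert a number written with a decimal comma into a float."""
    return float(text.replace(",", "."))


# Information values as printed in Table 3.
information_values = {
    "A major": "1,93402906",
    "D major": "1,71799789",
    "d minor": "1,34299789",
    "E major": "0,88633641",
    "c# minor": "0,03914517",
}

best_key = max(information_values,
               key=lambda k: parse_decimal_comma(information_values[k]))
print(best_key)
```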
The same procedure is repeated for each musical segment identified: the numerical values relating to the alphabet and the transition matrix are (reset and) updated according to the musical notes present in the new segment under analysis.
5 Application and Analysis
The method proposed in this article for identifying the tonality and the modulations of a musical composition was verified by implementing an algorithm whose structure takes into account the aspects described above. The algorithm was tested on a set of 200 musical scores from different historical periods (from the 17th to the 19th century), written in chorale form (4 voices). The algorithm performs the analyses without referring to the rules of musical grammar (Strength): the only remark concerns the ordering by thirds of the notes of the chords, which only serves to make the results easier to read and has no effect on them. To define the possible keys of a musical segment, the algorithm does not take into consideration any scheme of relationships between near and distant tonalities (Strength) (see Fig. 1): this allows a broad-spectrum analysis, which also includes enharmonic modulations (Strength). Two sounds are enharmonic when they have the same frequency but different names (such as C# and Db); the chord of C# major is therefore the same as the chord of Db major (see Fig. 7). This explains why the alphabet table and the transition matrix list the enharmonic notes separately.
Fig. 7. Example of enharmonic modulation.
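The relation between spelled notes and sounding pitch can be made concrete with a short sketch. It assumes equal temperament and the accidental symbols used in the alphabet of Table 1 (# for sharp, b for flat, x for double sharp); the function names `pitch_class` and `are_enharmonic` are illustrative.

```python
NATURAL_PC = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
ACCIDENTAL_SHIFT = {"#": 1, "b": -1, "x": 2}  # sharp, flat, double sharp


def pitch_class(name):
    """Map a spelled note such as 'C#', 'Db' or 'Fx' to a pitch class 0-11."""
    pc = NATURAL_PC[name[0]]
    for symbol in name[1:]:
        pc += ACCIDENTAL_SHIFT[symbol]
    return pc % 12


def are_enharmonic(a, b):
    """True if two differently named notes sound the same."""
    return a != b and pitch_class(a) == pitch_class(b)


print(are_enharmonic("C#", "Db"), are_enharmonic("Cx", "D"))
```

Keeping the spelled name as the alphabet symbol and using the pitch class only for comparison preserves the distinction the method relies on.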
Finally, the algorithm does not impose any limit on the dimensions of the alphabet table and of the transition matrix, which are automatically dimensioned in every single analysis on the basis of the notes of the analyzed musical segment (Strength). The method proposed in this article depends on an accurate and complete basic data set, represented by the major and minor musical scales built on the different sounds of the tempered scale (see Appendix A). There is no need to enter any settings to perform the analysis (Strength).
5.1 Results
The strength of the proposed method is that it is a free approach in which harmonic categories emerge naturally from the data. The results of the tests are shown in Table 4: they show a high degree of similarity between the results provided automatically by the algorithm and those obtained by hand. The results, obtained on musical compositions of different periods, are satisfactory; an interesting observation is that the accuracy is higher for the music of the 17th century (91%) than for the music of the 19th century (82%). This is due to the fact that composers in the 19th century made extensive use of altered chords to give more tension to the harmony.
Table 4. Accuracy of the method.
Historical period   Accuracy
17th Century        91%
18th Century        89%
19th Century        82%
These preliminary results are encouraging, since they show that the proposed method can be used on data sets (musical scores) without simplifications (such as the elimination of melodic figures) [19]. However, the proposed method has two weaknesses that affect the results of the analyses:
1. if a musical segment is composed of few movements (i.e., few notes), the tonality identified through the information value may not be correct: with few notes, the Markov principle that “the probability of each event depends only on the state attained in the previous event” loses its effectiveness;
2. if the musical segment contains notes altered according to the principles of musical grammar (such as the Neapolitan sixth chord or the augmented sixth chord), the algorithm may fail to define a key.
6 Discussion and Conclusions
The goal of this work was to build a tool able to segment a polyphonic musical piece (considered at its symbolic level) in order to automatically identify the key modulations. The algorithm was developed while trying to maintain the necessary balance between the need to provide clear results and the awareness that the entire value of a piece of music cannot be coded with rational concepts. The method proposed in this article depends on an accurate and complete basic data set, represented by the major and minor musical scales built on the different sounds of the tempered scale. The method yields satisfactory results compared to techniques based only on the rules of musical grammar formalized at a mathematical level. In particular, information theory and the concept of entropy have been shown to have great potential in this field as well. Even though the main drawback of this method is its dependence on the creation of a specific alphabet, it is robust with respect to the compositional characteristics of the musical piece. In order to improve the quality of modulation identification, the next step may be to add to the algorithm a data set based on the construction of the different types of chords on the individual degrees of the musical scale. This would make it possible to consider the altered chords (such as the Neapolitan sixth chord or the augmented sixth chord) typical of both modes, major and minor.
Appendix A
Major scales and their relative minor scales; each minor scale is given in its melodic and harmonic forms.
Major key   Relative minor key (melodic, harmonic)
C major     a minor
G major     e minor
D major     b minor
A major     f# minor
E major     c# minor
B major     g# minor
F# major    d# minor
C# major    a# minor
F major     d minor
Bb major    g minor
Eb major    c minor
Ab major    f minor
Db major    bb minor
Gb major    eb minor
Cb major    ab minor
References
1. Schoenberg, A.: Theory of Harmony (trans. Carter, R.E.). University of California Press, Berkeley (1978)
2. de la Motte, D.: Harmonielehre. Bärenreiter, Kassel (1976)
3. Davies, S.: Philosophical perspectives on music’s expressiveness. In: Juslin, P.N., Sloboda, J.A. (eds.) Music and Emotion: Theory and Research, pp. 23–44. Oxford University Press, Oxford (2001)
4. Koelsch, S., Fritz, T., v. Cramon, Y., Müller, K., Friederici, A.D.: Investigating emotion with music: an fMRI study. Human Brain Mapping 27(3), 239–250 (2006)
5. Pallesen, K.J., et al.: Emotion processing of major, minor and dissonant chords: an fMRI study. The Neurosciences of Music. Annals of the New York Academy of Sciences, vol. 1060, pp. 450–453 (2005)
6. Chevaillier, L.: Les théories harmoniques. In: Lavignac, A., de la Laurencie, L. (eds.) Encyclopédie de la musique et Dictionnaire du Conservatoire, Part 2, vol. 1, pp. 519–590. Delagrave, Paris (1925)
7. Krumhansl, C.L., Kessler, E.J.: Tracing the dynamic changes in perceived tonal organisation in a spatial representation of musical keys. Psychol. Rev. 89(2), 334–368 (1982)
8. Lerdahl, F.: Tonal pitch space. Music Percept. 5, 315–350 (1988)
9. Albrecht, J., Shanahan, D.: The use of large corpora to train a new type of key-finding algorithm: an improved treatment of the minor mode. Music Percept. Interdisc. J. 31(1), 59–67 (2013)
10. Nápoles López, N., Arthur, C., Fujinaga, I.: Key-finding based on a Hidden Markov Model and key profiles. In: Digital Libraries for Musicology (DLfM 2019), pp. 33–37 (2019)
11. Chai, W., Vercoe, B.: Detection of key change in classical piano music. In: Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR), pp. 468–473 (2005)
12. Korzeniowski, F., Widmer, G.: End-to-end musical key estimation using a convolutional neural network. In: Proceedings of the 25th European Signal Processing Conference (EUSIPCO), pp. 966–970 (2017)
13. Chen, T., Su, L.: Functional harmony recognition of symbolic music data with multitask recurrent neural networks. In: International Society for Music Information Retrieval Conference (ISMIR 2018), pp. 90–97 (2018)
14. Feisthauer, L., Bigo, L., Giraud, M., Levé, F.: Estimating keys and modulations in musical pieces. In: Spagnol, S., Valle, A. (eds.) Sound and Music Computing Conference (SMC 2020), Torino, Italy, June 2020
15. Cohen, M.R.: A Preface to Logic, p. 47. Henry Holt & Co., New York (1944)
16. Meyer, L.B.: Meaning in music and information theory. J. Aesthetics Art Criticism 15(4), 412–424 (1957)
17. Weaver, W., Shannon, C.: The Mathematical Theory of Communication. University of Illinois Press, Urbana (1964)
18. Della Ventura, M.: Speech assessment based on entropy and similarity measures. In: Le Thi, H.A., Le, H.M., Pham Dinh, T., Nguyen, N.T. (eds.) ICCSAMA 2019. AISC, vol. 1121, pp. 218–227. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-38364-0_20
19. Coltro, B.: Lessons of Complementary Harmony. Zanibon, Padua (1979)
20. Wilding-White, R.: Tonality and scale theory. J. Music Theory 5(2), 275–286 (1961)
21. Forte, A.: Tonal Harmony in Concept and Practice. Holt, Rinehart and Winston, London (1974)
22. Ni, Y., McVicar, M., Santos-Rodriguez, R., De Bie, T.: An end-to-end machine learning system for harmonic analysis of music. IEEE Trans. Audio Speech Lang. Process. 20(6), 1771–1782 (2012)
23. Angeleri, E.: Information, Meaning and Universality. UTET, Turin (2000)
24. Ventura, M.D.: Voice separation in polyphonic music: information theory approach. In: Iliadis, L., Maglogiannis, I., Plagianakos, V. (eds.) AIAI 2018. IAICT, vol. 519, pp. 638–646. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-92007-8_54
Increasing Robustness for Machine Learning Services in Challenging Environments: Limited Resources and No Label Feedback
Lucas Baier¹ (✉), Niklas Kühl¹, and Jörg Schmitt²
¹ Karlsruhe Institute of Technology, Karlsruhe, Germany, [email protected]
² Robert Bosch GmbH, Gerlingen, Germany
Abstract. The importance of deployed machine learning solutions has increased significantly in the past years due to the availability of data sources, computing capabilities and convenient tooling. However, technical challenges such as limited resources and computing power arise in many applications. We consider a scenario where a machine learning model is deployed in an environment where all computations need to be performed on a local computing unit. Furthermore, after deployment, the model does not receive any ground truth labels as feedback. We develop a two-step prediction method which combines an outlier detection with a robust machine learning model. This approach is evaluated based on a data set from a large German OEM. We can show that the prediction performance is increased significantly with our approach while fulfilling the restrictions in terms of memory and computational power. This way, we contribute to the practical applicability of machine learning models for real-world applications.
Keywords: Machine learning deployment · Concept drift · No labels · Limited resources · Robustness
1 Introduction
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022. K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 837–856, 2022. https://doi.org/10.1007/978-3-030-82193-7_57
Due to the explosion of data in recent years, supervised machine learning plays an important role in nearly all fields of business, ranging from marketing to science-, health- and security-related applications [14]. Many companies rely on deployed machine learning models for increasing process efficiency or for introducing innovative services [30]. Besides the increased availability of data, this growth in popularity can also be explained by a massive increase in computation power in recent years [22]. However, there are also areas of application for supervised machine learning where computational resources are strictly limited. This especially applies to machine learning models in production systems. For instance, companies might generally restrict the connection of sensitive data and machine learning models to the infrastructure of cloud providers due to the fear of losing data and the corresponding intellectual property [37]. In other cases, it might simply be technically infeasible to connect specific parts or components to cloud services. In both cases, it is necessary to rely on the available local computing resources. Applications on mobile devices are a typical example of this case [26]. Therefore, we regard resource considerations as one constraint motivating this work. The second constraint relates to a machine learning model in operation which does not receive any ground truth labels for the predictions that it issues [29,36]. This problem needs to be considered in the context of data streams, because basically all models which are deployed in productive information systems receive input data as data streams and continuously issue predictions over months or even years for those very data instances. Unfortunately, data streams usually evolve over time, leading to changes in the underlying data patterns [40] which require an adequate adaptation of the prediction model [9]. However, the usual adaptation strategies for model updates [31] cannot be applied because no feedback with regard to the prediction performance is received [17]. Both constraints are expressed in the central research question of the work at hand:
RQ: How to increase robustness for supervised machine learning services without label feedback and limited infrastructure resources?
We introduce a novel prediction method artifact (“machine learning service”) which addresses both mentioned constraints: limited computational resources and no availability of true labels after deployment. To achieve this, our artifact combines an outlier detection with a robust machine learning model. We assess available outlier detection models as well as prediction models with regard to their suitability based on several criteria.
Subsequently, we select appropriate models for our prediction method. Our suggested method is evaluated based on a use case from a large global automotive supplier. We show that we can increase prediction performance significantly with our method compared to normal prediction models while at the same time restricting necessary memory and computation time requirements. By the introduction of the method, this work aims at contributing to increase the acceptance of machine learning solutions for real-world applications. So far, many companies are still skeptical about trusting deployed machine learning solutions for automated decision-making in production environments [38]. The remainder of this work is structured as follows: The upcoming Sect. 2 presents foundations on which we base our research. Section 3 introduces the artifact requirements, while Sect. 4 gives an overview of different options to fulfill these specified requirements. Section 5 describes the evaluation of our approach with an industry use case. Section 6 discusses our results, describes implications and outlines future research.
Machine Learning Services in Challenging Environments
2 Foundations
To lay the necessary foundations for the remainder of this work, we first briefly introduce machine learning followed by an overview on concept drift and outlier detection.
2.1 Machine Learning
Traditionally, machine learning approaches are divided into supervised, unsupervised as well as reinforcement learning [3]. Supervised machine learning depends on labeled examples in the training data. In contrast, unsupervised machine learning aims at detecting unknown relationships and patterns in the data. In reinforcement learning, an agent learns to interact with its environment by receiving specific rewards. We focus on supervised approaches, as most real-world applications of machine learning are of supervised nature [22]. However, many machine learning models need to be operated in environments with limited resources. For instance, the training process of machine learning models with gradient-descent optimization has been analyzed to improve computational efficiency [19]. Other approaches in literature aim to reduce the computational requirements for calculating the actual prediction in operation, e.g. for the application of support vector machines for activity recognition [5] or for computer vision tasks in cars [6]. Specialized distributed approaches have been developed for the training and deployment of machine learning on network edges [33]. A lot of research in this context has been dedicated to the application of machine learning in wireless sensor networks [4,21] where computations have to be performed on small independent and distributed sensors [3]. Additional challenges arise when machine learning models are deployed on local and isolated computing units such as mobile devices. Challenges for mobile devices are a high power consumption as well as limited computational power and memory storage [26].
2.2 Concept Drift
Companies usually apply machine learning models to make predictions for specific services on a stream of unseen incoming data. However, data tends to evolve and change over time. Therefore, predictions issued by models which have been trained on past data may become less accurate or opportunities for performance improvements might be missed [41]. In computer science, the challenge of changing data streams and its implications for machine learning are described with the term concept drift [34]. A concept $p(X, y)$ is the joint probability distribution over a set of input features $X$ and the target variable $y$. The definition of concept drift refers to a change of a concept between two time points $t_0$ and $t_1$ with $p_{t_0}(X, y) \neq p_{t_1}(X, y)$ [17]. An example for an application with concept drift over time is a machine learning service which monitors the output quality in a chemical production process and predicts corresponding failures [41]. Sensors connected to the machine will generate the necessary data input. However, sensors wear out over time, leading to different data measurements. The previously deployed machine learning model is not prepared for this drift as this data pattern has not been included in the training data set. Thus, correct quality predictions are difficult to make in the long run [23]. Detecting concept drift usually relies on a continuous evaluation of the prediction performance over time [17]. Less popular approaches focus on the computation of drift detection features on the input data [12]. Examples for concept drift handling can be found in various domains such as mobility [8] or fraud detection [15].
2.3 Outlier Detection
Outliers are described as data instances that appear to be inconsistent with the remaining data instances in the same sample [10]. They are also referred to as abnormalities or anomalies. The detection of outliers is an important task in many different application scenarios. These range from fraud detection to the detection of unauthorized intruders in computer networks up to fault diagnosis in complex machine parts such as detecting aircraft engine rotation defects [20]. Different outlier detection methods are usually divided into unsupervised and supervised methods based on the availability of true labels with regard to the outlierness of data instances [2]. In real-world settings, the detection of outliers typically brings along many challenges [20]. For instance, data contains natural noise which is similar to actual outliers and therefore it is difficult to distinguish between noise and outliers. Furthermore, true outlier labels for data instances are rarely available, which complicates the proper validation of different outlier detection methods [13]. This fact is one of the reasons for the popularity of unsupervised approaches. Of those, especially distance- or neighborhood-based techniques have gained a lot of popularity because they are relatively parameter-free and easy to adapt [27]. However, the computational complexity during application is tremendous. In contrast, statistical approaches allow for easier computation due to their model-based structure, but they rely on an assumption regarding the underlying data distribution which often does not hold true [13]. There are numerous reviews of existing, mostly unsupervised, outlier detection methods (e.g. [13,20]) which we use as input for the development of our prediction method.
3 Problem Definition and Requirements
In this work, we consider a machine learning application with two specific problems. First, no ground truth labels during deployment can be acquired and therefore monitoring the prediction performance is impossible. This can happen for instance when models are deployed to enhance the functionality of a small entity which is integrated into a larger system (e.g. a large machine consisting of several different components). Due to the embedding into the larger system, necessary sensors cannot be installed and no ground truth labels can be acquired.
Machine Learning Services in Challenging Environments
Still, even in this case, machine learning models are exposed to concept drift, which needs to be addressed. These facts are summarized in problem 1 (P1).
P1: No ground truth labels are available after deployment of the machine learning model.
Second, the necessary computations for the execution of the prediction method need to be performed on a local computing unit. Neither the integration of a more powerful computing unit nor the connection of the prediction to cloud services is a feasible solution. Therefore, only limited resources are available, which is captured in problem 2 (P2), a typical problem in real-world applications of machine learning [25,32,35].
P2: The prediction method needs to run on a local computing unit.
Based on the problem characteristics (P1, P2), we derive the following design requirements (R) for the prediction method.
R1: The developed method must show robust behaviour to concept drift and continuously deliver acceptable prediction performance. Ground truth labels for the predictions issued by the method in operation are not accessible.
R2: The prediction method needs to operate with limited storage space for saving necessary data (e.g. model weights and other parameters) as the storage space on the local computing unit is limited (memory requirement).
R3: The computational complexity of the prediction method is constrained by the computing power of the local computing unit. This requirement refers to the number of operations that need to be carried out for a prediction during operation and is therefore closely related to time constraints for the computation of a prediction.
Fig. 1. Operating principle of the prediction method
The design of the prediction method is depicted in Fig. 1. The individual components are designed to ensure that the overall method meets the requirements (R1-R3). At first, streaming data (sensor measurements) is transferred to a computing unit. As described above, no ground truth labels for any of the predictions can be acquired. Therefore, it is impossible to continuously adapt the model to data changes. Instead, we suggest controlling the input data for the prediction model (step 1). An outlier detection is implemented before the prediction model, which ensures the validity of the new incoming data.
L. Baier et al.
Data validity implies that new data is similar to the data that the prediction model has been trained on. Data instances that exhibit unusual patterns compared to the training data are filtered out and no prediction is issued for them. In this case, the whole system relies on the previous prediction. With regard to time series, this behavior is acceptable because data instances are highly auto-correlated, i.e. we do not expect any discontinuities in the prediction target. In general, it is preferable to receive a prediction for every new data instance. However, it is better to receive no prediction than an entirely false prediction. Otherwise, the control unit of the whole system will adapt its behavior based on this false prediction and trigger corresponding actions which might have negative effects. Due to this dependency, it is better to accept the limitation that predictions are only issued for normal data instances. We prefer to skip adaptations of the whole system at some time points instead of performing a false adaptation based on a false prediction with unforeseeable consequences. In case the prediction method is constantly detecting outliers, an alert will be created. The second step in the prediction method is the application of the prediction model (step 2). In case a data instance is not marked as an outlier, the relevant data is transmitted to the prediction model. One evaluation criterion for the prediction model needs to be the achieved prediction performance. Additionally, it is crucial that the prediction model computes robust predictions, which means that the prediction performance does not change significantly over time. According to R2 and R3, potential methods for both step 1 and step 2 need to have low computational as well as memory requirements so that they can be deployed on the local computing unit.
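The control flow described in this section (check validity first, predict only for valid inputs, otherwise reuse the previous prediction and raise an alert when outliers persist) can be sketched as follows. This is a minimal illustration and not the implementation used in the use case: the class name `GatedPredictor`, the `alert_after` limit and the two placeholder lambdas are assumptions introduced here.

```python
from typing import Callable, Optional, Sequence


class GatedPredictor:
    """Two-step service: outlier check first, prediction only for valid inputs."""

    def __init__(
        self,
        is_outlier: Callable[[Sequence[float]], bool],
        predict: Callable[[Sequence[float]], float],
        alert_after: int = 50,  # hypothetical limit, not taken from the paper
    ) -> None:
        self.is_outlier = is_outlier
        self.predict = predict
        self.alert_after = alert_after
        self.last_prediction: Optional[float] = None
        self.consecutive_outliers = 0

    def step(self, x: Sequence[float]) -> Optional[float]:
        """Return a fresh prediction, or the previous one if x is an outlier."""
        if self.is_outlier(x):
            self.consecutive_outliers += 1
            return self.last_prediction  # fall back to the previous prediction
        self.consecutive_outliers = 0
        self.last_prediction = self.predict(x)
        return self.last_prediction

    @property
    def alert(self) -> bool:
        """True once alert_after outliers have been seen in a row."""
        return self.consecutive_outliers >= self.alert_after


# Toy usage with stand-in components (both lambdas are placeholders).
service = GatedPredictor(
    is_outlier=lambda x: max(x) > 10.0,
    predict=lambda x: sum(x),
    alert_after=2,
)
stream = ([1.0, 2.0], [99.0, 0.0], [50.0, 1.0], [2.0, 2.0])
outputs = [service.step(x) for x in stream]
```

In this toy stream the second and third instances are rejected, so the service repeats the first prediction until a valid instance arrives again.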
4 Design Options
In the following section, we introduce the different available methods which can be implemented to build the two-step prediction method.
4.1 Step 1: Data Validity
The first step in the prediction method (Fig. 1) needs to ensure the data validity of the incoming data instances (R1). This means that the new incoming data in operation must be similar to the training data. The objective of outlier detection is the identification of data instances that are different to other data instances. Therefore, outlier detection is a suitable approach to guarantee validity of new data instances. Since no labelled data regarding the outlierness is available, we rely on unsupervised models in the following. Unsupervised outlier detection models can be differentiated into the following groups [2]: Extreme value models, Clustering models, Distance-based models, Density-based models, Probabilistic models and Information-theoretic models. Considering these groups of algorithms, we need to make a selection that is suitable for the overall prediction method. Distance-based and density-based
models require the storage of all initial training instances in order to function properly in practice. This means that the whole initial training dataset needs to be stored on the computing unit. Due to the memory and computational requirements (cf. design requirements R2 and R3 in Sect. 3), these groups of models are not feasible for the overall approach. However, we will select one representative of those models as a benchmark for the other selected models. Information-theoretic models inherently allow only an indirect comparison of the outlierness of different data instances, which makes their application difficult and also computationally expensive. Aggarwal (2015) [2] suggests that other models should be preferred if they can be applied. Consequently, we constrain the selection of suitable outlier detection models to the groups of clustering as well as distance-based, probabilistic and extreme values models. The individual models will be evaluated in detail in the next section.
4.2 Step 2: Model Robustness
The second step in the proposed approach is the identification of a robust machine learning model. It is desirable that the chosen prediction algorithm is robust to changes in the input data and keeps the same prediction performance over time (R1). This can be interpreted as a second safety measure in addition to the outlier detection. It is crucial that the machine learning model is robust and continues to issue reliable predictions. For the selection of the prediction models, we choose from the prediction models available in the following groups in scikit-learn [28]: Generalized Linear Models, Kernel Ridge Regression, Support Vector Regression, Stochastic Gradient Descent, Neural Network Regression, Ensemble methods.
5 Evaluation
We evaluate the overall prediction method with data from a large global automotive supplier. The objective is to optimize the performance of an integrated engine part with a small computing unit within a vehicle. It is necessary to measure the flow quantity of a liquid [mm³] as accurately as possible for good overall driving performance of the engine. The estimation is crucial to optimally align the behaviour of other components in the vehicle. However, from a technical perspective, it is difficult to install a sensor in the affected engine component for the direct measurement of the liquid's quantity. Furthermore, such a sensor is very costly and economically not feasible. Therefore, surrounding sensors in other parts of the vehicle are used as input to estimate the flow quantity. These consist of a high-frequency pressure sensor as well as a sensor in a feed pump which measures the number of revolutions per minute. We apply a supervised machine learning approach in order to predict the flow quantity. After the engine component is built into the vehicle during production, it is impossible to measure the flow quantity (no ground truth labels). However, flow quantity measurements can be derived on a specialized test bench which is used
for research and development activities. This gives us the opportunity to train and evaluate the proposed prediction method. The test bench allows us to simulate the real-life usage of the corresponding vehicle. The final prediction method needs to exhibit good performance and, at the same time, high robustness to outliers. Furthermore, model updates can only be performed when the vehicle is brought to the garage for maintenance work. Therefore, a robust prediction model which does not require regular adaptations needs to be implemented. Restrictions with regard to the available infrastructure for computing the predictions within the vehicle also need to be taken into account. Due to space limitations as well as shock resistance requirements, the built-in computing unit is rather small. This leads to two additional challenges: First, the computing power of the computing unit is limited. Therefore, the computation of the prediction itself should not be very complex since otherwise the computing unit will take too long to issue a prediction. The same limitation applies to the outlier detection. However, this does not restrict the initial training of the model, which can be performed outside of the local computing unit, e.g. on a large cluster. Second, there is only a limited amount of storage space available within the computing unit. This restricts not only the size of the prediction model but also leads to limitations for the outlier detection, e.g., it is impossible to store a large data set of acceptable data instances on the computing unit.
Fig. 2. Normalized pressure trajectory.
The data provided by the company is split into two data sets. Data set 1 consists of data instances which are the result of a systematic testing of different operation points which are occurring in real driving behavior of the vehicle. This data set is therefore used as training and validation set in order to optimize the chosen prediction method. In total, it consists of 9,992 data instances. Data set 2 consists of a random sample of data instances with different operating modes and contains 13,354 data instances in total. This is used as a test set to examine the performance of the chosen prediction method. Each data instance has 21 variables. One variable is the number of revolutions of the feed pump. Additionally, there are 20 variables which are related to the trajectory of the pressure values. A normalized pressure trajectory is depicted
in Fig. 2. Every pressure trajectory is related to exactly one flow quantity, which is the prediction target.
5.1 Evaluation of Data Validity (Step 1)
The first step in the proposed prediction method is a dedicated outlier detection approach. As explained in Sect. 4, we focus on the following classes of outlier detection algorithms: extreme values, clustering, distance-based and probabilistic models. All outlier detection models require the definition of an outlier threshold in order to identify outliers. In this work, the thresholds are derived from the behavior of the outlier detection models on the training set. Thresholds are defined by considering the 95%-quantile of the outlier distance measure on the training set instances. This way, we also define some of the data instances in the training set as outliers. We do not have any information regarding the real outlierness of the data instances in the training set, but we need to define a threshold to identify outliers during operation of the prediction approach. Due to the potentially critical impact of outliers on the behavior of the prediction model, we prefer to falsely classify a few normal data instances as outliers rather than risk missing real outliers. For the extreme values model, we compute the data distribution over all data instances in the training set for each input feature. Subsequently, we derive the thresholds for each input feature based on the training data. However, we need to adjust the quantile value q for each input feature, since applying q = 0.95 to each of the 21 features leads to overly strict thresholds, as $0.95^{21} \approx 0.341$. Therefore, the quantile for each input feature is updated to $q = e^{\log_e(0.95)/21} \approx 0.997$. This value of q is evenly distributed to low and high values for each input feature. All data instances which have a measurement that is either higher or lower than one of the thresholds are considered outliers. From the clustering models, we apply the k-means algorithm with Euclidean distance on the training data in order to identify suitable clusters [24].
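The per-feature quantile adjustment for the extreme values model can be illustrated with a short sketch. It assumes that "evenly distributed to low and high values" means the remaining probability mass 1 - q is split equally between a lower and an upper threshold; that reading, the function names and the synthetic training data are assumptions made here for illustration only.

```python
import numpy as np


def fit_extreme_value_thresholds(X_train, overall_q=0.95):
    """Per-feature lower/upper thresholds with a quantile adjusted for d features."""
    X_train = np.asarray(X_train, dtype=float)
    d = X_train.shape[1]
    q = np.exp(np.log(overall_q) / d)   # per-feature quantile, so that q**d == overall_q
    tail = (1.0 - q) / 2.0              # assumption: equal mass in both tails
    lower = np.quantile(X_train, tail, axis=0)
    upper = np.quantile(X_train, 1.0 - tail, axis=0)
    return q, lower, upper


def is_extreme_value_outlier(x, lower, upper):
    """Flag an instance if any feature lies outside its thresholds."""
    x = np.asarray(x, dtype=float)
    return bool(np.any((x < lower) | (x > upper)))


rng = np.random.default_rng(0)
X_train = rng.normal(size=(9992, 21))   # synthetic stand-in for data set 1
q, lower, upper = fit_extreme_value_thresholds(X_train)
```

Only the two threshold vectors need to be kept on the computing unit; the training data itself is not required at prediction time.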
Parameter k is determined by applying the elbow criterion, which leads to the selection of k = 3. We then use this parameter to apply a clustering on the training data with three clusters. Subsequently, for each of the clusters, the distance to the corresponding cluster centroid is computed for every data instance in the training set. We order these distances by size and determine the 95%-quantile of these distances per cluster as a threshold. kNN [2] is chosen as the representative of the distance-based models since it is the most commonly used algorithm from this group. We choose the value of k based on the size of the training set N. Therefore, k is computed as follows: $k = \sqrt{N} = \sqrt{9992} \approx 100$. For each data instance in the training set, the Euclidean distance to its k-th neighbor is computed. Those distances are ordered by size and the corresponding 95%-quantile is derived as threshold. With regard to the probabilistic models, we compute the Mahalanobis Distance (MD) [16], which is defined via the covariance matrix of the entire training data set. The MD between a data instance and the mean value of all data instances can be understood as a metric for the outlierness of a data instance.
In comparison to the Euclidean Distance, the MD also takes into account the covariance structure of the data and thereby normalizes the influence of each input feature on the overall distance computation. If a specific threshold for the MD is fixed in a two-dimensional space, we determine an ellipse around normal data instances which allows us to differentiate outliers from non-outliers. The threshold value is based on the 95%-quantile of the MD in the training set. We call this approach confidence ellipse in the following. As a fifth outlier detection method, we combine clustering with probabilistic models. Instead of fitting just one ellipse around the data instances in the training set, a k-means clustering with k = 3 is performed. Afterwards, an ellipse is estimated around each of the identified clusters. The threshold is computed based on the training data. This approach is called clustering and confidence ellipses in the following. To evaluate the five outlier detection methods, we have defined two distinct evaluation criteria: The first criterion is the amount of memory on the computing unit which is necessary to enable the outlier detection method (R2). We approximate the memory requirements by deriving how many floating-point numbers (FPN) need to be stored. The second criterion is the necessary computation time in operation (R3). This refers to the calculations which need to be carried out in order to determine the outlierness of a new data point. For the remainder of this work, we differentiate between simple operations (SO) and complex operations (CO). Simple operations refer to mathematical operations such as multiplications, additions or root calculations. In contrast, complex operations refer to the computation of more sophisticated functions such as tanh or the logistic function. Both criteria refer to the limited computing resources in our use case.
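A minimal sketch of the confidence ellipse approach is given below: the mean and covariance are estimated on the training set, and the 95%-quantile of the training Mahalanobis distances serves as the outlier threshold. The function names, the use of a pseudo-inverse and the synthetic data are assumptions for illustration and are not taken from the use case.

```python
import numpy as np


def mahalanobis(X, mean, cov_inv):
    """Mahalanobis distance of each row of X to the training mean."""
    diff = np.atleast_2d(np.asarray(X, dtype=float)) - mean
    squared = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(squared, 0.0))  # clip tiny negative rounding errors


def fit_confidence_ellipse(X_train, quantile=0.95):
    """Estimate mean, inverse covariance and the MD threshold on the training set."""
    X_train = np.asarray(X_train, dtype=float)
    mean = X_train.mean(axis=0)
    cov = np.cov(X_train, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse guards against a singular covariance
    threshold = np.quantile(mahalanobis(X_train, mean, cov_inv), quantile)
    return mean, cov_inv, threshold


# Synthetic, correlated stand-in data (three features instead of 21).
rng = np.random.default_rng(1)
A = np.array([[1.0, 0.8, 0.0], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
X_train = rng.normal(size=(2000, 3)) @ A
mean, cov_inv, threshold = fit_confidence_ellipse(X_train)

train_rate = float(np.mean(mahalanobis(X_train, mean, cov_inv) > threshold))
far_point_flagged = bool(mahalanobis([25.0, -25.0, 25.0], mean, cov_inv)[0] > threshold)
```

The clustering and confidence ellipses variant would repeat the same fit once per k-means cluster and compare a new instance against the ellipse of its nearest centroid.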
Unfortunately, it is impossible to evaluate the general performance of the various outlier detection models at this point. In general, the evaluation of outlier models is based on some kind of ground truth. However, neither the company nor we possess any ground truth labels with regard to the outlier detection, and it is impossible to automatically derive the labels from available data. One solution is manual labelling by a domain expert; however, in our case with over 23,000 data instances, this approach is infeasible. Apart from this approach, there exists no general evaluation approach for outlier detection [39]. Therefore, we decide to evaluate the performance of the outlier detection models by comparing the overall prediction performance of various combinations of outlier detection and prediction models. This is reflected in the results of Sect. 5.3. The preliminary evaluation of the chosen outlier detection approaches is depicted in Table 1. The first row shows the evaluation of the extreme values method [2]. With regard to the storage space requirement, it is necessary to save exactly two values per input feature (one lower and one upper threshold). This results in $2 \cdot 21 = 42$ FPN. The computation time in operation is determined by comparing the 21 values of a newly arriving data instance with the given maximum and minimum thresholds, resulting in 42 Boolean expressions. Subsequently, the results of each of these 42 Boolean expressions need to be evaluated, which results in 84 simple operations in total. For both, memory as
Table 1. Evaluation of different outlier detection approaches.
well as computational requirements, we have indicated approximate values since small changes in the parameter space also lead to changes in these numbers. The memory requirements for kNN are depicted in the third row of the table. Since it is necessary to store the entire training set in memory, we need to have enough storage space for $9{,}992 \cdot 21 = 209{,}832$ FPN. Regarding the computation time, for every stored data instance, it is necessary to compute one subtraction as well as one square computation per input feature and one square root operation per data instance: $9{,}992 \cdot (21 \cdot 2 + 1) \approx 430{,}000$.
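The counts for the extreme values method and for kNN follow directly from the number of features and training instances. The short calculation below re-derives them; the variable names are chosen here for illustration.

```python
N_TRAIN = 9_992    # training instances in data set 1
N_FEATURES = 21    # input features per data instance

# Extreme values: one lower and one upper threshold per feature.
ev_fpn = 2 * N_FEATURES            # stored floating-point numbers
ev_comparisons = 2 * N_FEATURES    # Boolean expressions per new instance
ev_so = 2 * ev_comparisons         # each comparison plus the evaluation of its result

# kNN: the whole training set is stored and scanned for every new instance.
knn_fpn = N_TRAIN * N_FEATURES
# One subtraction and one square per feature, plus one square root per stored instance.
knn_so = N_TRAIN * (N_FEATURES * 2 + 1)

print(ev_fpn, ev_so, knn_fpn, knn_so)
```

The exact kNN operation count is 429,656, which the text rounds to 430,000.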
Fig. 3. Sensitivity of the Mahalanobis Distance regarding a phase shift
The difference in memory requirements between the outlier detection methods and kNN is significant. Even the most complex method (clustering with confidence ellipses) requires only around one percent of the storage space of kNN. Among the four remaining outlier detection methods, confidence ellipse as well as clustering with confidence ellipses require the most memory and computation time. Furthermore, we perform a sensitivity analysis of the outlier detection approach with the creation of artificial outliers. Several types of artificial outliers are created by shifting the minimum and the maximum value of the pressure
trajectory as well as by adding random noise and by introducing a phase shift. An example of an artificial outlier with phase shift is depicted in Fig. 3. This was evaluated manually on single data instances in order to further assess the functionality of the outlier detection model. The left figure shows an original pressure trajectory (in blue) and one which is shifted by six time units (in orange). The right figure shows the corresponding development of the Mahalanobis Distance (MD) with the amount of phase shift on the x-axis. The red horizontal line marks the outlier threshold for the MD. The MD rises significantly with the introduced phase shift. It is interesting to note that the MD decreases again with a phase shift of 18 or 19. The shifted pressure trajectory at this point is again very similar to the original trajectory, since a phase shift of 20 yields exactly the same trajectory.
5.2 Evaluation of Model Robustness (Step 2)
With the outlier detection of step 1 evaluated, we now turn to the evaluation of the prediction model. In order to make a first selection of suitable prediction algorithms, we conducted pretests with the groups of algorithms that we introduced in Sect. 4. For every group, we implemented one prediction model with the standard parameter configuration in scikit-learn [28]. Due to the large number of available options for the generalized linear models, we implemented two prediction models, namely linear regression and polynomial regression (degree = 2). A random train-test split with 30% test data is performed on the training data (data set 1) for the model evaluation. We apply the mean absolute percentage error (MAPE) as evaluation metric because it is the common metric for engineers involved in the use case and provides meaningful and interpretable results [7]. Furthermore, MAPE is independent of scale and therefore comparable among different data sets.
Table 2. Results of pretest with various prediction models.
The results are depicted in Table 2. There are large performance differences between the three best performing prediction models with polynomial regression outperforming the other models by far. However, neural network (NN) and support vector regression (SVR) usually increase their performance significantly
after parameter optimization. Therefore, we include polynomial regression, NN and SVR in the next steps of a more detailed investigation. Subsequently, we perform a grid search [11] with a 3-fold cross-validation [18] on the three prediction models. Since the polynomial regression does not possess any parameters except the degree of the polynomial transformation, we also evaluate the lambda parameter for the ridge shrinkage method on the polynomial regression. The grid search is performed on the entire training set (data set 1). The best parameter combinations for each prediction model are shown in Table 3.
Table 3. Parameter results of grid search.
For the evaluation of the different prediction models, several criteria need to be considered. Similar to the evaluation of the outlier detection methods, it is necessary to consider the memory requirements (R2) as well as the computation time in operation (R3). Additionally, prediction performance (R1) is measured by applying the prediction models on the test set (data set 2). Furthermore, the robustness of the prediction model (R1) is an important evaluation criterion. We measure robustness by randomly splitting the test set (N = 13,354) into 10 batches of equal size and by computing the prediction performance on each of those batches. Subsequently, the range of the MAPE values between the different batches is assessed. For each of the prediction model classes, we implement two instantiations with different parameter settings. One instantiation refers to a simple model (fewer parameters), whereas the other instantiation represents a more complex model (more parameters). This approach allows us to test the sensitivity between model simplicity and prediction performance [1]. As an example, we implement a simple neural network with 4 neurons in the hidden layer and a more complex one with 32 neurons, which was the best number of neurons during grid search. In total, 3 · 2 = 6 different prediction models are evaluated. Similar to the evaluation of the outlier detection models, we give approximate values for the memory and computation time requirements since those values heavily depend on the parameter choice. However, the values in the table provide a good indication of the differences between the various prediction models. The results are depicted in Table 4. The simple neural network consists of only 4 neurons in the hidden layer and the input vector has 21 features.
Table 4. Evaluation of different prediction models.
Therefore, we need to save 21 · 4 = 84 weights and 1 · 4 = 4 biases between the input and the hidden layer as well as 4 weights and 1 bias between the hidden layer and the output layer. With regard to the computation time, it is necessary to perform one multiplication and one addition per weight as well as one addition per bias, leading to ((21 + 1) · 4) · 2 + (4 + 1) · 2 = 186 SO. In each of the neurons, one tanh needs to be evaluated, resulting in 4 CO. The MAPE of 4.51% is indicated in the third column and the range of the MAPE values among the different batches is shown in the last column. The differences in memory requirements and computation time between the prediction models are remarkable. Compared to the simple neural network, the complex SVR is represented by 1,448 support vectors (SV), which results in 1,448 · 22 + 1 = 31,857 FPN to store. With regard to the computational complexity, due to the chosen RBF kernel, it is necessary to compute 1,448 exponential functions (CO) for every new data instance. Compared to the prediction results in the pretest (cf. Table 2 on data set 1), the prediction performance of the different prediction models seems to be discouraging. This is especially true since a parameter optimization has been carried out. Presumably, the test data (data set 2) which is used to compute the MAPE values is more diverse and complex to predict than the training set. This impression is also confirmed by the company, which describes data set 2 as randomly sampled whereas instances for data set 1 were systematically selected. Considering the performance differences between groups of algorithms, SVR now clearly outperforms its counterparts, whereas polynomial regression, which was the best prediction model in the pretest, performs worst. Probably, this behavior can be explained by the fact that SVR contains more parameters which can be tuned in a parameter optimization.
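The storage and operation counts quoted above for the simple neural network and the complex SVR can be re-derived with a few lines. The helper names are hypothetical, the SO expression is copied from the formula as printed in the text, and reading the factor 22 as 21 features plus one coefficient per support vector is an assumption made here.

```python
def mlp_costs(n_inputs, n_hidden):
    """FPN, SO and CO for a network with one hidden tanh layer and one output."""
    weights = n_inputs * n_hidden + n_hidden   # input-to-hidden plus hidden-to-output
    biases = n_hidden + 1
    fpn = weights + biases
    # Simple operations, following the formula as printed in the text.
    so = ((n_inputs + 1) * n_hidden) * 2 + (n_hidden + 1) * 2
    co = n_hidden                              # one tanh per hidden neuron
    return fpn, so, co


def rbf_svr_costs(n_support_vectors, n_inputs):
    """FPN and CO for an RBF-kernel SVR.

    Assumption: each support vector stores its features plus one coefficient,
    and a single intercept is stored in addition.
    """
    fpn = n_support_vectors * (n_inputs + 1) + 1
    co = n_support_vectors                     # one exponential per support vector
    return fpn, co


simple_nn = mlp_costs(n_inputs=21, n_hidden=4)
complex_svr = rbf_svr_costs(n_support_vectors=1448, n_inputs=21)
print(simple_nn, complex_svr)
```

Under these counting rules the simple network needs 93 stored numbers in total, against 31,857 for the complex SVR.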
Interestingly, the performance differences between the simple and the complex model per prediction model class are rather small. This indicates that models with few parameters might also be powerful enough to generalize well in this use case. Regarding the robustness, the MAPE range for a simple SVR is only 0.4%, which indicates a rather robust prediction model. In contrast, the performance of the polynomial regression fluctuates considerably, with a range of 1.7%.
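The robustness measure used in this section, the range of MAPE values over 10 random batches of the test set, can be sketched as follows. The function names and the synthetic targets are assumptions for illustration. Because 13,354 is not divisible by 10, the sketch uses `numpy.array_split`, which yields batches of nearly equal size.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


def mape_range(y_true, y_pred, n_batches=10, seed=0):
    """Range (max - min) of the MAPE over random batches of the test set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.random.default_rng(seed).permutation(len(y_true))
    batches = np.array_split(order, n_batches)
    scores = [mape(y_true[b], y_pred[b]) for b in batches]
    return max(scores) - min(scores), scores


# Synthetic check: a constant relative error of 5% gives a range of zero.
rng = np.random.default_rng(42)
y_true = rng.uniform(10.0, 100.0, size=13354)
y_pred = y_true * 1.05
spread, scores = mape_range(y_true, y_pred)
```

A small range indicates that the model performs similarly on different parts of the test data, which is how the 0.4% and 1.7% values above are to be read.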
5.3 Evaluation of Overall Prediction Method
Having evaluated the two main steps of the proposed method in isolation, we now evaluate the performance of the overall approach with outlier detection and a subsequent prediction model on the test set (data set 2). Table 5 displays the MAPE values for each prediction model (in rows) in combination with each of the chosen outlier detection models (in columns). The first row in the table indicates the number of outliers which are identified in the test set by each of the models. Clustering with confidence ellipses detects only 322 outliers, whereas kNN flags nearly 1,300 instances as outliers. It is important to keep those numbers in mind for the overall evaluation of different combinations. An approach which just marks nearly all data instances as outliers will not be applied in practice.
Table 5. Performance of outlier detection in combination with prediction model (MAPE).
Each prediction model in the table is described by three rows. The first row depicts the prediction performance (MAPE) when no outliers are removed from the test data. The second row refers to the prediction performance when the outliers identified by the detection models are removed from the data set. The third row is introduced to illustrate the effectiveness of the outlier detection models: In this case, we randomly remove the same number of data instances as indicated by the outlier detection models and compute the MAPE based on the remaining data instances. To illustrate this, we will consider the complex neural network in combination with the extreme values approach. The cell “All” is
computed based on all 13,354 data instances, the cell “OutlierDet” is based on 13,354 − 877 = 12,477 data instances and the cell “Random” is also based on 12,477 instances, which are, however, randomly selected from the entire test set. Considering the overall performance, all models improve their performance significantly when the outlier detection is introduced, e.g. the MAPE of the complex neural network improves from 5.05% to 3.06% in combination with the extreme values approach. As expected, the MAPE for randomly selected data instances remains on a similar level (5.03%). Table 6 shows the five combinations of outlier detection and prediction model with the lowest MAPE. Extreme values and clustering with confidence ellipses seem to be promising approaches for the outlier detection. Complex neural network and complex support vector regression perform best among all prediction models. With all this information at hand (Table 1, Table 4, Table 5), an informed choice about the best possible combination can be made. If decision-makers focus on the lowest overall MAPE, extreme values in combination with the complex neural network should be chosen. If it is important that not too many data instances are considered outliers, clustering with confidence ellipses in combination with, for instance, complex support vector regression should be considered. Additionally, one can also consider the memory as well as computation requirements of each combination in the decision-making process.
Table 6. 5 Combinations with the lowest MAPE.
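The three values reported per prediction model in this section ("All", "OutlierDet" and "Random") can be computed along the following lines. This is a sketch under stated assumptions: the function name, the fixed random seed and the synthetic data are introduced here and are not taken from the evaluation itself.

```python
import numpy as np


def evaluate_with_outlier_removal(y_true, y_pred, outlier_mask, seed=0):
    """MAPE on all data, without detected outliers, and without random instances."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    outlier_mask = np.asarray(outlier_mask, dtype=bool)

    def mape(mask):
        ape = np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])
        return float(np.mean(ape) * 100.0)

    # Baseline: remove the same number of instances, but chosen at random.
    n_removed = int(outlier_mask.sum())
    random_mask = np.zeros(len(y_true), dtype=bool)
    chosen = np.random.default_rng(seed).choice(len(y_true), size=n_removed, replace=False)
    random_mask[chosen] = True

    return {
        "All": mape(np.ones(len(y_true), dtype=bool)),
        "OutlierDet": mape(~outlier_mask),
        "Random": mape(~random_mask),
    }


# Synthetic example: 50 of 1,000 instances carry a large error and are flagged.
y_true = np.full(1000, 100.0)
y_pred = np.full(1000, 102.0)
y_pred[:50] = 150.0
flagged = np.zeros(1000, dtype=bool)
flagged[:50] = True
result = evaluate_with_outlier_removal(y_true, y_pred, flagged)
```

If the detector removes the instances that cause large errors, "OutlierDet" drops well below "All", while "Random" stays close to "All".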
Figure 4 shows the distribution of MAPE (y-axis) with the liquid quantity (target) on the x-axis, based on the computation of the complex neural network with clustering and confidence ellipses. Data instances which are marked in red are detected as outliers, whereas data instances marked in blue are considered normal data. The left plot shows that data instances with a high error are also detected as outliers. We can thus identify as outliers those data instances which also lead to high errors during application of the prediction model. This underlines the performance of our prediction method. The right plot zooms in on the middle of the left plot. The algorithm detects some outliers which have a faulty prediction; however, there also seem to be some normal data instances (with a low error) which are detected as outliers. These data instances are especially located at the far right side of the plot. In total, the outlier detection method only detects 2.4% (322/13,354 ≈ 0.024) of data instances as outliers in the test set. The prediction model will not issue
Fig. 4. Error plot with complex neural network
a prediction in this case because this data instance has been marked as outlier beforehand. Other parts of the vehicle can therefore not adapt their behavior because they do not receive the necessary feedback. In this case, the system has to rely on previous predictions and adjust accordingly. This solution is certainly not optimal, but definitely better than adapting to a false prediction which might lead to devastating results. Usually, sudden changes from one measurement to another are not observed in this system and change therefore happens incrementally. This also allows the system to skip some required predictions in between. However, in case of real concept drift in the system which results in a persistent new concept over time, the proposed prediction method will not be of use. Due to the missing label feedback, the system will not be able to adapt to the new concept and will therefore continuously classify new data instances as outliers. We can monitor the percentage of outliers compared to all incoming data instances as a metric for the health of the system. If an increase in the percentage of detected outliers is observed, the driver will be notified with a request to bring the vehicle to the garage where model updates can be performed.
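Monitoring the share of detected outliers among incoming data instances could, for example, be done with a sliding window. The class name, the window length and the notification limit below are hypothetical values chosen for illustration; the paper does not specify them.

```python
from collections import deque


class OutlierRateMonitor:
    """Track the share of outliers among the most recent data instances."""

    def __init__(self, window=1000, max_rate=0.10):  # both values are hypothetical
        self.flags = deque(maxlen=window)
        self.max_rate = max_rate

    @property
    def rate(self) -> float:
        """Share of outliers in the current window (0.0 while the window is empty)."""
        return sum(self.flags) / len(self.flags) if self.flags else 0.0

    def update(self, is_outlier: bool) -> bool:
        """Record one detection result; return True if a notification is due."""
        self.flags.append(bool(is_outlier))
        return self.rate > self.max_rate


# Toy stream: ten normal instances followed by four outliers.
monitor = OutlierRateMonitor(window=10, max_rate=0.3)
results = [monitor.update(flag) for flag in [False] * 10 + [True] * 4]
```

A fixed-size window keeps the memory footprint constant, which fits the storage constraint R2; in practice one would likely wait until the window is full before acting on the rate.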
6
Conclusion
This work introduces a method for designing robust machine learning models in local environments without label feedback. Traditional concept drift adaptation mechanisms [17] cannot be applied in this case because they rely on feedback from true labels. We propose a prediction method (machine learning service) combining a dedicated outlier detection with a subsequent prediction model. Predictions are only issued for data instances similar to the training data. Dissimilar data instances are marked as outliers and no prediction is computed for them, which reduces the probability of receiving erroneous predictions. Infrastructure constraints such as limited computing power and storage space are also considered. We evaluate our approach by applying it to a data set from a large global automotive supplier. We show that the prediction performance of our approach is significantly higher than that of an approach without outlier detection, while the available memory and computation resources are used sparingly.
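The combination of outlier detection and prediction summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: a plain Euclidean distance to the nearest centroid is swapped in for the outlier detection, and the centroids, the threshold `max_distance` and the toy model are hypothetical values chosen for illustration.

```python
import math
from typing import Callable, Optional, Sequence


class GatedPredictor:
    """Issues a prediction only for inputs that resemble the training data."""

    def __init__(
        self,
        centroids: Sequence[Sequence[float]],
        max_distance: float,
        model: Callable[[Sequence[float]], float],
    ) -> None:
        self.centroids = centroids        # e.g. cluster centers of the training data
        self.max_distance = max_distance  # hypothetical similarity threshold
        self.model = model                # any callable regression model
        self.last_prediction: Optional[float] = None

    def is_outlier(self, x: Sequence[float]) -> bool:
        """Flag x when it is farther than max_distance from every centroid."""
        nearest = min(math.dist(x, c) for c in self.centroids)
        return nearest > self.max_distance

    def predict(self, x: Sequence[float]) -> Optional[float]:
        """Return a fresh prediction, or None if x is an outlier."""
        if self.is_outlier(x):
            return None  # the caller can fall back to self.last_prediction
        self.last_prediction = self.model(x)
        return self.last_prediction


service = GatedPredictor(
    centroids=[(0.0, 0.0), (10.0, 10.0)],
    max_distance=2.0,
    model=lambda x: 0.5 * x[0] + 0.5 * x[1],  # toy stand-in for the trained model
)
print(service.predict((1.0, 1.0)))  # close to a centroid: a prediction is issued
print(service.predict((5.0, 5.0)))  # far from both centroids: None
print(service.last_prediction)      # the previous prediction remains available
```

Returning `None` instead of a number makes the missing prediction explicit, so dependent components have to decide deliberately whether to reuse the previous value.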
L. Baier et al.
Our contribution is twofold: First, we introduce a prediction method which combines an outlier detection with a robust prediction model to ensure proper functionality in local areas of application without label feedback. Second, we develop a set of requirements for the evaluation of outlier detection and prediction models in similar settings. For each requirement, we derive and develop corresponding metrics and then apply them to our use case to demonstrate the feasibility of the method.
Regarding the managerial implications, this work can be used as a guideline for practitioners on how to implement machine learning solutions in settings with similar constraints. The developed method makes it possible to assess and evaluate a variety of requirements relevant for prediction approaches implemented in practice. Thereby, this work contributes to ensuring the correctness of machine learning results in industry settings.
Our approach certainly has limitations. The prediction method discards data instances classified as outliers and does not provide a prediction for them. In case of many subsequent outliers, the system will therefore not receive any prediction and will not be able to adapt appropriately. Additionally, we evaluated the approach with only one data set. Even though this data set represents a realistic driving scenario, it was still created on a test bench. Therefore, we cannot evaluate the prediction method in unexpected situations that can arise in reality. Furthermore, it is difficult to assess the performance of the outlier detection method since there are no true labels with regard to the outlierness of data instances. In future work, we aim at performing a field study with a vehicle participating in normal road traffic. Additionally, the proposed approach should be evaluated in a different context in order to prove its effectiveness.
The prediction method presented in this work can help to increase the acceptance of machine learning in real-world contexts with the aim to effectively deploy models in productive environments. So far, many companies are still skeptical about taking automated decisions based on deployed machine learning solutions. This is due to a fear of extreme decisions based on unusual input data. This work shows how existing methods can be combined in order to increase the reliability of machine learning solutions and therefore aims at increasing the practical relevance of this powerful technique.
References
1. Adler, P.S., Clark, K.B.: Behind the learning curve: a sketch of the learning process. Manage. Sci. 37(3), 267–281 (1991)
2. Aggarwal, C.C.: Data Mining. The Textbook. Springer (2015). https://doi.org/10.1007/978-3-319-14142-8
3. Alsheikh, M.A., Lin, S., Niyato, D., Tan, H.P.: Machine learning in wireless sensor networks: algorithms, strategies, and applications. IEEE Commun. Surv. Tutorials 16, 1996–2018 (2014)
4. An, X., Zhou, X., Lü, X., Lin, F., Yang, L.: Sample selected extreme learning machine based intrusion detection in fog computing and MEC. Wireless Commun. Mob. Comput. 2018 (2018)
5. Anguita, D., Ghio, A., Oneto, L., Parra, X., Reyes-Ortiz, J.L.: Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine. In: Bravo, J., Hervas, R., Rodriguez, M. (eds.) International Workshop on Ambient Assisted Living, pp. 216–223. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-35395-6_30
6. Anguita, D., Ghio, A., Pischiutta, S., Ridella, S.: A hardware-friendly support vector machine for embedded automotive applications. In: International Joint Conference on Neural Networks (IJCNN), pp. 1360–1364 (2007)
7. Armstrong, J.S., Collopy, F.: Error measures for generalizing about forecasting methods: empirical comparisons. Int. J. Forecasting 8(1), 69–80 (1992)
8. Baier, L., Hofmann, M., Kühl, N., Mohr, M., Satzger, G.: Handling concept drifts in regression problems–the error intersection approach. In: International Conference on Wirtschaftsinformatik (2020)
9. Baier, L., Kühl, N., Satzger, G.: How to cope with change?-preserving validity of predictive services over time. In: Proceedings of the 52nd Hawaii International Conference on System Sciences (2019)
10. Barnett, V., Lewis, T.: Outliers in Statistical Data. Wiley, New York (1994)
11. Bergstra, J., Yoshua, B.: Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13, 281–305 (2012)
12. Cavalcante, R.C., Minku, L.L., Oliveira, A.L.: FEDD: feature extraction for explicit concept drift detection in time series. In: International Joint Conference on Neural Networks (IJCNN), pp. 740–747 (2016)
13. Chandola, V., Banerjee, A., Kumar, V.: Outlier detection: a survey. ACM Comput. Surv. 14, 15 (2007)
14. Chen, H., Chiang, R., Storey, V.: Business intelligence and analytics: from big data to big impact. MIS Q. 36(4), 1165–1188 (2012)
15. Dal Pozzolo, A., Boracchi, G., Caelen, O., Alippi, C., Bontempi, G.: Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Trans. Neural Netw. Learn. Syst. 29(8), 3784–3797 (2017)
16. De Maesschalck, R., Jouan-Rimbaud, D., Massart, D.L.: The Mahalanobis distance. Chemometr. Intell. Lab. Syst. 40, 1–18 (2000)
17. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. 46(4), 1–37 (2014)
18. Golub, G.H., Heath, M., Wahba, G.: Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics 21, 215–223 (1979)
19. Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with limited numerical precision. In: International Conference on Machine Learning, pp. 1737–1746 (2015)
20. Hodge, V.J., Austin, J.: A survey of outlier detection methodologies. Artif. Intell. Rev. 22, 85–126 (2004)
21. Fei, H., Hao, Q.: Intelligent Sensor Networks: The Integration of Sensor Networks, Signal Processing and Machine Learning. CRC Press, Boca Raton (2012)
22. Jordan, M.I., Mitchell, T.M.: Machine learning: trends, perspectives, and prospects. Science 349(6245), 255–260 (2015)
23. Kadlec, P., Gabrys, B.: Local learning-based adaptive soft sensor for catalyst activation prediction. AIChE J. 57(5), 1288–1301 (2011)
24. Kanungo, T., Mount, D.M., Netanyahu, N.S., Piatko, C.D., Silverman, R., Wu, A.Y.: An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 881–892 (2002)
25. Malikopoulos, A.A., Papalambros, P.Y., Assanis, D.N.: A learning algorithm for optimal internal combustion engine calibration in real time. In: ASME International Design Engineering Technical Conferences, pp. 91–100. American Society of Mechanical Engineers (2007)
26. Oneto, L., Ghio, A., Ridella, S., Anguita, D.: Learning resource-aware classifiers for mobile devices: from regularization to energy efficiency. Neurocomputing 169, 225–235 (2015)
27. Orair, G.H., Teixeira, C.H., Meira Jr., W., Wang, Y., Parthasarathy, S.: Distance-based outlier detection: consolidation and renewed bearing. In: Proceedings of the VLDB Endowment, vol. 3, pp. 1469–1480. VLDB Endowment (2010)
28. Pedregosa, F., Varoquaux, G., Gramfort, A.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
29. Raykar, V.C., et al.: Supervised learning from multiple experts: whom to trust when everyone lies a bit. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 889–896 (2009)
30. Schüritz, R., Satzger, G.: Patterns of data-infused business model innovation. In: Proceedings of IEEE 18th Conference on Business Informatics (CBI), vol. 1, pp. 133–142. IEEE (2016)
31. Tsymbal, A.: The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, 106(2) (2004)
32. Vong, C.-M., Wong, P.-K., Li, Y.-P.: Prediction of automotive engine power and torque using least squares support vector machines and Bayesian inference. Eng. Appl. Artif. Intell. 19(3), 277–287 (2006)
33. Wang, S., et al.: When edge meets learning: adaptive control for resource-constrained distributed machine learning. In: IEEE Conference on Computer Communications, pp. 63–71 (2018)
34. Widmer, G., Kubat, M.: Learning in the presence of concept drift and hidden contexts. Mach. Learn. 23(1), 69–101 (1996)
35. Xie, Y., Kistner, A., Bleile, T.: Optimal automated calibration of model-based ECU-functions in air system of diesel engines. Technical report, SAE Technical Paper (2018)
36. Yu, H.-F., Jain, P., Kar, P., Dhillon, I.: Large-scale multi-label learning with missing labels. In: International Conference on Machine Learning, pp. 593–601 (2014)
37. Zhang, Q., Cheng, L., Boutaba, R.: Cloud computing: state-of-the-art and research challenges. J. Internet Serv. Appl. 1, 7–18 (2010)
38. Zhou, Z.-H.: Machine learning challenges and impact: an interview with Thomas Dietterich. Nat. Sci. Rev. 5(1), 54–58 (2017)
39. Zimek, A., Schubert, E., Kriegel, H.-P.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. ASA Data Sci. J. 5(5), 363–387 (2012)
40. Žliobaitė, I.: Learning under concept drift: an overview. arXiv preprint arXiv:1010.4784 (2010)
41. Žliobaitė, I., Pechenizkiy, M., Gama, J.: An overview of concept drift applications. Big Data Analysis: New Algorithms for a New Society, pp. 91–114 (2016)
Development Support for Intelligent Systems: Test, Evaluation, and Analysis of Microservices
Charline von Perbandt1, Matthias Tyca1, Arne Koschel1, and Irina Astrova2(B)
1 Hochschule Hannover, University of Applied Sciences and Arts, Hannover, Germany
[email protected]
2 Department of Software Science, School of IT, Tallinn University of Technology, Tallinn, Estonia
[email protected]
Abstract. Increasingly, software systems, including all kinds of intelligent systems, are developed on the basis of small so-called microservices. The goal is to make the resulting software system more flexible and easier to handle. Despite the many advantages, such an approach increases the complexity of the system. This also has an immense impact on the procedure and the effort of the analysis, the evaluation and especially the testing of a system. As contributions of this paper, we first discuss the challenges that a microservice architecture brings compared to a monolithic approach. Subsequently, we discuss how the analysis and the testing of individual microservices can be realized. Finally, we present some tools that support the developer in these two tasks. We limit ourselves to an initial analysis of microservices for the development (also) of intelligent systems by means of an example and an overview of some popular development and test strategies, such as those of Google’s development team.
Keywords: Microservices · Development support · Analysis
1
Introduction: Microservices in General
Compared to monolithic software systems, microservices are small, independent services which communicate and work with each other. Together they represent the total system (see Fig. 1). These individual services follow the “principle of the single responsibility” [32], which Robert C. Martin describes as follows: “Grasp things together, which are changed for the same reason and divide things, that are changed for different reasons”. Services are therefore usually small code fragments that focus on individual functionalities. Each of these code fragments is a separate process and can run almost independently of other services.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 857–875, 2022. https://doi.org/10.1007/978-3-030-82193-7_58
C. von Perbandt et al.
Fig. 1. Differences between microservices and monoliths
As a result, it is possible to select a separate technology for each service and thus optimally align the tools that are used with the function to be implemented. It would also be possible to store the data of each service in a different way and to provide the functionality over different interfaces. However, it is recommended to keep the offered interfaces stable once a service has been put into production, as interface changes can otherwise lead to unwanted changes and additional costs for dependent services. Due to the high flexibility of the technology selection and the fact that services are processes of their own, a dedicated development team can be provided for each service. The development teams can work independently of each other, which allows the decoupled development of the individual microservices. A service can thus be brought into production without the need to re-deploy other microservices. Once microservices are in production, they can easily be scaled and, if necessary, replaced or re-created in a very short time. In the best case, a system can even tolerate the complete failure of a microservice, which makes the system quite robust.
By means of a small microservices development case study as well as examples of testing strategies, we aim to show the advantages but also the drawbacks of microservices-based system development (also) for intelligent systems. Our study is limited due to its small size, but since we look at a variety of examples of development and testing strategies, the results should be transferable and useful in practical (at least smaller) microservices-based software development.
The remainder of this article is structured as follows: In the next section we will first discuss the challenges that a microservice architecture brings compared to a monolithic approach. Subsequently, it will be discussed how the analysis (see Sect. 3) and the test (see Sect. 4) of individual microservices could be realized.
Finally, some tools will be presented, which should support the developer with these two tasks (see Sect. 5) and a conclusion is drawn in Sect. 6.
Analysis of Microservice
2
Challenges with Testing Microservices
Nevertheless, a microservice system also brings a few challenges. Many of the advantages mentioned can lead to high complexity when used carelessly. In the worst case, this can make it even more difficult to work with such a system than with a monolith. Because the architecture of microservices resembles that of a distributed application, one of the biggest challenges is the dependencies between the individual services, since only the interaction of the individual services results in the overall functionality. The more dependencies there are and, above all, the more confusing the dependencies are, the more difficult it becomes to isolate the individual services. However, an isolation of the microservices is indispensable in order, for example, to be able to carry out a test which is solely aimed at the functionality of a microservice itself. Besides the dependencies, the free choice of technologies can also lead to difficulties. Different programming languages can require various analysis and testing tools. In addition, there is no single person who understands the entire system, because each development team is usually specialized in one or at most a few microservices. So it could happen that, if an employee (let alone a full microservice team) leaves the company for whatever reason, a complete microservice has to be implemented again, since the other developers do not know the programming language in which the microservice is written. If the functionality of the microservice is poorly documented in such a case, that might well become an issue. In addition, it is difficult to test the system as a whole, since the individual microservices are deployed independently of each other and thus different versions of the services used can be present. Testing the total system with every version combination is too expensive and takes too much effort.
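A short calculation illustrates why testing every version combination quickly becomes impractical. The service names and version counts below are made up purely for illustration.

```python
from math import prod

# Hypothetical example: number of versions of each microservice that may
# be deployed at the same time (names and counts are made up).
deployed_versions = {"orders": 3, "billing": 2, "catalog": 4, "shipping": 2, "auth": 3}

# Every version of a service can meet every version of every other service,
# so the number of system configurations is the product of the counts.
combinations = prod(deployed_versions.values())
print(combinations)
```

Even with only five services and a handful of versions each, the number of configurations already exceeds one hundred, and it grows multiplicatively with every additional service or version.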
All of these aspects have to be considered when it comes to the analysis and the testing of microservices.
3
Analysis of Microservices
Unlike a monolith, a microservice system has more than one possible source of error. Because of the communication between the microservices, it can be a challenge to find and correct an error. To ensure the stability of the overall system, it is necessary to monitor and analyze the microservices. The analysis is carried out with the help of monitoring, which covers both the hosts and the applications themselves [1].
3.1
Collecting Key Figures
One reason for monitoring is the collection of key figures of the system. This is important for two reasons. First, metrics for the execution of speed tests can be obtained from this. With the speed test and the obtained key figures it is possible
to compare the performance of the test system with that of the productive system [2]. Furthermore, it is important to collect key figures of the productive system in order to evaluate where the limits of the system are. This way you can react to these limits and possibly scale the application [1].
3.2
Approach
The monitoring procedure changes depending on the size of the system and the number of services. For a single host running a single service, monitoring is still easy to practice. Key figures such as CPU, working memory or hard disk utilization of the host, as well as the logs of the host, are collected and evaluated with the help of appropriate tools. On the application side, the response times of the service are evaluated.
If the same service is distributed across multiple hosts, this procedure is applied to each host and service. If a load balancer such as Ribbon is used in this case, it should also be monitored; in particular, it should be checked whether the load balancer forwards requests only to active services. If a problem occurs, the source of the error can then be identified quickly. If, for example, a performance problem occurs with all instances of the microservice, the error seems to be due to the application itself. If only one instance has performance problems, the error is probably due to its host, which should be investigated first.
If you have several different services on several hosts, the analysis is more difficult, because the communication between the services has to be considered. In general, the log files of the services have to be collected, aggregated and visualized in this case. In addition, there are special procedures, which are explained below [1].
Semantic Monitoring. In semantic monitoring, pseudo transactions are executed on the system to test its stability. It is important with this procedure that the transactions are not stored in the actual database, but in a separate database. The automated tests that have already been written can be used to perform semantic monitoring [1].
Correlation Id and Tracking. In order to detect errors in the communication between the microservices, it is necessary to examine this communication. For better traceability, a so-called correlation id is assigned in the logs. As shown in Fig. 2, this id is passed through the entire request chain. If service 1 has to communicate with service 2 for a request started at service 1, this request receives, for example, the id 123. In the next step, service 2 sends a request to service 3. The id 123 created at service 1 is also attached to this request. If there is a problem with the request, the request chain can now be traced and analyzed [1,3].
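The propagation of a correlation id through a request chain can be sketched as follows. The three services are modeled as plain functions rather than networked processes, and the header name `X-Correlation-ID`, the helper `ensure_correlation_id` and the log format are assumptions made for illustration; the text does not prescribe any of them.

```python
import uuid
from typing import Dict, List

CORRELATION_HEADER = "X-Correlation-ID"  # assumed header name, not from the text
log: List[str] = []                      # stands in for aggregated service logs


def ensure_correlation_id(headers: Dict[str, str]) -> Dict[str, str]:
    """Reuse the incoming correlation id or create one at the entry point."""
    outgoing = dict(headers)
    outgoing.setdefault(CORRELATION_HEADER, uuid.uuid4().hex)
    return outgoing


def service_3(headers: Dict[str, str]) -> str:
    headers = ensure_correlation_id(headers)
    log.append(f"service-3 cid={headers[CORRELATION_HEADER]} handled request")
    return "ok"


def service_2(headers: Dict[str, str]) -> str:
    headers = ensure_correlation_id(headers)
    log.append(f"service-2 cid={headers[CORRELATION_HEADER]} calling service-3")
    return service_3(headers)  # the same id travels downstream


def service_1(headers: Dict[str, str]) -> str:
    headers = ensure_correlation_id(headers)
    log.append(f"service-1 cid={headers[CORRELATION_HEADER]} calling service-2")
    return service_2(headers)


service_1({CORRELATION_HEADER: "123"})
for line in log:
    print(line)
```

Because every log line carries the same id, filtering the aggregated logs for `cid=123` reconstructs the complete request chain across all three services.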
Fig. 2. Using the correlation id
Healthcheck API. With this method, each service provides a health endpoint. All information about the service can be viewed under this endpoint. This includes, for example, the status of connections or the status of the host with regard to storage space. The status page can be called by other microservices or by a load balancer. In the event of any problems, requests are not sent to this service [4].
Audit Logging. User behavior is stored in order to be able to track the steps of the user [5]. Audit logs are used for performance diagnostics and error correction, but can also be used to find security vulnerabilities [6].
Canary Environment. The canary environment method is used to gradually introduce the test system into the productive environment and to find possible errors without system failure. In the first step, the production system is not replaced by the test system. Instead, the test system is introduced in parallel to the production system. As a rule, 5 to 10% of the requests are redirected to the test system, which continues to be analyzed during this time. If the test system works in the production environment, the number of requests is increased step by step. If the test system is fully functional, the former production system is completely replaced by the new system. This procedure has the advantage that you can access the production system at any time if there are
problems with the new system, and the new system can be completely withdrawn again [7].
Analysis of Interfaces. Another important aspect of the analysis is the static analysis of the interfaces of the microservices. Due to the interdependencies of the microservices, it is necessary to identify on which data the microservice to be analyzed depends as well as what it offers to other services. Microservices also offer the option of using different technologies for each individual service. Externally, each microservice acts as a black box. The technologies used in the microservice to be examined must therefore be analyzed in order to select the appropriate tools for testing. Only after these analyses is it possible to implement appropriate mocks and stubs and test the microservice in isolation.
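The canary procedure described above (a small initial share of requests, stepwise increase, complete withdrawal on problems) can be sketched with a simple weighted router. The class and method names as well as the step size are hypothetical; a real setup would typically configure this in a load balancer or service mesh rather than in application code.

```python
import random
from typing import List


class CanaryRouter:
    """Sends a configurable share of requests to the new (canary) system."""

    def __init__(self, canary_share: float, seed: int = 0) -> None:
        self.canary_share = canary_share
        self.rng = random.Random(seed)  # seeded only to make the sketch repeatable

    def route(self) -> str:
        """Pick the target system for one incoming request."""
        return "canary" if self.rng.random() < self.canary_share else "production"

    def promote(self, step: float) -> None:
        """Increase the canary share after the new system has behaved well."""
        self.canary_share = min(1.0, self.canary_share + step)

    def rollback(self) -> None:
        """Withdraw the new system completely; production serves everything."""
        self.canary_share = 0.0


router = CanaryRouter(canary_share=0.05)  # start with 5% of the requests
targets: List[str] = [router.route() for _ in range(10_000)]
print(targets.count("canary") / len(targets))  # close to 0.05

router.promote(step=0.20)  # hypothetical step size
router.rollback()          # on problems, all traffic returns to production
print(router.route())
```

Keeping the share as a single number makes both directions cheap: promotion only raises the value, and a rollback sets it to zero without redeploying anything.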
4
Test Concepts
Test concepts are basically procedures which are suggested by experts (cf. [31]), general literature about testing (cf. [27,28]) or companies (cf. [26]). In these procedures, the authors describe how a microservice and the overall system resulting from the microservices should be tested. One component that appears in many of these process models is the test pyramid by Mike Cohn [31]. This test pyramid (see Fig. 3) was originally constructed for monolithic systems and has been adapted in the published test concepts for use with microservice architectures. The pyramid is separated into three categories and should clarify to what extent a system is to be tested.
The lowest and biggest part of the pyramid contains the unit tests, also called component tests. These tests focus on the smallest possible units of the system, such as methods or functions, and check whether these units work as intended by the developer. The units should have no dependencies, because the test should focus on only one component/unit and should not include any dependent units. This isolation of a unit is achieved with mocks and stubs. Mocks return specific, predefined results for specific method calls, while stubs simulate a whole method or function with restricted functionality. Since such unit tests need to know what dependencies exist and how a unit is built, they fall into the category of white-box tests. An immense advantage of these tests is that they run through within minutes. So it is possible for the developer to run such a unit test after every change within a method or function.
Above the unit tests are the integration tests. These integration tests check whether multiple components of the system communicate as they are supposed to or whether there may be errors in the interfaces. The tests do not need to know anything about the units or the system itself, which makes integration tests black-box tests.
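The isolation of a unit with a stub and with a mock, as described above, might look as follows when using Python's standard `unittest` and `unittest.mock` modules. The classes `PriceService` and `StubRateClient` and the exchange-rate scenario are invented for this illustration.

```python
import unittest
from unittest.mock import Mock


class PriceService:
    """Unit under test; depends on a separate exchange-rate service."""

    def __init__(self, rate_client) -> None:
        self.rate_client = rate_client

    def price_in(self, currency: str, amount_eur: float) -> float:
        rate = self.rate_client.get_rate("EUR", currency)
        return round(amount_eur * rate, 2)


class StubRateClient:
    """Stub: a simplified replacement with restricted functionality."""

    def get_rate(self, source: str, target: str) -> float:
        return 2.0


class PriceServiceTest(unittest.TestCase):
    def test_with_stub(self) -> None:
        service = PriceService(StubRateClient())
        self.assertEqual(service.price_in("USD", 10.0), 20.0)

    def test_with_mock(self) -> None:
        # Mock: returns a predefined result and records how it was called.
        rate_client = Mock()
        rate_client.get_rate.return_value = 1.5
        service = PriceService(rate_client)
        self.assertEqual(service.price_in("USD", 10.0), 15.0)
        rate_client.get_rate.assert_called_once_with("EUR", "USD")


# Run the two tests explicitly instead of relying on a test runner CLI.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PriceServiceTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

In both tests the real exchange-rate service is never contacted, so a failure can only be caused by `PriceService` itself.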
Fig. 3. Test pyramid by Mike Cohn
The last category, and thus the tip of the pyramid, is formed by the UI tests. The UI tests are by far the slowest tests among the three test categories, because it takes the entire system to perform such a test. A UI test, also called an end-to-end test, checks whether the user interface works correctly. Like the integration test, a UI test does not need any information about the system and is thus also a black-box test. Because the tests become slower and require larger parts of the total system the higher they are in the pyramid, the number of tests carried out decreases towards the top.
4.1
Test Concept of Eberhard Wolff
A test concept that is strongly based on the test pyramid by Mike Cohn is the test concept of Eberhard Wolff [25]. While a monolith usually requires only one test pyramid, Wolff recommends using several test pyramids when testing microservices: one pyramid for the entire system and one test pyramid for each implemented microservice (see Fig. 4). The test pyramids of the individual microservices correspond to the original test pyramid by Mike Cohn and are only extended by manual tests and load tests. Manual tests are limited to exploratory tests, which means that they examine problems in specific areas of the system. Manual tests can minimize the risk of errors in new features or test each feature for safety, quality or performance.
Fig. 4. Test pyramids for the entire system
In addition, load tests are used to check how the system behaves under a given load and whether errors can occur under this load. In contrast to the pyramids of the individual microservices, the test pyramid of the overall system differs more clearly from Cohn’s pyramid. Here the pyramid consists of integration tests, UI tests and manual tests. The unit tests are completely eliminated. The reason for this is that the microservices of which the overall system is composed represent the units. These units have already been tested by the individual test pyramids of the microservices and therefore do not need to be tested again in the overall system. The tests that are carried out in the overall system therefore concentrate mainly on the integration of all existing microservices. They serve to ensure that the interaction of the microservices does not cause problems. These statements become clear when you take a look at the flow of the tests in a deployment pipeline (see Fig. 5).
Fig. 5. Deployment-pipelines for the entire system
Every microservice goes through its own tests consisting of commit phase, acceptance tests, capacity tests and exploratory tests. Unit tests are part of the commit phase, UI tests can either appear in the acceptance tests or run in the commit phase, capacity tests can be considered as integration tests and
the exploratory tests are the manual tests. These tests are represented as single strands in the deployment pipeline. After a microservice has completed its four test phases, it goes into a joint integration test. However, according to Wolff, this integration test can lead to complications. If new versions of several microservices were running together in the integration test, there would be no way to determine which microservice was responsible for a possible failure of the test. For this reason, microservices have to enter the integration test separately. If a microservice is undergoing an integration test, other microservices will have to wait for the test to run through. This slows down the deployment of microservices immensely. For this reason, Wolff proposes to shift the focus from the integration tests of the overall system to the integration tests of the individual microservices. Since the integration tests of the individual microservices already check whether they communicate with their dependent microservices as desired by the developer, it is sufficient in the overall system to test whether the individual microservices can reach each other at all. As a result, integration tests of the entire system take less time and deployment into production becomes faster again.
4.2
Test Concept of Sam Newman
Just as Wolff does, Sam Newman builds his concept [29] on Mike Cohn’s test pyramid (see Fig. 6).
Fig. 6. Test pyramid by Newman
The end-to-end tests represent the UI tests, and the unit tests also remain unchanged. However, the integration tests found in Mike Cohn’s test pyramid are replaced by service tests and consumer-driven tests. Service tests check the function of a single microservice. Like the unit tests, they have to be completely isolated in order to prevent any dependent microservices from being included in the test. This isolation is also achieved with mocks and stubs.
In addition to the service tests, there is also the category of consumer-driven tests. Consumer-driven tests focus exclusively on the demands a consumer places on the service. These tests are defined by consumer-driven contracts, which record the requirements of a consumer for a particular service. Interestingly, Newman also recommends largely abandoning end-to-end tests. One reason is that the overhead of performing an end-to-end test is extremely high. This effort is due to the need for a complete system, because there is no isolation in this test. Moreover, these tests also depend heavily on their environment, which makes them non-deterministic. For example, a temporary network error could cause a test to fail that would normally have passed. To counteract this, failed end-to-end tests are often simply carried out again, which adds even more effort to tests that are already expensive.
4.3
Test Concept of Google
The test concept that James Whittaker presents in the book “How Google Tests Software” [26] is not explicitly designed for microservices. However, it has many parallels to the test pyramid of Mike Cohn, which underlies many microservice test concepts. This makes it possible to adapt the Google concept to the testing of microservices. Google uses three categories of tests, which can be carried out manually or automatically, and recommends running as many tests automatically as possible. The three categories are called small, medium and large tests, although according to Google it is not important what the tests are called; it only matters that everyone in the company knows what the terms mean. Small tests are, as the name implies, the smallest tests of the concept. In most cases they are automated and run within a few seconds. In addition, they must be performed in complete isolation in a simulated environment. Their main task is to check whether a piece of code does what it is supposed to do. A medium test focuses on interactions and examines whether a set of “nearest neighbor functions” interoperate as they should. By “nearest neighbor functions” Google means features that call each other or communicate directly with each other. While medium tests check only two or at most a few features, a large test looks at many features at once. The goal of large tests is to ensure that the end product or the tested features fulfill the expectations of the users. To do this, large tests use real user data and go through realistic user scenarios. Since a large part of the system is being tested, running such tests may take several hours or more. As already mentioned, these tests can be compared with those from the test pyramid: a small test is basically a unit test, medium tests are integration tests, and a large test can be compared with a UI test. In contrast to the test pyramid, however, Google presents these tests as a trapezoid (see Fig. 7).
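To make the notion of a small test more tangible, the following is a minimal sketch in Python. All class names (`PriceService`, `FakePriceService`, `Cart`) are hypothetical and not taken from the book; the point is only that the code under test is exercised in complete isolation, with its neighbor replaced by an in-memory fake, so the test runs in milliseconds.

```python
import unittest


class PriceService:
    """Hypothetical collaborator that would normally sit behind a network call."""

    def price_of(self, item: str) -> float:
        raise NotImplementedError("the real implementation calls another service")


class FakePriceService(PriceService):
    """In-memory stand-in used to isolate the code under test."""

    def __init__(self, prices):
        self._prices = dict(prices)

    def price_of(self, item: str) -> float:
        return self._prices[item]


class Cart:
    """Code under test: sums item prices obtained from a price service."""

    def __init__(self, price_service: PriceService):
        self._price_service = price_service
        self._items = []

    def add(self, item: str) -> None:
        self._items.append(item)

    def total(self) -> float:
        return sum(self._price_service.price_of(i) for i in self._items)


class SmallCartTest(unittest.TestCase):
    """Small tests: no network, no real neighbor, only the fake."""

    def test_total_sums_item_prices(self):
        cart = Cart(FakePriceService({"book": 10.0, "pen": 2.5}))
        cart.add("book")
        cart.add("pen")
        self.assertEqual(cart.total(), 12.5)

    def test_empty_cart_costs_nothing(self):
        self.assertEqual(Cart(FakePriceService({})).total(), 0)
```

A medium test in this sense would wire `Cart` to a real neighboring price service instead of the fake, and a large test would drive a complete user scenario through the deployed system.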
Analysis of Microservice
Fig. 7. Example of a test trapezoid
The trapezoid makes clear which kinds of tests should be present and to what extent. The further down a test type is in the trapezoid, the more tests of this type should be written. These statements are, in turn, the same as for the test pyramid.
4.4 Test Concept of Netflix
A test concept that deals more with tests that run after the microservices have gone into production is that of Netflix. Its goal is to optimize the Mean Time to Recovery (MTTR) of microservices in order to ensure the best possible availability and reliability of the system. To achieve this goal, Netflix has developed the so-called “Simian Army”. The Simian Army consists of several types of bots, which are called monkeys. A Netflix developer describes the basic idea behind these monkeys in a blog post [30] as follows: “Imagine getting a flat tire. Even if you have a spare tire in your trunk, do you know if it is inflated? Do you have the tools to change it? And, most importantly, do you remember how to do it right? One way to make sure you can deal with a flat tire on the freeway, in the rain, in the middle of the night is to poke a hole in your tire once a week in your driveway on a Sunday afternoon and go through the drill of replacing it. This is expensive and time-consuming in the real world, but can be (almost) free and automated in the cloud.” It should therefore be possible for one component of the system to fail without affecting the availability of the entire system. This is where the monkeys come into play. They are started in a monitored environment and are intended to reveal possible weaknesses of the system. In this way, developers can build automated recovery mechanisms that respond optimally in an emergency. In total, there are eight different monkeys that manipulate or monitor the system in a variety of ways.
C. von Perbandt et al.
– The Chaos Monkey is one of the first tools Netflix developed in the context of this concept. It randomly terminates instances or services within the architecture to test how tolerantly the system responds to such failures. In the best case, the components work so independently that eliminating individual components has no effect on the others.
– Comparable to the Chaos Monkey is the Chaos Gorilla. While the Chaos Monkey only terminates individual instances or services, the Chaos Gorilla takes out a complete Amazon Availability Zone. The goal is to determine whether the functionality of the system is automatically moved to another Availability Zone or whether manual intervention is required.
– The Latency Monkey simulates performance degradation by inducing artificial delays. With sufficiently long delays it can also simulate the complete failure of an instance. In this way, it can be tested whether new services would work even without their dependencies.
The remaining five monkeys are used more for checking the system than for actively manipulating it.
– The Conformity Monkey checks whether the running services conform to the best practices of Netflix. If it finds an instance that does not conform, it shuts the instance down and informs the service owner, who then has the opportunity to adjust the service accordingly.
– The Security Monkey behaves similarly, since it also checks the running services. However, it does not check whether certain requirements are met, but whether any weaknesses or security vulnerabilities exist within the instances. Affected instances are also shut down.
– The Doctor Monkey reviews the health checks that run in each instance. It also keeps track of indicators such as CPU usage. Again, the service owner is notified if there are any inconsistencies in their instances.
– The Janitor Monkey finds garbage, unused resources and redundant instances and then eliminates them.
– The eighth and last monkey is the 10-18 Monkey. It detects configuration and runtime issues of instances serving customers in different regions. 10-18 stands for localization-internationalization, or simply l10n-i18n.
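The idea behind the Chaos Monkey can be illustrated with a toy experiment. The sketch below is not Netflix code and uses no cloud API; the `Service` class and the `chaos_monkey` function are hypothetical names chosen for illustration. It terminates one randomly chosen instance and then checks that the service as a whole still answers.

```python
import random


class Service:
    """Toy service backed by several interchangeable instances (illustrative only)."""

    def __init__(self, name, instance_count):
        self.name = name
        # Maps instance id -> True if the instance is running.
        self.alive = {f"{name}-{i}": True for i in range(instance_count)}

    def handle_request(self):
        """Serve from any running instance; fail only if none is left."""
        for instance, up in self.alive.items():
            if up:
                return f"served by {instance}"
        raise RuntimeError(f"{self.name} is unavailable")


def chaos_monkey(service, rng):
    """Terminate one randomly chosen running instance; return its id or None."""
    running = [i for i, up in service.alive.items() if up]
    if not running:
        return None
    victim = rng.choice(running)
    service.alive[victim] = False
    return victim


rng = random.Random(42)  # fixed seed so the experiment is repeatable
catalog = Service("catalog", instance_count=3)

killed = chaos_monkey(catalog, rng)
response = catalog.handle_request()  # still succeeds: two instances remain
```

If the request after the termination failed, the experiment would have exposed a weakness: the service depends on a single instance and needs redundancy or an automated recovery mechanism.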
5 Tools for Analysis and Testing
In the previous sections, methods for testing and analyzing microservices were presented. This section deals with tools that support testers and analysts. It introduces commercial tools as well as tools released under open source licenses. It is striking that many of the tools are open source and can therefore be used free of charge.
5.1 Tools for Isolated Testing
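Isolated testing replaces neighboring microservices with stand-ins that return predefined responses. As a rough, tool-independent illustration of that idea, the sketch below starts a stub HTTP endpoint using only the Python standard library. The path `/customers/42`, the payload and the function `fetch_customer_name` are invented for this example and do not correspond to any particular tool discussed in this section.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Predefined ("canned") responses per path; path and payload are made up.
CANNED = {"/customers/42": {"id": 42, "name": "Alice"}}


class StubHandler(BaseHTTPRequestHandler):
    """Answers GET requests with the predefined response for the path."""

    def do_GET(self):
        body = CANNED.get(self.path)
        self.send_response(200 if body is not None else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(json.dumps(body if body is not None else {}).encode())

    def log_message(self, *args):
        pass  # keep test output quiet


def start_stub():
    """Start the stub on a free local port and return the server object."""
    server = HTTPServer(("127.0.0.1", 0), StubHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def fetch_customer_name(host, port, customer_id):
    """Code under test: calls the (stubbed) customer service."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", f"/customers/{customer_id}")
        return json.loads(conn.getresponse().read())["name"]
    finally:
        conn.close()


server = start_stub()
name = fetch_customer_name("127.0.0.1", server.server_port, 42)
server.shutdown()
server.server_close()
```

The microservice under test never notices that it is talking to a stub, which is what makes the test independent of the availability of its neighbors.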
Due to the dependencies on other microservices, it is necessary to fake the interfaces to those microservices during isolated testing. This involves faking the data expected from other microservices as well as checking the correctness of the data that the microservice under test produces for consumption by other microservices.
SoapUI. SoapUI is a REST and SOAP testing tool. Among other functions such as load testing and API monitoring, SoapUI allows you to define predefined responses for any endpoint. The format of the responses can be specified, as well as the port and address at which the response is expected. When the microservice accesses this endpoint, the predefined response is returned. In principle, it is possible to mock all dependencies on other microservices with SoapUI. SoapUI is a freemium tool that provides basic functionality free of charge and offers additional services for a fee [8,9].
Restito. Restito is a tool for writing and running automated tests [10]. It is released under an open source license and can be used free of charge. With this tool, the responses of the microservice can be checked for correctness. The responses are stored in an object and compared for equality with a predefined object. The point is to verify that the response of the microservice under test contains the required information correctly and completely, so that it can be processed by the consuming microservice without errors [11].
5.2 Analysis
This section covers tools that can be used during operation to analyze both the microservice application and its hosts and to provide metrics.
Zipkin. Zipkin [12] is a tool for analyzing requests. It can be used with Spring Cloud Sleuth and analyzes both the time a request takes and the individual stations a request passes through. This makes it possible to identify performance problems and to analyze the request chain, i.e. which microservice has sent a request to which other microservice. Detailed information can be retrieved for each request. Zipkin is open source and can be downloaded as a .jar file. If Spring Cloud Sleuth is implemented in the application, no further configuration is required; only the Zipkin server has to be started. Figure 8 shows a Zipkin analysis. On the left side, the services that have been called are displayed in the order in which they were called. A span represents a method call of a service and indicates how long the call lasted. A trace is a collection of spans. With this view you can evaluate the duration of the individual requests. In addition, Zipkin offers another view, shown in Fig. 9, in which the dependencies of the individual microservices on each other are displayed graphically. In Fig. 9, service a sends a request to service b and service e. Service b forwards the request to service c and service d. It is also possible to enter a time span for which the dependencies are to be displayed. In this example, the dependencies from 10-08-2017 at 22:31 are displayed.
Fig. 8. Using the correlation id [13]
Fig. 9. Using the correlation id [14]
Elasticsearch, Logstash, Kibana. Elasticsearch, Logstash and Kibana are three tools that are often used together and are therefore also known as the ELK Stack. Logstash is a log pipeline that can retrieve and forward data from multiple sources. Elasticsearch is a search engine that can filter data, and Kibana is a visualization tool that visualizes the data collected via Logstash. Since the tool Beats, a log shipper, was integrated into the stack, the ELK Stack has been renamed Elastic Stack [15]. With the help of the Elastic Stack, log files can be visualized and evaluated. It is thus possible to display graphically how many errors occurred at which point in time [1].
Graphite. Graphite is a tool for collecting and graphically displaying performance data [16]. With Graphite, real-time metrics can be queried and visualized. To save storage space, recent data can be stored at short intervals and older data at coarser intervals. Filters can be used to retrieve metrics either for the entire system or only for individual services [1].
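The space-saving idea described for Graphite, fine resolution for recent data and coarse resolution for old data, can be sketched in a few lines. The function below is only an illustration of the principle and is not Graphite's storage code; the name `downsample`, its parameters and the sample values are assumptions made for this example.

```python
from collections import defaultdict


def downsample(points, now, recent_window, coarse_step):
    """Keep recent points as they are; average older points into coarse buckets.

    points: list of (timestamp, value) pairs, timestamps in seconds.
    recent_window: age in seconds below which points are kept unchanged.
    coarse_step: bucket width in seconds for older points.
    """
    recent = [(t, v) for t, v in points if now - t < recent_window]
    buckets = defaultdict(list)
    for t, v in points:
        if now - t >= recent_window:
            # Align the timestamp to the start of its coarse bucket.
            buckets[t - t % coarse_step].append(v)
    old = [(start, sum(vs) / len(vs)) for start, vs in sorted(buckets.items())]
    return old + sorted(recent)


# One sample every 10 s for two minutes; the values are made-up measurements.
samples = [(t, float(t)) for t in range(0, 120, 10)]
compact = downsample(samples, now=120, recent_window=60, coarse_step=30)
```

In this run the seven samples older than 60 s are reduced to three averaged points, while the five most recent samples keep their 10 s resolution.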
Nagios. Nagios [17] is a tool for monitoring network activity and hosts. Nagios can be used to obtain metrics such as CPU and memory usage as well as log files from hosts. Furthermore, Nagios can be used to perform pseudo-transactions (compare semantic monitoring) [1]. The monitoring tool Prometheus is also gaining popularity.
5.3 Netflix
As a pioneer of microservices, Netflix is known above all for its stable and fail-safe system. Netflix also provides many of its tools to the general public as open source. A selection is presented in this section.
Hystrix. Hystrix is a popular tool from Netflix that implements the circuit breaker pattern. (Note that Hystrix is still fully available as open source but is now in maintenance mode; Resilience4j provides similar functionality.) The pattern ensures that the application is not slowed down by failed requests [18]. Figure 10 explains the circuit breaker pattern. As shown in Fig. 10, service A communicates with service B, and the request is forwarded from service B to service C. The request from service B to service C fails or raises an error. The circuit breaker is then tripped to prevent the request from being executed again and again. If service A now sends a request to service B that service B would have to forward to service C, the forwarding does not happen; instead, service B lets the request fail immediately (see step 2). Subsequently, service B sends a test request to service C to check whether service C responds correctly again (see step 3). If a correct answer is received from service C, the circuit breaker is reset and requests are forwarded to service C again. If the test request fails, service B continues to fail requests immediately and sends a new test request to service C at a later time to check whether service C is responding correctly again (see step 4) [20]. Hystrix not only implements the circuit breaker pattern but also offers a dashboard with a graphical representation. The dashboard, shown in Fig. 11, can be used to evaluate the status of services for analysis purposes. Important information can be read from it: you can see at a glance whether the circuit breaker is open or closed. In addition, the number of successful requests, thread timeouts and thread pool rejections can be read off. Hystrix also displays the error percentage of the last 10 s. With this information, it is possible to react early to failures and identify potential for improvement.
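The circuit breaker steps described above can be sketched as a small state machine. The following Python class is a simplified, single-threaded illustration and not the Hystrix API; the class name `CircuitBreaker`, its parameters `failure_threshold` and `reset_timeout`, and the injected `clock` are assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected without contacting the remote service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("failing fast")  # step 2
            # Timeout elapsed: let one test request through (step 3).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open or stay open (step 4)
            raise
        self.failures = 0
        self.opened_at = None  # successful call resets the breaker
        return result


# Demonstration with a fake clock so that no real waiting is needed.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])


def service_c_down():
    raise ConnectionError("service C unreachable")


def service_c_up():
    return "ok"


events = []
for _ in range(2):
    try:
        breaker.call(service_c_down)
    except ConnectionError:
        events.append("error")
try:
    breaker.call(service_c_up)  # rejected: the breaker is open
except CircuitOpenError:
    events.append("rejected")
now[0] = 11.0  # the reset timeout has elapsed
events.append(breaker.call(service_c_up))  # test request succeeds
```

A production-grade implementation would additionally need thread safety, metrics such as those shown on the Hystrix dashboard, and fallbacks for rejected calls.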
Fig. 10. Circuit breaker pattern [19]
Eureka. With Eureka, Netflix provides an open source discovery service for the registration of microservices. The advantage of this solution is that any number of microservices and instances of microservices can be registered and found. Developers therefore do not have to hard-code the addresses of the microservices, which allows services to be scaled without changing the source code. A load balancer asks Eureka for a specific service and in return receives the address of the service it is looking for. Eureka also provides a graphical interface to analyze the system. Figure 12 shows an excerpt from the Eureka dashboard. As can be seen in the excerpt, all registered instances are displayed in Eureka. Each service has its own row in the table. The left column lists the names of the services. The Status column shows the number of instances and the addresses of the respective instances. If another instance of a service is started and registers with Eureka, it immediately appears in the table. This also enables analysis and error detection. If, for example, 10 instances of a service should be running and fewer than 10 are displayed, the addresses can be used to identify exactly which instance or which service is not running.
Fig. 11. Hystrix dashboard [21]
Fig. 12. Excerpt from Eureka dashboard [22]
Netflix Monkeys. As already described in Sect. 4.4, “Test Concept of Netflix”, Netflix uses the so-called Simian Army to test its system. Parts of the Simian Army are made available as open source projects. However, the project will not be developed further. The official wiki contains documentation on Chaos Monkey, Janitor Monkey and Conformity Monkey [23]. Chaos Monkey remains available as a stand-alone application, while Janitor Monkey has been replaced by the Swabbie service. Conformity Monkey is now part of the Spinnaker backend services [24].
6 Conclusion
All in all, microservices, when used correctly, offer some advantages over classic monolithic concepts, but the development, testing, and operation of microservices-based systems can pose considerable challenges. By means of examples, we have shown that existing process models such as the test pyramid can be adapted to the challenges of microservices. Developers therefore do not have to learn completely new procedures for testing and analysis, but can build on the test procedures they already use in most cases. There are many tools that support the analysis and testing of microservices, mainly through the evaluation of log files or the analysis of latencies and dependencies. It is noticeable that many tools support analysis at runtime rather than at design time. Another striking feature is that most tools are released under open source licenses and can therefore be used free of charge. Netflix, amongst others, provides some of its tools as free open source software. Using a small microservices development case study, as well as examples of testing strategies, we could show the advantages but also some drawbacks of microservices-based system development, including for intelligent systems. Given the small size of the case study, we could only begin to assess the practicability of microservices-based software development for intelligent systems. Thus, several more practical applications need to be examined in future work.
References
1. Newman, S.: Microservices Konzeption und Design, 1st edn., pp. 205–220. MITP Verlags GmbH & Co. KG (2015)
2. Newman, S.: Microservices Konzeption und Design, 1st edn., pp. 202–203. MITP Verlags GmbH & Co. KG (2015)
3. Richardson, C.: https://microservices.io/patterns/observability/distributed-tracing.html. Accessed 19 Jan 2019
4. Richardson, C.: https://microservices.io/patterns/observability/health-check-api.html. Accessed 17 Jan 2019
5. Richardson, C.: https://microservices.io/patterns/observability/audit-logging.html. Accessed 17 Jan 2019
6. Berman, D.: https://logz.io/blog/audit-logs-security-compliance/. Accessed 18 Jan 2019
7. Newman, S.: Microservices Konzeption und Design, 1st edn., pp. 198–200. MITP Verlags GmbH & Co. KG (2015)
8. https://www.soapui.org. Accessed 15 Jan 2019
9. https://javacodehouse.com/blog/mock-rest-apis-soapui/. Accessed 16 Jan 2019
10. https://github.com/mkotsur/restito. Accessed 16 Jan 2019
11. Glowinski, R.: https://allegro.tech/2014/11/testing-restful-service-and-clients.html. Accessed 16 Jan 2019
12. Zipkin Home. https://zipkin.io. Accessed 17 Jan 2019
13. Soundcloud Developers Blogs. https://developers.soundcloud.com/blog/using-kubernetes-pod-metadata-to-improve-zipkin-traces. Accessed 17 Jan 2019
14. https://github.com/aio-libs/aiozipkin. Accessed 18 Jan 2019
15. Elastic Home. https://www.elastic.co/de/elk-stack. Accessed 19 Jan 2019
16. Graphite Home. https://graphiteapp.org. Accessed 19 Jan 2019
17. Nagios Home. https://www.nagios.com. Accessed 19 Jan 2019
18. Hystrix. https://github.com/Netflix/Hystrix. Accessed 20 Jan 2019
19. According to S. Newman, Microservices Konzeption und Design, 1st edn., p. 268. MITP Verlags GmbH & Co. KG (2015)
20. Newman, S.: Microservices Konzeption und Design, 1st edn., pp. 266–268. MITP Verlags GmbH & Co. KG (2015)
21. Christensen, B.: https://medium.com/netflix-techblog/hystrix-dashboard-turbine-stream-aggregator-60985a2e51df. Accessed 20 Jan 2019
22. http://callistaenterprise.se/blogg/teknik/2015/04/10/building-microservices-with-spring-cloud-and-netflix-oss-part-1/. Accessed 20 Jan 2019
23. SimianArmy Wiki. https://github.com/Netflix/SimianArmy/wiki. Accessed 20 Jan 2019
24. SimianArmy. https://github.com/Netflix/SimianArmy. Accessed 20 Jan 2019
25. Wolff, E.: Microservices: Grundlage flexibler Softwarearchitekturen, 1st edn. dpunkt.verlag GmbH (2016). (ISBN:978-3-86490-313-7)
26. Whittaker, J., Arbon, J., Carollo, J.: How Google Tests Software, 1st edn. Addison-Wesley Professional (2012). (ISBN:978-0-32180-302-3)
27. Spillner, A., Linz, T.: Basiswissen Softwaretest, 4th edn. dpunkt.verlag GmbH (2010). (ISBN:978-3-89864-642-0)
28. Gnoyke, H.: Tests erst in Produktion? - Was wir von Tests bei Microservices lernen können. https://www.informatik-aktuell.de/entwicklung/methoden/tests-erst-in-produktion-was-wir-von-tests-bei-microservices-lernen-koennen.html
29. Newman, S.: Microservices: Konzept und Design, 1st edn., pp. 177–204. MITP Verlags GmbH & Co KG (2015). (ISBN:978-3-95845-081-3)
30. Medium.com: The Netflix Simian Army. https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116
31. Cohn, M.: Agile Estimating and Planning, 1st edn. Prentice Hall (2005). (ISBN: 978-0-1347-941-8)
32. Martin, R.: Agile Software Development, Principles, Patterns, and Practices, 1st edn. Pearson (2002). (ISBN: 978-0-3597-444-5)
An Analysis with Dynamics Between Human Motivation and Messaging on Social Networking Services
Hidehiro Matsumoto and Akira Ishii
Tottori University, 4-101 Koyama-minami, Tottori-City, Tottori 680-8552, Japan
[email protected]
Abstract. Some Social Networking Services (SNS) transmit personal opinions to society through messages that arise from our psychological motivation. By analyzing the messaging mechanism of SNS, one can expect to efficiently control and manage the transmission of rumors, advertisements, public relations, and word-of-mouth. Many previous studies in the fields of information networks and complex networks have analyzed the transmission and diffusion of information by assuming that SNS members are passive nodes. However, studies on networks consisting of passive nodes do not consider the intention and motivation of actual SNS members. In this paper, we observe and analyze the transmission and diffusion of information that occurs on real SNS by means of a mathematical model in which each active node has intention and motivation, like a real SNS member.
Keywords: Motivation · Social Networking Service · SNS · Opinion dynamics · Simulation
1 Introduction
1.1 Background and Purpose
The purpose of our research is to provide a way to clarify “why people communicate with each other by sending and receiving messages to SNS members on the network without actually meeting each other.” Previous research has explained the transmission and spread of information on the premise that communication on SNS takes place, without giving a direct and clear answer to this question. In addition, some studies treat network dynamics with the building blocks of an SNS (its individual members) as passive nodes; because they do not consider the intentions and motivation of members, they cannot fully explain the excessive transmission and propagation of information on SNS. Moreover, with a passive-node mathematical model it is difficult to find the factors that change the intentions and motivations of members, in contrast to dynamic network mechanisms such as public relations and advertisements that intentionally spread information [8].
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 876–893, 2022. https://doi.org/10.1007/978-3-030-82193-7_59
To overcome these limitations of passive nodes, we have extended the mathematical model to active nodes that carry the intent and motivation to perform messaging. In this paper, we gradually adapt passive nodes to the behavior of SNS members. In other words, we introduce active elements such as motivation, reliability of information, and reliability (trust) of human relationships into each node. This “personalization” of each node reproduces the spread of information that occurs on actual SNS.
1.2 Structure of This Paper
We first present the problems of passive nodes in previous studies and show that the difference between this study and previous ones is that our nodes are active. We then introduce the mathematical model we use to solve these problems and “personalize” each node step by step. Finally, we analyze the calculation results and describe points of discussion and future developments that follow from the analysis.
2 Issues of Previous Studies and Our Approach
2.1 Issues of Previous Studies
Human motivation for communication has been addressed in the field of behavioral psychology, but in a form unsuitable for quantitative analysis. In the field of information technology, it has been difficult to analyze messaging with time-variant probabilistic models, transfer function models, energy transfer models, and dispersion models. The percolation and decision network models, which allow quantitative analysis, are based on complex network theory and focus on the network configuration [4, 5, 7]. These models describe messaging as the transfer and diffusion of information over passive nodes and the links between them (see Table 1). The problem with these studies is that they treat real social networking relationships as passive nodes. Passive nodes rest on the assumption that messages can already be sent; this does not explain why messaging occurs.
2.2 Our Approach
To solve the problems with passive nodes, we replace them with active nodes that have messaging intent and motivation. Suppose a node is a member of an SNS. The node determines how to convey information to another node based on the reliability of the source, the reliability of the information, and the content. The intent and motivation of the node also influence this judgment, and the content of the transmitted information depends on them. In this way, active nodes can reflect intent and motivation and thus form a more realistic network. In addition, by using elements within the node, such as intention and motivation, as control factors, the model can be applied to many social problems on SNS (rumors, slander, discrimination and fake news). We also expect applications in advertising, publicity, marketing and public information.
Table 1. A comparison between mathematical models [4, 5, 7]
Model | Pros | Cons
Time-variant probabilistic model | Easy to create a model for diffusion, transfer and convergence | Difficult to analyze motivation
Transfer function model | Captures the transfer characteristics of each message well | Difficult to specify each message
Dispersion model | Captures the transfer characteristics of each message well | Not easy to determine a specific message from general transfer characteristics
Percolation model | Captures macro transfer characteristics well | Not easy to analyze explosive expansions
Decision network model | Captures the messaging process well | More complex to analyze the factors of each decision
Energy transfer model | Better description of motivation factors and time characteristics | Does not yield useful results for the analysis of individual messages
3 The Mechanism of Our Messaging Model
3.1 Event-Driven Basis
Actual SNS have many functions, but we introduce an abstract, simple structure and mechanism in order to focus on messaging. A node represents a member of the SNS and is able to communicate with other nodes on the SNS. A node can also receive event information, either via other nodes or directly from the source of the event. Figure 1 shows a single node (a) with some links and the information of an event E1. Figure 2 shows two nodes, (a) and (b), connected by a link.
Fig. 1. Single node (a)
Figure 3 describes an event E. E has an “open term” from t = t1 to t = t2 and a “stock term” from t = t2 onward.
Fig. 2. Node (a) has n links and trust level tr. Node (b) receives effects both from node (a) and from event E1
Fig. 3. Event E on SNS with open and stock terms. Node (c) has links from E and node (a); node (d) has links from node (a) and node (b), where node (b) has already received the information of E and node (a)
The “open term” of E1 is the period in which information providers keep the event information available on the SNS. The “stock term” of E1 is the period in which the event information has been closed by the information providers but users can still observe it from network caches and stored copies. These terms are useful for distinguishing the effects of information providers on the reliability of contents and event information.
3.2 Messaging and Motivation
Some previous studies [2, 3] presented motivation for messaging as a decision table such as Table 2 [6].
Table 2. Example of decision by motivations
Type | Motivation of Node (a) | Motivation of Node (b) | Link level (ρ)
1 | (+) Positive | (+) Positive | +1.0
2 | (−) Negative | (−) Negative | −1.0
3 | (+) Positive | (−) Negative | +0.5
4 | (−) Negative | (+) Positive | −0.5
For example, if Node (a) has positive information and a positive impression regarding Node (b) while Node (b) has negative ones, Table 2 assigns a Type 3 link between the two nodes. This way of deciding has two problems. First, the links describe only relations or connections; the motivation for messaging between nodes is not explained. Second, to analyze the motivation between two nodes, positive and negative information must be separated from the impression of the human relation. To solve these problems, we introduce a model that treats information and impression (reliability and trust) separately:
$$I(t) = \bigl(q(t),\ r(t),\ tr(t)\bigr) \tag{1}$$
where I(t) denotes the message that a node (a member of the SNS) holds at time t, received from another node or from the information source of an event; q(t) represents the quantity of modification and editing, with positive or negative opinion; r(t) is the reliability of the received message; and tr(t) is the trust level of the message source. At time t, a node composes a message to others, driven by its motivation, its impression of other members, and its will, by updating the components of (1) as follows:
$$q(t+1) = f\bigl(q(t),\ M_q(t)\bigr) \tag{2}$$
$$r(t+1) = g\bigl(r(t),\ M_r(t)\bigr) \tag{3}$$
$$tr(t+1) = h\bigl(tr(t),\ M_{tr}(t)\bigr) \tag{4}$$
where the functions f(·), g(·), and h(·) are called the modification and editing functions for q(t), r(t), and tr(t), and depend on the motivation factors M_q(t), M_r(t) and M_tr(t). The motivation factor M_q(t) is a scale factor indicating the quantity of modification relative to the original message I(t). M_r(t) and M_tr(t) are the corresponding factors for information reliability and trust.
3.3 Messaging Strategy
A node may receive messages from several other nodes at the same time, as in Fig. 3. We assume that the message to be used is selected by the following strategy:
1. Each node has individual threshold ranges for M_q(t), M_r(t) and M_tr(t).
2. The selected message is the one with the highest values of M_q(t), M_r(t) and M_tr(t).
3. Even when motivation and intention are high, a message might not be sent, depending on the reliability of the received information and/or the level of trust in the human relation. In this case, no message is sent.
In the following simulations, we analyze the messaging mechanism arising from personal motivation under this strategy.
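The update rules (2)-(4) and the third item of the messaging strategy can be sketched in code. The paper does not state the concrete form of f, g and h, so the sketch below assumes, purely for illustration, that each is a multiplication by the corresponding motivation factor, with q capped at an upper limit. The names `Message`, `step`, `will_send` and `simulate`, the threshold parameters, and the chain-of-identical-nodes setup are likewise assumptions and not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    q: float   # quantity of modification and editing
    r: float   # reliability of the message
    tr: float  # trust level of the message source


def step(msg, m_q, m_r, m_tr, q_max=50_000.0):
    """One update of Eqs. (2)-(4), ASSUMING multiplicative f, g and h."""
    return Message(
        q=min(msg.q * m_q, q_max),  # assumed f, capped at q_max
        r=msg.r * m_r,              # assumed g
        tr=msg.tr * m_tr,           # assumed h
    )


def will_send(msg, r_threshold, tr_threshold):
    """Assumed reading of strategy item 3: send only if reliability and trust suffice."""
    return msg.r >= r_threshold and msg.tr >= tr_threshold


def simulate(initial, m_q, m_r, m_tr, steps, r_threshold=0.0, tr_threshold=0.0):
    """Pass a message along a chain of identical nodes until it stops or steps run out."""
    history = [initial]
    for _ in range(steps):
        nxt = step(history[-1], m_q, m_r, m_tr)
        if not will_send(nxt, r_threshold, tr_threshold):
            break  # the node decides not to forward the message
        history.append(nxt)
    return history


start = Message(q=1.0, r=1.0, tr=1.0)
flat = simulate(start, m_q=1.0, m_r=1.0, m_tr=1.0, steps=50)
growing = simulate(start, m_q=1.2, m_r=1.0, m_tr=1.0, steps=50)
decaying = simulate(start, m_q=1.0, m_r=0.5, m_tr=0.8, steps=50, r_threshold=0.1)
```

Under these assumptions, factors of 1.0 leave the message unchanged, a factor M_q above 1.0 makes q grow geometrically, and factors M_r and M_tr below 1.0 make reliability and trust decay until the threshold stops the transmission.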
4 Simulations
4.1 Initial Conditions
The following calculations require a number of initial conditions for each node and for the initial event. The values and switch flags of the initial conditions are as follows:
1. st: the number of iterations (steps); usually st = 50.
2. I(0): the initial information; I(0) consists of q(0), r(0) and tr(0).
3. m(0): the modification factors; m(0) consists of M_q(0), M_r(0) and M_tr(0).
4. q_max: the upper limit of q(t); usually q_max = 50,000.
5. m_max(0): the maximum sizes of M_q(0), M_r(0) and M_tr(0). These factors are effective when the random switch for M_q(t), M_r(t) or M_tr(t) is ‘ON’.
6. m_th(0): the thresholds of M_q(0), M_r(0) and M_tr(0).
7. random_sw: the switch flags for M_q(t), M_r(t), M_tr(t) and for the thresholds of M_q(t), M_r(t) and M_tr(t). These switches personalize each node without positive and negative reaction.
8. Mq_pn_sw: the switch flag for positive and negative reaction.
4.2 Validation Test
We validated the calculation against a reference model and trivial conditions (see Fig. 4 and Fig. 5).
Fig. 4. No editing and no modification, as a reference model of the simulation. All motivation factors are flat.
Fig. 5. M_q shows an exponential increase. This condition indicates the diffusion of information to all nodes (M_q(t) = 1.2).
In Fig. 5, we set M_q(t) = 1.2 over 50 steps with 100% reliability and 100% trust level. This clearly corresponds to information diffusing with heavy modification, as with gossip news. We thus confirmed the validity of our simulation.
4.3 Increments of Modification, M_q(t)
To observe the dynamics of motivation, we set M_q(t) = 2.0 with 100% reliability and 100% trust level (see Fig. 6). In Fig. 6, the speed of diffusion is higher than under the condition of Fig. 5. As expected, this means that the size of the modification is one of the factors determining diffusion speed.
Fig. 6. Diffusion is more rapid than Fig. 5 (M_q(t) = 2.0)
4.4 Variable Reliability Factor, M_r(t), and Trust Level, M_tr(t)
We show the effect of a variable reliability factor and trust level, with M_r(t) = 50% and M_tr(t) = 80% (see Fig. 7).
Fig. 7. M_r(t) = 50%, M_tr(t) = 80%
Both M_r(t) and M_tr(t) decrease exponentially. This means that the original information coming from the initial event accumulates many modifications, with low reliability and trust, during transmission. Information in this situation is no longer meaningful.
4.5 Personalization (Level 1): Random Variation of M_q(t), M_r(t) and M_tr(t)
The above calculations assume that the nodes in an SNS do not act independently, as they would in real communication; the result resembles telecommunication between network devices subject to errors and faults. We therefore introduce additional characteristics into the simulation. One addition is that each node acts independently, like a human, and M_q(t) is one of the factors related to this behavior. We first set a random variation of M_q(t) with constant M_r(t) and M_tr(t) and observe information diffusion under this personal behavior (see Figs. 8, 9, and 10). Comparing Figs. 8, 9 and 10, M_q(t) decreases rapidly and its maximum values vary randomly, with 100% reliability and 100% trust level. This suggests that news and event information may not spread rapidly among SNS members. Figure 11 shows the case of random M_q(t) and M_r(t). Here the diffusion of the original information terminates more rapidly than with random M_q(t) only.
H. Matsumoto and A. Ishii
Fig. 8. M_q(t) varies randomly with M_q(0) = 1.2.
Fig. 9. M_q(t) varies randomly with M_q(0) = 1.2.
Fig. 10. M_q(t) varies randomly with M_q(0) = 1.2, the same value as in Fig. 5
Fig. 11. M_q(t) and M_r(t) vary randomly with M_q(0) = 1.2
Figs. 12 through 14 show sample node behaviors with random M_q(t), M_r(t) and M_tr(t). These samples show the following:
1. Diffusion of the original information terminates more rapidly than with constant M_q(t), M_r(t) and M_tr(t) (Fig. 5).
2. Early termination yields a smaller quantity of information derived from the original event than with constant M_q(t).
Fig. 12. M_q(t), M_r(t) and M_tr(t) vary randomly with M_q(0) = 1.2
Fig. 13. M_q(t), M_r(t) and M_tr(t) vary randomly with M_q(0) = 1.2
Fig. 14. M_q(t), M_r(t) and M_tr(t) vary randomly with M_q(0) = 1.2
Points (1) and (2) imply that M_q(t) must be kept at 1.0 to obtain wider diffusion. They also imply that, in a real SNS, some members need to keep M_q(t) = 1.0, acting as influencers or spokespersons.

4.6 Personalization (Level 2): Random Variation of Thresholds for Motivation of Each Node

Other variations in motivation are based on the impression of contents and on human relations, through thresholds for M_q(t), M_r(t) and M_tr(t). These variations are also a factor of each personality. We show dissemination with a random threshold level for each node. The samples in Fig. 15 and Fig. 16 are based on the impression of contents
Fig. 15. A sample: thresholds of M_q(t) vary randomly with M_q(0) = 2.0 and M_r(t) = 0.5
Fig. 16. Thresholds of M_q(t) vary randomly with M_q(0) = 2.0 and M_r(t) = 0.5
Fig. 17. Thresholds of M_q(t) and M_r(t) vary randomly with M_q(0) = 2.0 and M_r(0) = 0.5
and the samples in Fig. 15 and Fig. 16 are based on factors of both contents and human relations. These results indicate that diffusion terminates sooner than when each node has fixed thresholds, even with M_q(t) = 2.0 (see Fig. 6). Information with M_q(t) = 2.0 spreads easily in Fig. 6; however, the thresholds act as a speed brake whose strength depends on the threshold of each node. This resembles how a member's personality shapes behavior toward other people. The difference between Fig. 15 and Fig. 16 also shows how sensitive the outcome is to differences in personality when the original information spreads (Fig. 6).
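The "speed brake" effect of per-node thresholds can be pictured with a small sketch. It assumes, for illustration only, that a node forwards a message when the incoming M_q is at least that node's threshold and that diffusion stops at the first node that refuses. The gating rule, the function name `hops_until_blocked` and the uniform threshold range are assumptions of this sketch rather than the rule used in the simulations reported here.

```python
import random


def hops_until_blocked(thresholds, m_q):
    """Count how many consecutive nodes forward a message.

    Assumption (not from the paper): a node forwards only when m_q is
    at least its own threshold; the first refusal ends the diffusion.
    """
    hops = 0
    for threshold in thresholds:
        if m_q < threshold:
            break
        hops += 1
    return hops


rng = random.Random(0)
fixed = [1.0] * 50                                        # same threshold everywhere
randomized = [rng.uniform(0.0, 2.5) for _ in range(50)]   # hypothetical range

print(hops_until_blocked(fixed, m_q=2.0))       # 50: every threshold is below 2.0
print(hops_until_blocked(randomized, m_q=2.0))  # stops at the first threshold above 2.0, if any
```

With fixed thresholds the chain either passes the message everywhere or nowhere; with random thresholds a single strict node is enough to end the diffusion early, which is the intuition behind the shorter runs described above.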
4.7 Personalization (Level 3): Random Variation of Both M_q(t) and Thresholds

Fig. 18 and Fig. 19 show results when each node has random variation in both M_q(t) and its thresholds for contents and human relations. Both samples show that transmission and communication to others are suppressed further. These results suggest that each node sometimes acts as a gatekeeper for messages to others in the SNS.
Fig. 18. Thresholds of M_q(t) and M_r(t) vary randomly with M_q(0) = 2.0 and M_r(0) = 1.0
Fig. 19. Thresholds of M_q(t) and M_r(t) vary randomly with M_q(0) = 2.0 and M_r(0) = 1.0
4.8 Personalization (Level 4): Random Variation of Both M_q(t) and Thresholds with P/N Opinions

We show sample results in which each node holds a positive or negative opinion of the received information, with fixed M_q(t) and thresholds (see Figs. 20, 21, 22 and 23). This situation is closer to human decision making. The positive or negative (P/N) opinion is given by the polarity of M_q(t), which is selected from −1.0 < M_q < 1.0 according to the standard normal distribution. Note that this range includes M_q(t) = 0, which means that the node stops messaging and sends nothing to others [1]. With P/N opinion, M_q(t) oscillates and/or attenuates in Fig. 20, Fig. 21 and Fig. 23. If the polarity of M_q(t) stays the same, diffusion proceeds; otherwise it converges rapidly.
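One possible reading of "selected from −1.0 < M_q < 1.0 in the standard normal distribution" is a standard normal draw restricted to the open interval (−1, 1). The sketch below implements that reading by rejection sampling; the choice of rejection sampling and the helper names `sample_pn_opinion` and `polarity` are assumptions made for illustration, not details taken from the paper.

```python
import random


def sample_pn_opinion(rng):
    """Draw M_q from a standard normal restricted to (-1.0, 1.0).

    Assumption: values outside the interval are simply redrawn
    (rejection sampling); the paper does not state the procedure.
    """
    while True:
        value = rng.gauss(0.0, 1.0)
        if -1.0 < value < 1.0:
            return value


def polarity(m_q):
    """Return +1 for a positive opinion, -1 for negative, 0 for no messaging."""
    return (m_q > 0) - (m_q < 0)


rng = random.Random(0)
opinions = [sample_pn_opinion(rng) for _ in range(1000)]
signs = [polarity(m_q) for m_q in opinions]
```

About 68% of standard normal draws already fall inside (−1, 1), so the loop rarely needs more than a couple of attempts per sample.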
Fig. 20. M_q(t) = 1.2, random is OFF, M_r(0) = 0.5, M_r(0)_max = 0.8, M_tr(0) = 0.5, M_tr(0)_max = 0.8. P/N switch on M_q(t) is ON (a sample of the results).
Fig. 21. M_q(t) = 1.2, random is OFF, M_r(0) = 0.5, M_r(0)_max = 0.8, M_tr(0) = 0.5, M_tr(0)_max = 0.8. P/N switch on M_q(t) is ON (a sample of the results).
Fig. 22. M_q(t) = 1.2, random is OFF, M_r(0) = 0.5, M_r(0)_max = 0.8, M_tr(0) = 0.5, M_tr(0)_max = 0.8. P/N switch on M_q(t) is ON (a sample of the results).
Fig. 23. M_q(t) = 1.2, random is OFF, M_r(0) = 0.5, M_r(0)_max = 0.8, M_tr(0) = 0.5, M_tr(0)_max = 0.8. P/N switch on M_q(t) is ON
Fig. 24 shows the calculation closest to human messaging, with P/N opinion, random M_q(t), and random thresholds for M_q(t), M_r(t) and M_tr(t). A result like that of Fig. 24 is very difficult to obtain because all the random factors are active simultaneously in each node. Most runs terminate rapidly, and in many calculations no diffusion occurs (Fig. 24).
Fig. 24. M_q(0) = 1.2, M_r(0) = 0.5, M_r(0)_max = 0.8, M_tr(0) = 0.5, M_tr(0)_max = 0.8, M_q, M_r and M_tr are random. The thresholds of M_q(t), M_r(t) and M_tr(t) are random. P/N switch on M_q(t) is ON.
5 Discussions and Future Works

We are considering several extensions to the messaging model with active nodes [1, 9].

1. Our messaging model has been applied to a random network structure, but it can be extended to more realistic SNS structures such as scale-free networks. Furthermore, as with the collective opinion formation studied in opinion dynamics theory, it may be possible to analyze dynamics such as agreement and disagreement with a given message through the formation of clusters on an SNS.
2. Our model assumes initial values, maximum sizes, and thresholds for M_q(t), M_r(t), and M_tr(t). These factors (motivations) change depending on the memories and desires arising from personal experience. By adding factors that map such histories and desires to the “personalization” of each node, we aim to bring our messaging model closer to a real SNS.
3. We are working to identify the factors in our model by collecting actual SNS messages and their transmission conditions and comparing them with the results of our computational simulations.
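To make the first extension concrete, the sketch below contrasts two standard network generators using only the Python standard library. Treating the "random network structure" as an Erdős–Rényi G(n, p) graph and the scale-free alternative as Barabási–Albert preferential attachment is an assumption of this sketch; the paper does not specify either construction, and the function names are hypothetical.

```python
import random


def random_network(n, p, rng):
    """Erdos-Renyi G(n, p): link each pair of nodes with probability p."""
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p}


def scale_free_network(n, m, rng):
    """Barabasi-Albert style growth.

    Starts from a clique of m + 1 nodes; each new node links to m distinct
    existing nodes chosen with probability proportional to their degree.
    """
    edges = {(i, j) for i in range(m + 1) for j in range(i + 1, m + 1)}
    # Every node appears in `stubs` once per incident edge, so a uniform
    # choice from this list is a degree-proportional choice of node.
    stubs = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for target in targets:
            edges.add((target, new))
            stubs += [target, new]
    return edges


def degrees(edges, n):
    """Return the degree of each of the n nodes."""
    deg = [0] * n
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


rng = random.Random(0)
er_edges = random_network(100, 0.04, rng)
ba_edges = scale_free_network(100, 2, rng)
```

Comparing `max(degrees(er_edges, 100))` with `max(degrees(ba_edges, 100))` is a quick way to see whether hub nodes emerge, which is the structural feature that would matter for influencer-like members in a diffusion model.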
6 Conclusions

We proposed a mathematical model with active nodes that have motivations such as the novelty and authenticity of message content and the reliability of human relationships on an SNS. By changing the motivational factors in our model, we observed and confirmed the information transmission expected in a real SNS.
References

1. Ishii, A.: Opinion dynamics theory considering trust and suspicion in human relations. In: Morais, D.C., Carreras, A., de Almeida, A.T., Vetschera, R. (eds.) GDN 2019. LNBIP, vol. 351, pp. 193–204. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-21711-2_15
2. Ishii, A., Yomura, I., Okano, N.: Opinion dynamics including both trust and distrust in human relation for various network structure. In: Proceedings of TAAI (2020)
3. Nowak, M.A., Sasaki, A., et al.: Emergence of cooperation and evolutionary stability in finite populations. Nature 428, 646–650 (2004)
4. Moran, P.A.P.: The Statistical Processes of Evolutionary Theory. Clarendon Press, Oxford (1962)
5. Yoshino, Y., Masuda, N.: Evolution of cooperation is a robust outcome in the prisoner’s dilemma on dynamic networks. In: 7th Network Symposium 2011 (2011)
6. Tanaka, M., Murakami, T.: Co-evolution of strategies and players’ network in the reputation based prisoners’ dilemma (RPD). Meeting abstracts of the Phys. Soc. Jpn. 64(2–2), 207 (2009)
7. Pacheco, J.M., Traulsen, A., Nowak, M.A.: Coevolution of strategy and structure in complex networks with dynamical linking. Phys. Rev. Lett. 97(25), 258103 (2007). https://doi.org/10.1103/PhysRevLett.97.258103
8. Ishii, A., Okano, N.: Sociophysics approach of simulation of mass media effects in society using new opinion dynamics. In: Arai, K., Kapoor, S., Bhatia, R. (eds.) IntelliSys. AISC, vol. 1252, pp. 13–28. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-55190-2_2
9. Easley, D., Kleinberg, J.: Networks, Crowds, and Markets: Reasoning About a Highly Connected World, pp. 563–687. Cambridge University Press, New York (2010)
Author Index
A Aburukba, Raafat, 556 Adams, Julie A., 573 Adda, Mo, 805 Ahmed, Khaled R., 168 Albegov, Zaurbek Valerievich, 739 Alqithami, Saad, 513 Amad, Mourad, 12 Arco, Leticia, 682 Astrova, Irina, 857 B Baier, Lucas, 837 Barbosa, Ricardo, 375 Barnes, Laura E., 31 Borsi, Zsolt Richárd, 129 Boudries, Abdelmalek, 12 Boukhechba, Mehdi, 31 Bouzidi, Zair, 12 Bozkurt, Ahmet Fevzi, 100 C Cai, Lihua, 31 Chin, Cheng Siong, 1 Collina, Simona, 817 Coulter, Duncan, 464 Cullinan, Michéle, 464 D De Simone, Flavia, 817 Deng, Yu, 248 Dessalles, Jean-Louis, 719 Diaconescu, Ada, 719 Dols, James, 593 Dushkin, R. V., 778
E Ehrhardt, Daniel, 148 Elliott, Clark, 526 Emiola, Iyanuoluwa, 499 Erkan, Kadir, 100 Escobar, Jesús Jaime Moreno, 791 F Fang, Zong Rui Dexter, 700 Fasfous, Nael, 148 Fatima, Arooj, 309 Ferjani, Imen, 72 Flores, Anibal, 330 Frickenstein, Alexander, 148 Frickenstein, Lukas, 148 Frihida, Ali, 72 Fuentes, Ivett, 682 Fujii, Akihiro, 51 Fujii, Makoto, 409 G Gegov, Alex, 630, 805 Goodrich, Michael A., 573 Grabaskas, Nathaniel, 425 Gregorics, Tibor, 129 Gupta, Yash, 85 H Haddad, Malik, 613, 622, 630, 640, 805 Haghbayan, Mohammad-Hashem, 391 Hao, Yuexing, 543 Hidri, Minyar Sassi, 72 Hino, Takanori, 344 Holmes, Bruce J., 355 Homaifar, Abdollah, 355
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 K. Arai (Ed.): IntelliSys 2021, LNNS 294, pp. 895–897, 2022. https://doi.org/10.1007/978-3-030-82193-7
Houzé, Étienne, 719 Huang, Lei, 261 Huang, Ya, 613 I Ibrahim, Youssef Youssry, 236 Ishii, Akira, 409, 438, 453, 876 J Jacuinde-Alvarez, Daniel, 593 Jannesari, Ali, 115 K Kadampur, Mohammad Ali, 764 Kagawa, Tomomichi, 344 Kamimura, Ryotaro, 184 Kato, Shigeru, 344 Katrompas, Alexander, 217 Khan, Adil Mehood, 236 Ko, Kyungtae, 115 Koschel, Arne, 857 Kühl, Niklas, 837 Kuhn, Sabine, 148 Kulathunga, Nalinda, 261 Kume, Shunsaku, 344 Kumeno, Hironori, 344 L Langenkamp, Wouter H., 668 Langner, Martin, 622, 640 Lei, Hang, 248 Lennartz, W., 291 Leventi-Peetz, A.-M., 291 Li, Longfei, 650 Li, Meng, 650 Li, Ruisi, 205 Li, Xiaoyu, 248 Lifton, Samuel, 630 Lin, Yiou, 248 Lu, Si, 205 Luca, Cristina, 309 M Mahboob, Huma, 391 Maktab-Dar-Oghaz, Mahdi, 309 Matamoros, Oswaldo Morales, 791 Matsumoto, Hidehiro, 876 Menga, David, 719 Mesbah, Yusuf, 236 Metsis, Vangelis, 217 Mungloo-Dilmohamud, Zahra, 318 Munoz-Avila, Hector, 484
N Nagaraja, Naveen-Shankar, 148 Nápoles, Gonzalo, 682 Nateghi, Shamila, 355 Niaraki, Amir, 115 Nishikawa, Masaru, 453 Nobuhara, Hajime, 344 Novais, Paulo, 375 Nuzzo, Manuela, 817 O Okano, Nozomi, 438, 453 Oosterman, Dion T., 668 Östreich, T., 291 P Padilla, Ricardo Tejeida, 791 Páez, Ana Lilia Coria, 791 Pan, Wuming, 280 Pintér, Balázs, 129 Plosila, Juha, 391 Purmah, Balkrishansingh, 318 R Ranasinghe, Nishath, 261 Reifsnyder, Noah, 484 Riyaee, Sulaiman Al, 764 Roghair, Jeremy, 115 S Sala, Ramses, 744 Sanders, David, 613, 622, 630, 640, 805 Santos, Ricardo, 375 Sarkar, Mrinmoy, 355 Sathikh, Peer, 700 Scheutz, Matthias, 573 Schmitt, Jörg, 837 Schumann, Mathieu, 719 Seenundun, Viswanathsingh, 318 Semwal, Vijay Bhaskar, 85 Shi, Qitao, 650 Shota, Fujioka, 344 Simandjuntak, Sarinova, 640 Stechele, Walter, 148 Szalontai, Balázs, 129 T Tan, Guan Yi, 700 Tayeb, Shahab, 593 Tewkesbury, Giles, 613, 622, 630, 640, 805 Tito-Chura, Hugo, 330 Tuskaeva, Zalina Ruslanovna, 739 Tyca, Matthias, 857
U Unger, Christian, 148 V Vadász, András, 129 Vamvoudakis, Kyriakos G., 355 van Bergen, Ellen L., 668 Vanhoof, Koen, 682 Várkonyi, Teréz A., 129 Vatchova, Boriana, 613 Vaysiberg, Mark, 543 Vemparala, Manoj-Rohit, 148 Ventura, Michele Della, 823 von Perbandt, Charline, 857 Vrinceanu, Daniel, 261 W Wahaishi, AbdulMutalib, 556 Wamambo, Tinashe, 309 Wang, Yunjiao, 261 Wang, Zhizhen, 425
Weber, K., 291 Wu, Yuankai, 148 X Xiang, Kun, 51 Y Yalçın, Barış Can, 100 Yamamoto, Hitoshi, 453 Yan, Xuyang, 355 Yana-Mamani, Victor, 330 Yang, Xinxing, 650 Yasin, Jawad N., 391 Yasin, Muhammad Mehboob, 391 Z Zhang, Jianhua, 1 Zhang, Ya-Lin, 650 Zhao, Qi, 148 Zhou, Jun, 650